FWIW, I've throw up an alternative strawman PR [1].
Ed
[1] https://github.com/apache/parquet-format/pull/440
On 6/25/24 11:56 AM, Ed Seidl wrote:
I've now tested three implemenations (parquet-java,
pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity,
I'll refer to the ColumnMetaData struct following the chunk data as
IMD (inline metadata), and the copy in the footer as FMD (footer
metadata).
parquet-java, as reported, does not write IMD at all. file_offset
points to the location of the first page in the chunk. If FMD is
missing, the reader throws a NPE.
pyarrow/parquet-cpp writes IMD and FMD, and sets file_offset, but the
location that file_offset points to is 4 bytes of something that
precedes the actual thrift serialized struct. When accounting for the
added 4 bytes, the IMD can be deserialized and looks to be correct.
When reading a file that lacks FMD, no exception is thrown, but the
resulting table is empty.
arrow-rs writes IMD and FMD, and set file_offset correctly. The IMD
has incorrect values for dictionary_page_offset and
data_page_offset...they are relative to the start of the chunk, rather
than the start of the file. For a file lacking FMD, the reader throws
an exception indicating that the meta_data field is not populated.
So that makes 3 implementations for which file_offset/IMD is not
terribly useful. Actually 4, because I know cudf behaves like
parquet-java.
So if the consensus is to not change requiredness, then I do think
then the spec and thrift need to use very strong language to the
effect that file_offset is deprecated and should be ignored and will
be set to 0 if IDL is not present, and that meta_data, while optional
in the IDL, is in fact required by every major implementation. The
diagram in the README should also be updated to indicate that the IMD
is almost never correct and should be ignored by readers. Perhaps it
can be removed from the diagram altogether.
Cheers,
Ed
On 6/25/24 7:29 AM, Ed Seidl wrote:
The issue I have is that we're currently in a position where a file
written to the letter of the specification will likely be readable by
none of the major parquet implementations. (I'm going to test this
hypothesis today). If the ColumnMetaData in the footer is de facto
required, then I think we should at a minimum change the thrift to
make it so. Similarly, the reference implementation (parquet-java)
currently does not write the required metadata, and sets file_offset
to an invalid (but valid seeming) value. If we don't change the
requiredness of file_offset, then either parquet-java needs to start
writing the metadata inline with the chunk data and set file_offset
correctly, or, as I've proposed elsewhere[1], simply write 0 for the
required field, with the understanding that this means the metadata
is not present (and modify the wording in the spec to make this
approach valid).
So, to my mind, the goal isn't to avoid confusion, it's to have the
specification match current reality.
Regards,
Ed
[1] https://github.com/apache/parquet-java/pull/1369
On 6/25/24 6:33 AM, Andrew Lamb wrote:
If the goal of this exercise is to avoid confusion, I agree with Michah
that updating parquet.thrift is best. Here [3] is a PR to update the
thrift
file to clarify that the field is not written by all writers and is not
read by many.
In my opinion any backwards incompatible changes do nothing other risk
making parquet files less compatible with the ecosystem
While removing the field is a technically more elegant solution
(would make
code cleaner), it could only cause potential incompatibilities for
users. I
prefer to have more complex code but a better user experience.
BTW the Rust parquet writer sets the file_offset field[1] but does not
appear to use it on read. Instead it assumes column_metadata is
present[2]
Andrew
[1]:
https://github.com/apache/arrow-rs/blob/b3f06f6cc4d4f4431a1f86cfc9f30de3a1fc1a1b/parquet/src/column/writer/mod.rs#L904-L907
[2]:
https://github.com/apache/arrow-rs/blob/ed018a34d996590544fe5e833cb601bf46e9758e/parquet/src/file/metadata.rs#L673-L672
[3]: https://github.com/apache/parquet-format/pull/439
On Tue, Jun 25, 2024 at 4:40 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:
We need a mechanism to remove fields. Typically this would involve
some
time horizon.
I suggest we establish a deprecation horizon now, say 3y, and start
the
clocks ticking. Plus some convention for marking deprecated fields
because
the thrift IDL lacks a way to do this in code. I propose the
annotation `//
DEPRECATED-EOL-20270421` followed by a description on what happens
in the
interim. For example for this field:
```
/** Byte offset in file_path to the ColumnMetaData **/
// DEPRECATED-EOL-20270421
// Before 20240625 this field was required. Since then it was made
optional. New writers MUST write it util EOL to support old readers.
2: optional i64 file_offset
```
On Tue, Jun 25, 2024 at 9:23 AM Micah Kornfield
<emkornfi...@gmail.com>
wrote:
but I'm not clear on how that will
impact existing parsers.
This can break older parsers, that validate required fields are in
fact
present. I think it would be best to just update documentation on
the
current state of affairs, and let implementations update
accordingly if
necessary.
On Mon, Jun 24, 2024 at 3:21 PM Ed Seidl <etse...@live.com> wrote:
Resurrecting a thread from earlier in the month regarding
inconsistent
use of the file_offset field [1][2]. It seems like the preferred
path
forward is to deprecate this (AFAICT) unused field to prevent
further
confusion. If there are no violent objections, I'll submit a PR
to do
so
in a few days.
One question I have, though, is how to handle the requiredness of
the
file_offset (currently required) and meta_data (currently optional)
fields in ColumnChunk. I'd prefer to switch them, and make
file_offset
optional and meta_data required, but I'm not clear on how that will
impact existing parsers. I believe most (all) implementations ignore
file_offset anyway, and expect meta_data to be present, so maybe
this
is
a non-issue.
Thanks,
Ed
[1] https://lists.apache.org/thread/q5r43ks61q4wcbvwsk1jyw4h30fvg68t
[2] https://github.com/apache/parquet-java/pull/1369