https://github.com/apache/parquet-format/pull/440 has a few approvals already. I think unless there are objections we can aim to merge mid next week (I'm not sure if this warrants a vote as it captures the reality as it stands today)?
Thanks, Micah On Tue, Jun 25, 2024 at 12:54 PM Ed Seidl <etse...@live.com> wrote: > FWIW, I've throw up an alternative strawman PR [1]. > > Ed > > [1] https://github.com/apache/parquet-format/pull/440 > > On 6/25/24 11:56 AM, Ed Seidl wrote: > > I've now tested three implemenations (parquet-java, > > pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity, > > I'll refer to the ColumnMetaData struct following the chunk data as > > IMD (inline metadata), and the copy in the footer as FMD (footer > > metadata). > > > > parquet-java, as reported, does not write IMD at all. file_offset > > points to the location of the first page in the chunk. If FMD is > > missing, the reader throws a NPE. > > > > pyarrow/parquet-cpp writes IMD and FMD, and sets file_offset, but the > > location that file_offset points to is 4 bytes of something that > > precedes the actual thrift serialized struct. When accounting for the > > added 4 bytes, the IMD can be deserialized and looks to be correct. > > When reading a file that lacks FMD, no exception is thrown, but the > > resulting table is empty. > > > > arrow-rs writes IMD and FMD, and set file_offset correctly. The IMD > > has incorrect values for dictionary_page_offset and > > data_page_offset...they are relative to the start of the chunk, rather > > than the start of the file. For a file lacking FMD, the reader throws > > an exception indicating that the meta_data field is not populated. > > > > So that makes 3 implementations for which file_offset/IMD is not > > terribly useful. Actually 4, because I know cudf behaves like > > parquet-java. > > > > So if the consensus is to not change requiredness, then I do think > > then the spec and thrift need to use very strong language to the > > effect that file_offset is deprecated and should be ignored and will > > be set to 0 if IDL is not present, and that meta_data, while optional > > in the IDL, is in fact required by every major implementation. The > > diagram in the README should also be updated to indicate that the IMD > > is almost never correct and should be ignored by readers. Perhaps it > > can be removed from the diagram altogether. > > > > Cheers, > > Ed > > > > On 6/25/24 7:29 AM, Ed Seidl wrote: > >> The issue I have is that we're currently in a position where a file > >> written to the letter of the specification will likely be readable by > >> none of the major parquet implementations. (I'm going to test this > >> hypothesis today). If the ColumnMetaData in the footer is de facto > >> required, then I think we should at a minimum change the thrift to > >> make it so. Similarly, the reference implementation (parquet-java) > >> currently does not write the required metadata, and sets file_offset > >> to an invalid (but valid seeming) value. If we don't change the > >> requiredness of file_offset, then either parquet-java needs to start > >> writing the metadata inline with the chunk data and set file_offset > >> correctly, or, as I've proposed elsewhere[1], simply write 0 for the > >> required field, with the understanding that this means the metadata > >> is not present (and modify the wording in the spec to make this > >> approach valid). > >> > >> So, to my mind, the goal isn't to avoid confusion, it's to have the > >> specification match current reality. > >> > >> Regards, > >> Ed > >> > >> [1] https://github.com/apache/parquet-java/pull/1369 > >> > >> On 6/25/24 6:33 AM, Andrew Lamb wrote: > >>> If the goal of this exercise is to avoid confusion, I agree with Michah > >>> that updating parquet.thrift is best. Here [3] is a PR to update the > >>> thrift > >>> file to clarify that the field is not written by all writers and is not > >>> read by many. > >>> > >>> In my opinion any backwards incompatible changes do nothing other risk > >>> making parquet files less compatible with the ecosystem > >>> > >>> While removing the field is a technically more elegant solution > >>> (would make > >>> code cleaner), it could only cause potential incompatibilities for > >>> users. I > >>> prefer to have more complex code but a better user experience. > >>> > >>> BTW the Rust parquet writer sets the file_offset field[1] but does not > >>> appear to use it on read. Instead it assumes column_metadata is > >>> present[2] > >>> > >>> Andrew > >>> > >>> [1]: > >>> > https://github.com/apache/arrow-rs/blob/b3f06f6cc4d4f4431a1f86cfc9f30de3a1fc1a1b/parquet/src/column/writer/mod.rs#L904-L907 > >>> > >>> [2]: > >>> > https://github.com/apache/arrow-rs/blob/ed018a34d996590544fe5e833cb601bf46e9758e/parquet/src/file/metadata.rs#L673-L672 > >>> > >>> [3]: https://github.com/apache/parquet-format/pull/439 > >>> > >>> > >>> On Tue, Jun 25, 2024 at 4:40 AM Alkis Evlogimenos > >>> <alkis.evlogime...@databricks.com.invalid> wrote: > >>> > >>>> We need a mechanism to remove fields. Typically this would involve > >>>> some > >>>> time horizon. > >>>> > >>>> I suggest we establish a deprecation horizon now, say 3y, and start > >>>> the > >>>> clocks ticking. Plus some convention for marking deprecated fields > >>>> because > >>>> the thrift IDL lacks a way to do this in code. I propose the > >>>> annotation `// > >>>> DEPRECATED-EOL-20270421` followed by a description on what happens > >>>> in the > >>>> interim. For example for this field: > >>>> > >>>> ``` > >>>> /** Byte offset in file_path to the ColumnMetaData **/ > >>>> // DEPRECATED-EOL-20270421 > >>>> // Before 20240625 this field was required. Since then it was made > >>>> optional. New writers MUST write it util EOL to support old readers. > >>>> 2: optional i64 file_offset > >>>> ``` > >>>> > >>>> > >>>> On Tue, Jun 25, 2024 at 9:23 AM Micah Kornfield > >>>> <emkornfi...@gmail.com> > >>>> wrote: > >>>> > >>>>>> but I'm not clear on how that will > >>>>>> impact existing parsers. > >>>>> > >>>>> This can break older parsers, that validate required fields are in > >>>>> fact > >>>>> present. I think it would be best to just update documentation on > >>>>> the > >>>>> current state of affairs, and let implementations update > >>>>> accordingly if > >>>>> necessary. > >>>>> > >>>>> On Mon, Jun 24, 2024 at 3:21 PM Ed Seidl <etse...@live.com> wrote: > >>>>> > >>>>>> Resurrecting a thread from earlier in the month regarding > >>>>>> inconsistent > >>>>>> use of the file_offset field [1][2]. It seems like the preferred > >>>>>> path > >>>>>> forward is to deprecate this (AFAICT) unused field to prevent > >>>>>> further > >>>>>> confusion. If there are no violent objections, I'll submit a PR > >>>>>> to do > >>>> so > >>>>>> in a few days. > >>>>>> > >>>>>> One question I have, though, is how to handle the requiredness of > >>>>>> the > >>>>>> file_offset (currently required) and meta_data (currently optional) > >>>>>> fields in ColumnChunk. I'd prefer to switch them, and make > >>>>>> file_offset > >>>>>> optional and meta_data required, but I'm not clear on how that will > >>>>>> impact existing parsers. I believe most (all) implementations ignore > >>>>>> file_offset anyway, and expect meta_data to be present, so maybe > >>>>>> this > >>>> is > >>>>>> a non-issue. > >>>>>> > >>>>>> Thanks, > >>>>>> Ed > >>>>>> > >>>>>> [1] > https://lists.apache.org/thread/q5r43ks61q4wcbvwsk1jyw4h30fvg68t > >>>>>> [2] https://github.com/apache/parquet-java/pull/1369 > >>>>>> > >> > > > >