https://github.com/apache/parquet-format/pull/440 has a few approvals
already.  I think unless there are objections we can aim to merge mid next
week (I'm not sure if this warrants  a vote as it captures the reality as
it stands today)?

Thanks,
Micah

On Tue, Jun 25, 2024 at 12:54 PM Ed Seidl <etse...@live.com> wrote:

> FWIW, I've throw up an alternative strawman PR [1].
>
> Ed
>
> [1] https://github.com/apache/parquet-format/pull/440
>
> On 6/25/24 11:56 AM, Ed Seidl wrote:
> > I've now tested three implemenations (parquet-java,
> > pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity,
> > I'll refer to the ColumnMetaData struct following the chunk data as
> > IMD (inline metadata), and the copy in the footer as FMD (footer
> > metadata).
> >
> > parquet-java, as reported, does not write IMD at all. file_offset
> > points to the location of the first page in the chunk. If FMD is
> > missing, the reader throws a NPE.
> >
> > pyarrow/parquet-cpp writes IMD and FMD, and sets file_offset, but the
> > location that file_offset points to is 4 bytes of something that
> > precedes the actual thrift serialized struct. When accounting for the
> > added 4 bytes, the IMD can be deserialized and looks to be correct.
> > When reading a file that lacks FMD, no exception is thrown, but the
> > resulting table is empty.
> >
> > arrow-rs writes IMD and FMD, and set file_offset correctly.  The IMD
> > has incorrect values for dictionary_page_offset and
> > data_page_offset...they are relative to the start of the chunk, rather
> > than the start of the file. For a file lacking FMD, the reader throws
> > an exception indicating that the meta_data field is not populated.
> >
> > So that makes 3 implementations for which file_offset/IMD is not
> > terribly useful.  Actually 4, because I know cudf behaves like
> > parquet-java.
> >
> > So if the consensus is to not change requiredness, then I do think
> > then the spec and thrift need to use very strong language to the
> > effect that file_offset is deprecated and should be ignored and will
> > be set to 0 if IDL is not present, and that meta_data, while optional
> > in the IDL, is in fact required by every major implementation. The
> > diagram in the README should also be updated to indicate that the IMD
> > is almost never correct and should be ignored by readers. Perhaps it
> > can be removed from the diagram altogether.
> >
> > Cheers,
> > Ed
> >
> > On 6/25/24 7:29 AM, Ed Seidl wrote:
> >> The issue I have is that we're currently in a position where a file
> >> written to the letter of the specification will likely be readable by
> >> none of the major parquet implementations. (I'm going to test this
> >> hypothesis today). If the ColumnMetaData in the footer is de facto
> >> required, then I think we should at a minimum change the thrift to
> >> make it so. Similarly, the reference implementation (parquet-java)
> >> currently does not write the required metadata, and sets file_offset
> >> to an invalid (but valid seeming) value. If we don't change the
> >> requiredness of file_offset, then either parquet-java needs to start
> >> writing the metadata inline with the chunk data and set file_offset
> >> correctly, or, as I've proposed elsewhere[1], simply write 0 for the
> >> required field, with the understanding that this means the metadata
> >> is not present (and modify the wording in the spec to make this
> >> approach valid).
> >>
> >> So, to my mind, the goal isn't to avoid confusion, it's to have the
> >> specification match current reality.
> >>
> >> Regards,
> >> Ed
> >>
> >> [1] https://github.com/apache/parquet-java/pull/1369
> >>
> >> On 6/25/24 6:33 AM, Andrew Lamb wrote:
> >>> If the goal of this exercise is to avoid confusion, I agree with Michah
> >>> that updating parquet.thrift is best. Here [3] is a PR to update the
> >>> thrift
> >>> file to clarify that the field is not written by all writers and is not
> >>> read by many.
> >>>
> >>> In my opinion any backwards incompatible changes do nothing other risk
> >>> making parquet files less compatible with the ecosystem
> >>>
> >>> While removing the field is a technically more elegant solution
> >>> (would make
> >>> code cleaner), it could only cause potential incompatibilities for
> >>> users. I
> >>> prefer to have more complex code but a better user experience.
> >>>
> >>> BTW the Rust parquet writer sets the file_offset field[1] but does not
> >>> appear to use it on read. Instead it assumes column_metadata is
> >>> present[2]
> >>>
> >>> Andrew
> >>>
> >>> [1]:
> >>>
> https://github.com/apache/arrow-rs/blob/b3f06f6cc4d4f4431a1f86cfc9f30de3a1fc1a1b/parquet/src/column/writer/mod.rs#L904-L907
> >>>
> >>> [2]:
> >>>
> https://github.com/apache/arrow-rs/blob/ed018a34d996590544fe5e833cb601bf46e9758e/parquet/src/file/metadata.rs#L673-L672
> >>>
> >>> [3]: https://github.com/apache/parquet-format/pull/439
> >>>
> >>>
> >>> On Tue, Jun 25, 2024 at 4:40 AM Alkis Evlogimenos
> >>> <alkis.evlogime...@databricks.com.invalid> wrote:
> >>>
> >>>> We need a mechanism to remove fields. Typically this would involve
> >>>> some
> >>>> time horizon.
> >>>>
> >>>> I suggest we establish a deprecation horizon now, say 3y, and start
> >>>> the
> >>>> clocks ticking. Plus some convention for marking deprecated fields
> >>>> because
> >>>> the thrift IDL lacks a way to do this in code. I propose the
> >>>> annotation `//
> >>>> DEPRECATED-EOL-20270421` followed by a description on what happens
> >>>> in the
> >>>> interim. For example for this field:
> >>>>
> >>>> ```
> >>>>    /** Byte offset in file_path to the ColumnMetaData **/
> >>>>    // DEPRECATED-EOL-20270421
> >>>>    // Before 20240625 this field was required. Since then it was made
> >>>> optional. New writers MUST write it util EOL to support old readers.
> >>>>    2: optional i64 file_offset
> >>>> ```
> >>>>
> >>>>
> >>>> On Tue, Jun 25, 2024 at 9:23 AM Micah Kornfield
> >>>> <emkornfi...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>>> but I'm not clear on how that will
> >>>>>> impact existing parsers.
> >>>>>
> >>>>> This can break older parsers, that validate required fields are in
> >>>>> fact
> >>>>> present.  I think it would be best to just update documentation on
> >>>>> the
> >>>>> current state of affairs, and let implementations update
> >>>>> accordingly if
> >>>>> necessary.
> >>>>>
> >>>>> On Mon, Jun 24, 2024 at 3:21 PM Ed Seidl <etse...@live.com> wrote:
> >>>>>
> >>>>>> Resurrecting a thread from earlier in the month regarding
> >>>>>> inconsistent
> >>>>>> use of the file_offset field [1][2]. It seems like the preferred
> >>>>>> path
> >>>>>> forward is to deprecate this (AFAICT) unused field to prevent
> >>>>>> further
> >>>>>> confusion. If there are no violent objections, I'll submit a PR
> >>>>>> to do
> >>>> so
> >>>>>> in a few days.
> >>>>>>
> >>>>>> One question I have, though, is how to handle the requiredness of
> >>>>>> the
> >>>>>> file_offset (currently required) and meta_data (currently optional)
> >>>>>> fields in ColumnChunk. I'd prefer to switch them, and make
> >>>>>> file_offset
> >>>>>> optional and meta_data required, but I'm not clear on how that will
> >>>>>> impact existing parsers. I believe most (all) implementations ignore
> >>>>>> file_offset anyway, and expect meta_data to be present, so maybe
> >>>>>> this
> >>>> is
> >>>>>> a non-issue.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ed
> >>>>>>
> >>>>>> [1]
> https://lists.apache.org/thread/q5r43ks61q4wcbvwsk1jyw4h30fvg68t
> >>>>>> [2] https://github.com/apache/parquet-java/pull/1369
> >>>>>>
> >>
> >
>
>

Reply via email to