Hi Antoine,
Thanks for the great points.  Responses inline.

> I like your attempt to put the "new" file metadata after the legacy
> one in https://github.com/apache/parquet-format/pull/250, and I hope it
> can actually be made to work (it requires current Parquet readers to
> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).


Thanks (I hope so too).  I think the idea is originally from Alkis.  If it
doesn't work then there is always an option of doing a little more involved
process of making the footer look like an unknown binary field (an approach
I know you have objections to).

I'm biased, but I find it much cleaner to define new Thrift
>   structures (FileMetadataV3, etc.), rather than painstakinly document
>   which fields are to be omitted in V3. That would achieve three goals:
>   1) make the spec easier to read (even though it would be physically
>   longer); 2) make it easier to produce a conformant implementation
>   (special rules increase the risks of misunderstandings and
>   disagreements); 3) allow a later cleanup of the spec once we agree to
>   get rid of V1 structs.

There are trade-offs here.  I agree with the benefits you listed here.  The
benefits of reusing existing structs are:
1. Lowers the amount of boiler plate code mapping from one to the other
(i.e. simpler initial implementation), since I expect it will be a while
before we have standalone PAR3 files.
2. Allows for lower maintenance burden if there is useful new metadata that
we would like to see added to both structures original and "V3" structures.

- The new encoding in that PR seems like it should be moved to a
>   separate PR and be discussed in the encodings thread?


I'll cross post on that thread.  The main reason I included it in my
proposal is I think it provides random access for members out of the box
(as compared to the existing encodings).  I think this mostly goes to your
third-point so I'll discuss below.

- I'm a bit skeptical about moving Thrift lists into data pages, rather
>   than, say, just embed the corresponding Thrift serialization as
>   binary fields for lazy deserialization.

I think this falls into 2 different concerns:
1.  The format of how we serialize metadata.
2.  Where the serialized metadata lives.

For concern #1, I think we should be considering treating these lists as
actual parquet data pages.  This allows users to tune this to their needs
for size vs decoding speed, and make use of any improvements to encoding
that happen in the future without a spec change. I think this is likely
fairly valuable given the number of systems that cache this data.  The
reason I introduced the new encoding was to provide an option that could be
as efficient as possible from a compute perspective.

For concern #2, there is no reason encoding a page as a thrift Binary field
would not work. The main reason I raised putting them outside of thrift is
for greater control on deserialization (the main benefit being avoiding
copies) for implementations that have a Thrift parser that doesn't allow
these optimizations.  In terms of a path forward here, I think
understanding the performance and memory characteristics of each approach.
I agree, if there isn't substantial savings from having them be outside the
page, then it just adds complexity.

Thanks,
Micah





On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello Micah,
>
> First, kudos for doing this!
>
> I like your attempt to put the "new" file metadata after the legacy
> one in https://github.com/apache/parquet-format/pull/250, and I hope it
> can actually be made to work (it requires current Parquet readers to
> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
>
> Some assorted comments on other changes that PR is doing:
>
> - I'm biased, but I find it much cleaner to define new Thrift
>   structures (FileMetadataV3, etc.), rather than painstakinly document
>   which fields are to be omitted in V3. That would achieve three goals:
>   1) make the spec easier to read (even though it would be physically
>   longer); 2) make it easier to produce a conformant implementation
>   (special rules increase the risks of misunderstandings and
>   disagreements); 3) allow a later cleanup of the spec once we agree to
>   get rid of V1 structs.
>
> - The new encoding in that PR seems like it should be moved to a
>   separate PR and be discussed in the encodings thread?
>
> - I'm a bit skeptical about moving Thrift lists into data pages, rather
>   than, say, just embed the corresponding Thrift serialization as
>   binary fields for lazy deserialization.
>
> Regards
>
> Antoine.
>
>
>
> On Mon, 27 May 2024 23:06:37 -0700
> Micah Kornfield <[email protected]>
> wrote:
> > As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread
> on
> > improvements to the footer metadata.
> >
> > Based on conversation so far, there have been a few proposals [3][4][5]
> to
> > help better support files with wide schemas and many row-groups.  I think
> > there are a lot of interesting ideas in each. It would be good to get
> > further feedback on these to make sure we aren't missing anything and
> > define a minimal first iteration for doing experimental benchmarking to
> > prove out an approach.
> >
> > I think the next steps would ideally be:
> > 1.  Come to a consensus on the overall approach.
> > 2.  Prototypes to Benchmark/test to validate the approaches defined (if
> we
> > can't come to consensus in item #1, this might help choose a direction).
> > 3.  Divide up any final approach into as fine-grained features as
> possible.
> > 4.  Implement across parquet-java, parquet-cpp, parquet-rs (and any other
> > implementations that we can get volunteers for).  Additionally, if new
> APIs
> > are needed to make use of the new structure, it would be good to try to
> > prototype against consumers of Parquet.
> >
> > Knowing that we have enough people interested in doing #3 is critical to
> > success, so if you have time to devote, it would be helpful to chime in
> > here (I know some people already noted they could help in the original
> > thread).
> >
> > I think it is likely we will need either an in person sync or another
> more
> > focused design document could help. I am happy to try to facilitate this
> > (once we have a better sense of who wants to be involved and what time
> > zones they are in I can schedule a sync if necessary).
> >
> > Thanks,
> > Micah
> >
> > [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > [2]
> >
> https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > [3] https://github.com/apache/parquet-format/pull/242
> > [4] https://github.com/apache/parquet-format/pull/248
> > [5] https://github.com/apache/parquet-format/pull/250
> >
>
>
>
>

Reply via email to