Hello Micah,

First, kudos for doing this!

I like your attempt to put the "new" file metadata after the legacy
one in https://github.com/apache/parquet-format/pull/250, and I hope it
can actually be made to work (it requires current Parquet readers to
allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
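
To make the padding requirement concrete, here is a minimal Python sketch. It
relies only on the existing footer convention (metadata, then a 4-byte
little-endian length, then the "PAR1" magic) and on the fact that a Thrift
struct ends with a stop byte. Everything else is an assumption for
illustration: the layout is a toy, not the one in the PR, and `build_tail`,
`read_footer` and `toy_decode_v1` are hypothetical names, with
`toy_decode_v1` standing in for a real Thrift decoder.

```python
import struct

MAGIC = b"PAR1"


def build_tail(v1_metadata: bytes, v3_metadata: bytes) -> bytes:
    """Toy layout: [v1 Thrift bytes][v3 bytes][4-byte LE length][magic].

    The length covers both blobs, so a v1 reader sees the v3 bytes as
    trailing padding after its own metadata.
    """
    footer = v1_metadata + v3_metadata
    return footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(file_bytes: bytes) -> bytes:
    """Locate the footer the way a reader does today: from the file's end."""
    if file_bytes[-4:] != MAGIC:
        raise ValueError("missing magic")
    (length,) = struct.unpack("<I", file_bytes[-8:-4])
    return file_bytes[-8 - length:-8]


def toy_decode_v1(footer: bytes, strict: bool = False):
    """Stand-in for a Thrift decoder: consume bytes up to the stop byte (0x00).

    A tolerant reader ignores whatever follows; a strict one rejects it,
    which is the behaviour that would break the approach.
    """
    consumed = footer.index(0) + 1
    if strict and consumed != len(footer):
        raise ValueError("unexpected bytes after v1 metadata")
    return footer[:consumed], consumed


# Toy payloads; the v1 blob must not contain 0x00 before its stop byte.
tail = build_tail(b"legacy-fields\x00", b"V3-extension")
footer = read_footer(b"row-group-data" + tail)
v1_bytes, consumed = toy_decode_v1(footer)
v3_bytes = footer[consumed:]
```

A reader that behaves like `toy_decode_v1(footer, strict=True)` would fail on
such a file, which is why the behaviour of existing readers matters here.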

Some assorted comments on the other changes that PR makes:

- I'm biased, but I find it much cleaner to define new Thrift
  structures (FileMetadataV3, etc.), rather than painstakingly document
  which fields are to be omitted in V3. That would achieve three goals:
  1) make the spec easier to read (even though it would be physically
  longer); 2) make it easier to produce a conformant implementation
  (special rules increase the risks of misunderstandings and
  disagreements); 3) allow a later cleanup of the spec once we agree to
  get rid of V1 structs.

- The new encoding in that PR seems like it should be moved to a
  separate PR and be discussed in the encodings thread?

- I'm a bit skeptical about moving Thrift lists into data pages, rather
  than, say, just embedding the corresponding Thrift serialization as
  binary fields for lazy deserialization.
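
As a rough illustration of the last point, here is a minimal Python sketch of
lazy deserialization from embedded binary fields. It is not taken from any
Parquet implementation: `LazyField` and `decode_row_group` are hypothetical
names, and JSON stands in for the Thrift serialization purely to keep the
example self-contained.

```python
import json
from typing import Any, Callable


class LazyField:
    """Keeps the serialized bytes and decodes them only on first access."""

    def __init__(self, raw: bytes, decode: Callable[[bytes], Any]):
        self.raw = raw
        self._decode = decode
        self._value = None
        self._decoded = False

    def get(self) -> Any:
        if not self._decoded:
            self._value = self._decode(self.raw)
            self._decoded = True
        return self._value


decode_calls = []


def decode_row_group(raw: bytes) -> dict:
    # Stand-in for a Thrift deserializer; records each call so the
    # laziness is observable.
    decode_calls.append(len(raw))
    return json.loads(raw)


# The outer metadata carries one opaque binary blob per row group.
row_groups = [
    LazyField(json.dumps({"row_group": i, "columns": ["a", "b"]}).encode(),
              decode_row_group)
    for i in range(3)
]

# Only the row group that is actually needed gets decoded, and only once.
wanted = row_groups[1].get()
wanted_again = row_groups[1].get()
```

The outer structure stays cheap to parse because each nested blob is skipped
as a single length-prefixed binary field until someone asks for it.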

Regards

Antoine.



On Mon, 27 May 2024 23:06:37 -0700
Micah Kornfield <[email protected]>
wrote:
> As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on
> improvements to the footer metadata.
> 
> Based on conversation so far, there have been a few proposals [3][4][5] to
> help better support files with wide schemas and many row-groups.  I think
> there are a lot of interesting ideas in each. It would be good to get
> further feedback on these to make sure we aren't missing anything and
> define a minimal first iteration for doing experimental benchmarking to
> prove out an approach.
> 
> I think the next steps would ideally be:
> 1.  Come to a consensus on the overall approach.
> 2.  Build prototypes to benchmark/test and validate the approaches defined (if we
> can't come to consensus in item #1, this might help choose a direction).
> 3.  Divide up any final approach into as fine-grained features as possible.
> 4.  Implement across parquet-java, parquet-cpp, parquet-rs (and any other
> implementations that we can get volunteers for).  Additionally, if new APIs
> are needed to make use of the new structure, it would be good to try to
> prototype against consumers of Parquet.
> 
> Knowing that we have enough people interested in doing #3 is critical to
> success, so if you have time to devote, it would be helpful to chime in
> here (I know some people already noted they could help in the original
> thread).
> 
> I think it is likely we will need either an in-person sync or another, more
> focused design document. I am happy to try to facilitate this
> (once we have a better sense of who wants to be involved and what time
> zones they are in I can schedule a sync if necessary).
> 
> Thanks,
> Micah
> 
> [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> [2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> [3] https://github.com/apache/parquet-format/pull/242
> [4] https://github.com/apache/parquet-format/pull/248
> [5] https://github.com/apache/parquet-format/pull/250
> 


