Hello Micah,
First, kudos for doing this! I like your attempt to put the "new" file metadata after the legacy one in https://github.com/apache/parquet-format/pull/250, and I hope it can actually be made to work (it requires current Parquet readers to allow/ignore arbitrary padding at the end of the v1 Thrift metadata). Some assorted comments on other changes that PR is doing: - I'm biased, but I find it much cleaner to define new Thrift structures (FileMetadataV3, etc.), rather than painstakinly document which fields are to be omitted in V3. That would achieve three goals: 1) make the spec easier to read (even though it would be physically longer); 2) make it easier to produce a conformant implementation (special rules increase the risks of misunderstandings and disagreements); 3) allow a later cleanup of the spec once we agree to get rid of V1 structs. - The new encoding in that PR seems like it should be moved to a separate PR and be discussed in the encodings thread? - I'm a bit skeptical about moving Thrift lists into data pages, rather than, say, just embed the corresponding Thrift serialization as binary fields for lazy deserialization. Regards Antoine. On Mon, 27 May 2024 23:06:37 -0700 Micah Kornfield <[email protected]> wrote: > As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on > improvements to the footer metadata. > > Based on conversation so far, there have been a few proposals [3][4][5] to > help better support files with wide schemas and many row-groups. I think > there are a lot of interesting ideas in each. It would be good to get > further feedback on these to make sure we aren't missing anything and > define a minimal first iteration for doing experimental benchmarking to > prove out an approach. > > I think the next steps would ideally be: > 1. Come to a consensus on the overall approach. > 2. Prototypes to Benchmark/test to validate the approaches defined (if we > can't come to consensus in item #1, this might help choose a direction). > 3. Divide up any final approach into as fine-grained features as possible. > 4. Implement across parquet-java, parquet-cpp, parquet-rs (and any other > implementations that we can get volunteers for). Additionally, if new APIs > are needed to make use of the new structure, it would be good to try to > prototype against consumers of Parquet. > > Knowing that we have enough people interested in doing #3 is critical to > success, so if you have time to devote, it would be helpful to chime in > here (I know some people already noted they could help in the original > thread). > > I think it is likely we will need either an in person sync or another more > focused design document could help. I am happy to try to facilitate this > (once we have a better sense of who wants to be involved and what time > zones they are in I can schedule a sync if necessary). > > Thanks, > Micah > > [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo > [2] > https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit > [3] https://github.com/apache/parquet-format/pull/242 > [4] https://github.com/apache/parquet-format/pull/248 > [5] https://github.com/apache/parquet-format/pull/250 >
