I have now merged this PR. Thank you all for the feedback. (esp: Micah, Marc, Andrew, Ryan)

On Tue, Sep 23, 2025 at 11:29 AM Julien Le Dem <[email protected]> wrote:

> Hello,
> Micah approved the PR and I made the last tweaks based on the feedback (Thank you Micah and Marc).
> I am planning to merge the PR soon.
> https://github.com/apache/parquet-format/pull/513
> This is your chance to chime in. (or, you know, open a PR later if you want to make changes afterwards.)
> Once this is merged, I heard some people who are looking forward to test-driving this with proposals for new encodings.
> I am looking forward to it!
> Julien
>
> On Tue, Sep 2, 2025 at 4:54 PM Julien Le Dem <[email protected]> wrote:
>
>> FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
>> If you could take a second look, I would appreciate it.
>> Thank you !
>>
>> On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem <[email protected]> wrote:
>>
>>> Thank you for the feedback.
>>> I have updated the PR with all the feedback and introduced language to remove gatekeeping as much as possible and encourage people to feel empowered to propose and contribute new things.
>>>
>>> https://github.com/apache/parquet-format/pull/513
>>> See in tree here:
>>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>>
>>> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb <[email protected]> wrote:
>>>
>>>> I think the PR[1][2] that Julien created is a pretty nice high level flow as it:
>>>> 1. Mostly documents clearly what is already done in practice
>>>> 2. Postpones concerns and consensus about potentially overly restrictive requirements for new features (but not trying to exhaustively specify the criteria)
>>>> 3. Gives a location to list active proposals
>>>>
>>>> We could make progress with his PR without having to come to a consensus on the criteria for inclusion.
>>>>
>>>> Once we had that high level flow up, we could try it out and formalize some of the criteria that are used for changes.
>>>>
>>>> Andrew
>>>>
>>>> [1]: https://github.com/apache/parquet-format/pull/513
>>>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>>>
>>>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield <[email protected]> wrote:
>>>>
>>>> > > In this situation, it's great to say that we want people to run benchmarks on some representative datasets and I agree that we probably want a substantial performance improvement to justify the cost of support. But I think we need to see these things as guidelines and not require running 20
>>>> >
>>>> > The intention at least in the doc was to require 20 plus datasets but to collect at least a list of open datasets that we can narrow down. What I would at least like to see is a fairly standard set of data to make comparisons consistent. We also discussed this in the sync. I think it will be up to someone who has bandwidth to help at least designate a subset of what we want to include.
>>>> >
>>>> > > benchmarks or not considering features with 9% improvements across the board.
>>>> >
>>>> > Sure, we can maybe make the language softer language on having a target percentage be a target goal but there can be trade-offs.
>>>> >
>>>> > I actually think having some sort of baseline helps to function as making things easier in some ways as long as other requirements are met because it removes some amount of subjectivity.
>>>> >
>>>> > Cheers,
>>>> > Micah
>>>> >
>>>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <[email protected]> wrote:
>>>> >
>>>> > > I agree that the goal is to make contributions easier and not a daunting process.
>>>> > > We could start the process by separating bigger projects that are impacting the format in a non backward compatible way (new encodings, new footer, etc), versus things that are not as impacting (for example adding metadata that can be ignored by older readers).
>>>> > > The goal of the "proposals" list I'm outlining above is really only for bigger projects where we need collaboration across the ecosystem (like we just did for Variant).
>>>> > > I'm taking inspiration from other projects here: Airflow Improvement Proposals <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals> or Flink Improvement Proposals <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>
>>>> > > I think it's also useful to have a central place to find those.
>>>> > >
>>>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <[email protected]> wrote:
>>>> > >
>>>> > > > I like many things about the write up, but I want to call out one potential pitfall.
>>>> > > >
>>>> > > > I think that this is needed so that we can evolve the project and so we have a well-understood path for adding new encodings and improvements. If we can't add new things, then the project will become outdated and irrelevant.
>>>> > > >
>>>> > > > I'd like to keep that goal in mind when discussing the path that we are documenting because there is a risk of having the opposite effect: by adding so much process or so many requirements to satisfy that people choose not to contribute or can't make it through to the end.
>>>> > > >
>>>> > > > You can see this risk at play with many ASF projects that have a well-defined "path to committer". Often these docs start with guidelines that say something like "you'll generally need to contribute for about a year" to give context, but those things turn into rules and the community doesn't add anyone that hasn't been around for a year.
>>>> > > >
>>>> > > > In this situation, it's great to say that we want people to run benchmarks on some representative datasets and I agree that we probably want a substantial performance improvement to justify the cost of support. But I think we need to see these things as guidelines and not require running 20 benchmarks or not considering features with 9% improvements across the board.
>>>> > > >
>>>> > > > Ryan
>>>> > > >
>>>> > > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <[email protected]> wrote:
>>>> > > >
>>>> > > > > I opened a Draft PR to illustrate what this could look like.
>>>> > > > > https://github.com/apache/parquet-format/pull/513
>>>> > > > > See in tree here:
>>>> > > > > https://github.com/apache/parquet-format/tree/proposals/proposals
>>>> > > > >
>>>> > > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <[email protected]> wrote:
>>>> > > > >
>>>> > > > > > IMO, this doc is pretty close to being ready to be published. We can always improve it as we go.
>>>> > > > > >
>>>> > > > > > I think that one important part of the whole process is to make it easy for everyone to see what proposals are ongoing and their status and a clear step to move from proposal/evaluation to implementation.
>>>> > > > > >
>>>> > > > > > Once we agree the doc is close enough, I would propose to publish it in markdown on the parquet-format repo, organized as follows:
>>>> > > > > > - The section "Baseline Requirements for new additions" as its own page, documenting how to approach the design of a parquet change and the underlying constraints.
>>>> > > > > > - We add a physical process to list proposals in the parquet-format github Repo as follows.
>>>> > > > > > - The steps described in the section "Incorporating encoding/compression improvements" become the process on how someone creates a proposal and starts a POC.
>>>> > > > > > - I would complement it by the following steps for people to publish their proposals:
>>>> > > > > > - We create a folder in the parquet-format repo to hold the proposals.
>>>> > > > > > - a Readme in the folder tracks the ongoing POCs and status.
>>>> > > > > > - Initiating a proposal starts with a github issue. We create a template for it based on what's outlined in that section of the doc.
>>>> > > > > > - If the discussion concludes that the proposal is worth a POC, the author opens a PR to add the proposal in markdown in the proposals folder. It links to the Github issue where the discussion preceding the proposal occurred. More people can contribute to the POC as needed.
>>>> > > > > > - POC and perf evaluation are implemented as part of the proposal.
>>>> > > > > > - a vote by the PMC moves the proposal to actual feature in the format (based on the criteria outlined in this doc).
>>>> > > > > > - As part of the implementation step, we make sure we have cross compatible implementations as we did for Variant.
>>>> > > > > > - The section "Measuring improvements" becomes part of that process section to explain how we'll decide if the addition is worth adding to the spec for the complexity it is adding.
>>>> > > > > >
>>>> > > > > > If that makes sense to you all, I can draft a PR to make this proposal a little more concrete.
>>>> > > > > >
>>>> > > > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <[email protected]> wrote:
>>>> > > > > >
>>>> > > > > >> I would like to bump this thread as it came up again on the parquet sync call today
>>>> > > > > >>
>>>> > > > > >> Specifically, it seems like there is increasing interest in adding new encodings to the Parquet, so getting consensus on what that process looks like and what is required is more important.
>>>> > > > > >>
>>>> > > > > >> If you are interested in this topic, please leave comments on the Google Doc[1] or reply to this email chain.
>>>> > > > > >>
>>>> > > > > >> Thank you,
>>>> > > > > >> Andrew
>>>> > > > > >>
>>>> > > > > >> [1] https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>>> > > > > >>
>>>> > > > > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <[email protected]> wrote:
>>>> > > > > >>
>>>> > > > > >> > I wrote up a long overdue draft <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0> [1] on how we can move forward with additional features (it provides some proposed requirements on both consuming third-party code, as well as some more specific guidance on new encodings, and some orthogonal work that would be nice to see).
>>>> > > > > >> >
>>>> > > > > >> > The doc still lacks some details, and might be too opinionated in places but I think it serves as a good basis for conversation (and at least gets me out of the critical path for evolving Parquet).
>>>> > > > > >> >
>>>> > > > > >> > I'm very excited to start moving forward with improvements.
>>>> > > > > >> >
>>>> > > > > >> > Thanks,
>>>> > > > > >> > Micah
>>>> > > > > >> >
>>>> > > > > >> > [1] https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
