Hello,

Micah approved the PR and I made the last tweaks based on the feedback (thank you, Micah and Marc).
I am planning to merge the PR soon:
https://github.com/apache/parquet-format/pull/513

This is your chance to chime in (or, you know, open a PR later if you want to make changes afterwards).

Once this is merged, I hear some people are looking forward to test-driving this with proposals for new encodings.
I am looking forward to it!

Julien

On Tue, Sep 2, 2025 at 4:54 PM Julien Le Dem <[email protected]> wrote:

> FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
> If you could take a second look, I would appreciate it.
> Thank you !
>
> On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem <[email protected]> wrote:
>
>> Thank you for the feedback.
>> I have updated the PR with all the feedback and introduced language to
>> remove gatekeeping as much as possible and encourage people to feel
>> empowered to propose and contribute new things.
>>
>> https://github.com/apache/parquet-format/pull/513
>> See in tree here:
>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>
>> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb <[email protected]> wrote:
>>
>>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>>> as it:
>>> 1. Mostly documents clearly what is already done in practice
>>> 2. Postpones concerns and consensus about potentially overly restrictive
>>>    requirements for new features (but not trying to exhaustively specify the
>>>    criteria)
>>> 3. Gives a location to list active proposals
>>>
>>> We could make progress with his PR without having to come to a consensus
>>> on the criteria for inclusion.
>>>
>>> Once we had that high level flow up, we could try it out and formalize
>>> some of the criteria that are used for changes.
>>>
>>> Andrew
>>>
>>> [1]: https://github.com/apache/parquet-format/pull/513
>>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>>
>>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield <[email protected]> wrote:
>>>
>>> > > In this situation, it's great to say that we want people to run
>>> > > benchmarks on some representative datasets and I agree that we probably
>>> > > want a substantial performance improvement to justify the cost of support.
>>> > > But I think we need to see these things as guidelines and not require
>>> > > running 20
>>> >
>>> > The intention at least in the doc was to require 20 plus datasets but to
>>> > collect at least a list of open datasets that we can narrow down. What I
>>> > would at least like to see is a fairly standard set of data to make
>>> > comparisons consistent. We also discussed this in the sync. I think it
>>> > will be up to someone who has bandwidth to help at least designate a
>>> > subset of what we want to include.
>>> >
>>> > > benchmarks or not considering features with 9% improvements across the
>>> > > board.
>>> >
>>> > Sure, we can maybe make the language softer on having a target
>>> > percentage be a goal, but there can be trade-offs.
>>> >
>>> > I actually think having some sort of baseline helps make things easier
>>> > in some ways, as long as other requirements are met, because it
>>> > removes some amount of subjectivity.
>>> >
>>> > Cheers,
>>> > Micah
>>> >
>>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <[email protected]> wrote:
>>> >
>>> > > I agree that the goal is to make contributions easier and not a
>>> > > daunting process.
>>> > > We could start the process by separating bigger projects that are
>>> > > impacting the format in a non backward compatible way (new encodings,
>>> > > new footer, etc), versus things that are not as impacting (for example
>>> > > adding metadata that can be ignored by older readers).
>>> > > The goal of the "proposals" list I'm outlining above is really only for
>>> > > bigger projects where we need collaboration across the ecosystem (like
>>> > > we just did for Variant).
>>> > > I'm taking inspiration from other projects here: Airflow Improvement
>>> > > Proposals
>>> > > <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals>
>>> > > or Flink Improvement Proposals
>>> > > <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>
>>> > > I think it's also useful to have a central place to find those.
>>> > >
>>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <[email protected]> wrote:
>>> > >
>>> > > > I like many things about the write up, but I want to call out one
>>> > > > potential pitfall.
>>> > > >
>>> > > > I think that this is needed so that we can evolve the project and so we
>>> > > > have a well-understood path for adding new encodings and improvements.
>>> > > > If we can't add new things, then the project will become outdated and
>>> > > > irrelevant.
>>> > > >
>>> > > > I'd like to keep that goal in mind when discussing the path that we are
>>> > > > documenting because there is a risk of having the opposite effect: by
>>> > > > adding so much process or so many requirements to satisfy that people
>>> > > > choose not to contribute or can't make it through to the end.
>>> > > >
>>> > > > You can see this risk at play with many ASF projects that have a
>>> > > > well-defined "path to committer". Often these docs start with
>>> > > > guidelines that say something like "you'll generally need to contribute
>>> > > > for about a year" to give context, but those things turn into rules and
>>> > > > the community doesn't add anyone that hasn't been around for a year.
>>> > > >
>>> > > > In this situation, it's great to say that we want people to run
>>> > > > benchmarks on some representative datasets and I agree that we probably
>>> > > > want a substantial performance improvement to justify the cost of support.
>>> > > > But I think we need to see these things as guidelines and not require
>>> > > > running 20 benchmarks or not considering features with 9% improvements
>>> > > > across the board.
>>> > > >
>>> > > > Ryan
>>> > > >
>>> > > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <[email protected]> wrote:
>>> > > >
>>> > > > > I opened a Draft PR to illustrate what this could look like.
>>> > > > > https://github.com/apache/parquet-format/pull/513
>>> > > > > See in tree here:
>>> > > > > https://github.com/apache/parquet-format/tree/proposals/proposals
>>> > > > >
>>> > > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <[email protected]> wrote:
>>> > > > >
>>> > > > > > IMO, this doc is pretty close to being ready to be published. We can
>>> > > > > > always improve it as we go.
>>> > > > > >
>>> > > > > > I think that one important part of the whole process is to make it
>>> > > > > > easy for everyone to see what proposals are ongoing and their status
>>> > > > > > and a clear step to move from proposal/evaluation to implementation.
>>> > > > > >
>>> > > > > > Once we agree the doc is close enough, I would propose to publish it
>>> > > > > > in markdown on the parquet-format repo, organized as follows:
>>> > > > > > - The section "Baseline Requirements for new additions" as its own
>>> > > > > > page, documenting how to approach the design of a parquet change and
>>> > > > > > the underlying constraints.
>>> > > > > > - We add a physical process to list proposals in the parquet-format
>>> > > > > > github Repo as follows.
>>> > > > > > - The steps described in the section "Incorporating
>>> > > > > > encoding/compression improvements" become the process on how someone
>>> > > > > > creates a proposal and starts a POC.
>>> > > > > > - I would complement it by the following steps for people to publish
>>> > > > > > their proposals:
>>> > > > > >    - We create a folder in the parquet-format repo to hold the
>>> > > > > >      proposals.
>>> > > > > >    - a Readme in the folder tracks the ongoing POCs and status.
>>> > > > > >    - Initiating a proposal starts with a github issue. We create a
>>> > > > > >      template for it based on what's outlined in that section of the
>>> > > > > >      doc.
>>> > > > > >    - If the discussion concludes that the proposal is worth a POC,
>>> > > > > >      the author opens a PR to add the proposal in markdown in the
>>> > > > > >      proposals folder. It links to the Github issue where the
>>> > > > > >      discussion preceding the proposal occurred. More people can
>>> > > > > >      contribute to the POC as needed.
>>> > > > > >    - POC and perf evaluation are implemented as part of the proposal.
>>> > > > > >    - a vote by the PMC moves the proposal to actual feature in the
>>> > > > > >      format (based on the criteria outlined in this doc).
>>> > > > > >    - As part of the implementation step, we make sure we have cross
>>> > > > > >      compatible implementations as we did for Variant.
>>> > > > > > - The section "Measuring improvements" becomes part of that process
>>> > > > > > section to explain how we'll decide if the addition is worth adding
>>> > > > > > to the spec for the complexity it is adding.
>>> > > > > >
>>> > > > > > If that makes sense to you all, I can draft a PR to make this
>>> > > > > > proposal a little more concrete.
>>> > > > > >
>>> > > > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <[email protected]> wrote:
>>> > > > > >
>>> > > > > >> I would like to bump this thread as it came up again on the parquet
>>> > > > > >> sync call today
>>> > > > > >>
>>> > > > > >> Specifically, it seems like there is increasing interest in adding
>>> > > > > >> new encodings to the Parquet, so getting consensus on what that
>>> > > > > >> process looks like and what is required is more important.
>>> > > > > >>
>>> > > > > >> If you are interested in this topic, please leave comments on the
>>> > > > > >> Google Doc[1] or reply to this email chain.
>>> > > > > >>
>>> > > > > >> Thank you,
>>> > > > > >> Andrew
>>> > > > > >>
>>> > > > > >> [1]
>>> > > > > >> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>> > > > > >>
>>> > > > > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <[email protected]> wrote:
>>> > > > > >>
>>> > > > > >> > I wrote up a long overdue draft
>>> > > > > >> > <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0>
>>> > > > > >> > [1]
>>> > > > > >> > on how we can move forward with additional features (it provides
>>> > > > > >> > some proposed requirements on both consuming third-party code, as
>>> > > > > >> > well as some more specific guidance on new encodings, and some
>>> > > > > >> > orthogonal work that would be nice to see).
>>> > > > > >> >
>>> > > > > >> > The doc still lacks some details, and might be too opinionated in
>>> > > > > >> > places but I think it serves as a good basis for conversation (and
>>> > > > > >> > at least gets me out of the critical path for evolving Parquet).
>>> > > > > >> >
>>> > > > > >> > I'm very excited to start moving forward with improvements.
>>> > > > > >> >
>>> > > > > >> > Thanks,
>>> > > > > >> > Micah
>>> > > > > >> >
>>> > > > > >> > [1]
>>> > > > > >> > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
