Hello,
Micah approved the PR and I made the last tweaks based on the feedback
(Thank you Micah and Marc).
I am planning to merge the PR soon.
https://github.com/apache/parquet-format/pull/513
This is your chance to chime in (or, you know, open a PR later if you want
to make changes).
Once this is merged, I hear some people are looking forward to
test-driving it with proposals for new encodings.
I am looking forward to it!
Julien

On Tue, Sep 2, 2025 at 4:54 PM Julien Le Dem <[email protected]> wrote:

> FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
> If you could take a second look, I would appreciate it.
> Thank you!
>
> On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem <[email protected]> wrote:
>
>> Thank you for the feedback.
>> I have updated the PR with all the feedback and introduced language to
>> remove gatekeeping as much as possible and encourage people to feel
>> empowered to propose and contribute new things.
>>
>> https://github.com/apache/parquet-format/pull/513
>> See in tree here:
>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>
>>
>> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb <[email protected]>
>> wrote:
>>
>>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>>> as it:
>>> 1. Mostly documents clearly what is already done in practice
>>> 2. Postpones concerns and consensus about potentially overly restrictive
>>> requirements for new features (but not trying to exhaustively specify the
>>> criteria)
>>> 3. Gives a location to list active proposals
>>>
>>> We could make progress with his PR without having to come to a consensus
>>> on
>>> the criteria for inclusion.
>>>
>>> Once we had that high level flow up, we could try it out and formalize
>>> some of the criteria that are used for changes.
>>>
>>> Andrew
>>>
>>>
>>> [1]: https://github.com/apache/parquet-format/pull/513
>>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>>
>>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>> > >
>>> > > In this situation, it's great to say that we want people to run
>>> > benchmarks
>>> > > on some representative datasets and I agree that we probably want a
>>> > > substantial performance improvement to justify the cost of support.
>>> But I
>>> > > think we need to see these things as guidelines and not require
>>> running
>>> > 20
>>> > >
>>> > > The intention at least in the doc was not to require 20 plus datasets
>>> but to
>>> > collect at least a list of open datasets that we can narrow down.
>>> What I
>>> > would at least like to see is a fairly standard set of data to make
>>> > comparisons consistent.   We also discussed this in the sync.  I think
>>> it
>>> > will be up to someone who has bandwidth to help at least designate a
>>> subset
>>> > of what we want to include.
>>> >
>>> > benchmarks or not considering features with 9% improvements across the
>>> > > board.
>>> >
>>> > Sure, we can maybe make the language softer on having a target
>>> > percentage be a target goal but there can be trade-offs.
>>> >
>>> > I actually think having some sort of baseline makes things easier in
>>> > some ways, as long as other requirements are met, because it removes
>>> > some amount of subjectivity.
>>> >
>>> > Cheers,
>>> > Micah
>>> >
>>> >
>>> >
>>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <[email protected]>
>>> wrote:
>>> >
>>> > > I agree that the goal is to make contributions easier and not a
>>> daunting
>>> > > process.
>>> > > We could start the process by separating bigger projects that are
>>> > impacting
>>> > > the format in a non backward compatible way (new encodings, new
>>> footer,
>>> > > etc), versus things that are not as impacting (for example adding
>>> > metadata
>>> > > that can be ignored by older readers).
>>> > > The goal of the "proposals" list I'm outlining above is really only
>>> for
>>> > > bigger projects where we need collaboration across the ecosystem
>>> (like we
>>> > > just did for Variant).
>>> > > I'm taking inspiration from other projects here: Airflow Improvement
>>> > > Proposals
>>> > > <
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals
>>> > > >
>>> > >  or Flink Improvement Proposals
>>> > > <
>>> > >
>>> >
>>> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals
>>> > > >
>>> > > I think it's also useful to have a central place to find those.
>>> > >
>>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <[email protected]> wrote:
>>> > >
>>> > > > I like many things about the write up, but I want to call out one
>>> > > potential
>>> > > > pitfall.
>>> > > >
>>> > > > I think that this is needed so that we can evolve the project and
>>> so we
>>> > > > have a well-understood path for adding new encodings and
>>> improvements.
>>> > If
>>> > > > we can't add new things, then the project will become outdated and
>>> > > > irrelevant.
>>> > > >
>>> > > > I'd like to keep that goal in mind when discussing the path that
>>> we are
>>> > > > documenting because there is a risk of having the opposite effect:
>>> by
>>> > > > adding so much process or so many requirements to satisfy that
>>> people
>>> > > > choose not to contribute or can't make it through to the end.
>>> > > >
>>> > > > You can see this risk at play with many ASF projects that have a
>>> > > > well-defined "path to committer". Often these docs start with
>>> > guidelines
>>> > > > that say something like "you'll generally need to contribute for
>>> about
>>> > a
>>> > > > year" to give context, but those things turn into rules and the
>>> > community
>>> > > > doesn't add anyone that hasn't been around for a year.
>>> > > >
>>> > > > In this situation, it's great to say that we want people to run
>>> > > benchmarks
>>> > > > on some representative datasets and I agree that we probably want a
>>> > > > substantial performance improvement to justify the cost of support.
>>> > But I
>>> > > > think we need to see these things as guidelines and not require
>>> running
>>> > > 20
>>> > > > benchmarks or not considering features with 9% improvements across
>>> the
>>> > > > board.
>>> > > >
>>> > > > Ryan
>>> > > >
>>> > > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <[email protected]>
>>> > wrote:
>>> > > >
>>> > > > > I opened a Draft PR to illustrate what this could look like.
>>> > > > > https://github.com/apache/parquet-format/pull/513
>>> > > > > See in tree here:
>>> > > > >
>>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>> > > > >
>>> > > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <[email protected]>
>>> > > wrote:
>>> > > > >
>>> > > > > > IMO, this doc is pretty close to being ready to be published.
>>> We
>>> > can
>>> > > > > > always improve it as we go.
>>> > > > > >
>>> > > > > > I think that one important part of the whole process is to
>>> make it
>>> > > easy
>>> > > > > > for everyone to see what proposals are ongoing and their status
>>> > and a
>>> > > > > clear
>>> > > > > > step to move from proposal/evaluation to implementation.
>>> > > > > >
>>> > > > > > Once we agree the doc is close enough, I would propose to
>>> publish
>>> > it
>>> > > in
>>> > > > > > markdown on the parquet-format repo, organized as follows:
>>> > > > > > - The section "Baseline Requirements for new additions" as its
>>> own
>>> > > > page,
>>> > > > > > documenting how to approach the design of a parquet change and
>>> the
>>> > > > > > underlying constraints.
>>> > > > > > - We add a physical process to list proposals in the
>>> parquet-format
>>> > > > > github
>>> > > > > > Repo as follows.
>>> > > > > > - The steps described in the section "Incorporating
>>> > > > encoding/compression
>>> > > > > > improvements" become the process on how someone creates a
>>> proposal
>>> > > and
>>> > > > > > starts a POC.
>>> > > > > > - I would complement it by the following steps for people to
>>> > publish
>>> > > > > their
>>> > > > > > proposals:
>>> > > > > >    - We create a folder in the parquet-format repo to hold the
>>> > > > proposals.
>>> > > > > >    - a Readme in the folder tracks the ongoing POCs and status.
>>> > > > > >    - Initiating a proposal starts with a github issue. We
>>> create a
>>> > > > > > template for it based on what's outlined in that section of the
>>> > doc.
>>> > > > > >    - If the discussion concludes that the proposal is worth a
>>> POC,
>>> > > > > > the author opens a PR to add the proposal in markdown in the
>>> > > proposals
>>> > > > > > folder. It links to the Github issue where the discussion
>>> preceding
>>> > > the
>>> > > > > > proposal occurred. More people can contribute to the POC as
>>> needed.
>>> > > > > >    - POC and perf evaluation are implemented as part of the
>>> > proposal.
>>> > > > > >    - a vote by the PMC moves the proposal to actual feature in
>>> the
>>> > > > format
>>> > > > > > (based on the criteria outlined in this doc).
>>> > > > > >    - As part of the implementation step, we make sure we have
>>> cross
>>> > > > > > compatible implementations as we did for Variant.
>>> > > > > > - The section "Measuring improvements" becomes part of that
>>> process
>>> > > > > > section to explain how we'll decide if the addition is worth
>>> adding
>>> > > to
>>> > > > > the
>>> > > > > > spec for the complexity it is adding.
>>> > > > > >
>>> > > > > > If that makes sense to you all, I can draft a PR to make this
>>> > > proposal
>>> > > > a
>>> > > > > > little more concrete.
>>> > > > > >
>>> > > > > >
>>> > > > > >
>>> > > > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <
>>> > [email protected]>
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > >> I would like to bump this thread as it came up again on the
>>> > parquet
>>> > > > sync
>>> > > > > >> call today
>>> > > > > >>
>>> > > > > >> Specifically, it seems like there is increasing interest in
>>> adding
>>> > > new
>>> > > > > >> encodings to Parquet, so getting consensus on what that
>>> > process
>>> > > > > looks
>>> > > > > >> like and what is required is more important.
>>> > > > > >>
>>> > > > > >> If you are interested in this topic, please leave comments on
>>> the
>>> > > > Google
>>> > > > > >> Doc[1] or reply to this email chain.
>>> > > > > >>
>>> > > > > >> Thank you,
>>> > > > > >> Andrew
>>> > > > > >>
>>> > > > > >> [1]
>>> > > > > >>
>>> > > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>> > > > > >>
>>> > > > > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <
>>> > > > [email protected]>
>>> > > > > >> wrote:
>>> > > > > >>
>>> > > > > >> > I wrote up a long overdue draft
>>> > > > > >> > <
>>> > > > > >> >
>>> > > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>> > > > > >> > >
>>> > > > > >> > [1]
>>> > > > > >> > on how we can move forward with additional features (it
>>> provides
>>> > > > some
>>> > > > > >> > proposed requirements on both consuming third-party code, as
>>> > well
>>> > > as
>>> > > > > >> some
>>> > > > > >> > more specific guidance on new encodings, and some orthogonal
>>> > work
>>> > > > that
>>> > > > > >> > would be nice to see).
>>> > > > > >> >
>>> > > > > >> > The doc still lacks some details, and might be too
>>> opinionated
>>> > in
>>> > > > > places
>>> > > > > >> > but I think it serves as a good basis for conversation (and
>>> at
>>> > > least
>>> > > > > >> gets
>>> > > > > >> > me out of the critical path for evolving Parquet).
>>> > > > > >> >
>>> > > > > >> > I'm very excited to start moving forward with improvements.
>>> > > > > >> >
>>> > > > > >> > Thanks,
>>> > > > > >> > Micah
>>> > > > > >> >
>>> > > > > >> > [1]
>>> > > > > >> >
>>> > > > > >> >
>>> > > > > >>
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>> > > > > >> >
>>> > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
