I have now merged this PR.
Thank you all for the feedback.
(esp: Micah, Marc, Andrew, Ryan)


On Tue, Sep 23, 2025 at 11:29 AM Julien Le Dem <[email protected]> wrote:

> Hello,
> Micah approved the PR and I made the last tweaks based on the feedback
> (Thank you Micah and Marc).
> I am planning to merge the PR soon.
> https://github.com/apache/parquet-format/pull/513
> This is your chance to chime in. (or, you know, open a PR later if you
> want to make changes afterwards.)
> Once this is merged, I heard some people are looking forward to
> test-driving this with proposals for new encodings.
> I am looking forward to it!
> Julien
>
> On Tue, Sep 2, 2025 at 4:54 PM Julien Le Dem <[email protected]> wrote:
>
>> FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
>> If you could take a second look, I would appreciate it.
>> Thank you!
>>
>> On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem <[email protected]> wrote:
>>
>>> Thank you for the feedback.
>>> I have updated the PR with all the feedback and introduced language to
>>> remove gatekeeping as much as possible and encourage people to feel
>>> empowered to propose and contribute new things.
>>>
>>> https://github.com/apache/parquet-format/pull/513
>>> See in tree here:
>>> https://github.com/apache/parquet-format/tree/proposals/proposals
>>>
>>>
>>> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb <[email protected]>
>>> wrote:
>>>
>>>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>>>> as it:
>>>> 1. Mostly documents clearly what is already done in practice
>>>> 2. Postpones concerns and consensus about potentially overly restrictive
>>>> requirements for new features (but not trying to exhaustively specify the
>>>> criteria)
>>>> 3. Gives a location to list active proposals
>>>>
>>>> We could make progress with his PR without having to come to a consensus
>>>> on the criteria for inclusion.
>>>>
>>>> Once we had that high level flow up, we could try it out and formalize
>>>> some of the criteria that are used for changes.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> [1]: https://github.com/apache/parquet-format/pull/513
>>>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>>>
>>>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>> > > In this situation, it's great to say that we want people to run
>>>> > > benchmarks on some representative datasets and I agree that we
>>>> > > probably want a substantial performance improvement to justify the
>>>> > > cost of support. But I think we need to see these things as
>>>> > > guidelines and not require running 20
>>>> >
>>>> > The intention, at least in the doc, was not to require 20-plus datasets
>>>> > but to collect at least a list of open datasets that we can narrow down.
>>>> > What I would at least like to see is a fairly standard set of data to
>>>> > make comparisons consistent. We also discussed this in the sync. I think
>>>> > it will be up to someone who has bandwidth to help at least designate a
>>>> > subset of what we want to include.
>>>> >
>>>> > > benchmarks or not considering features with 9% improvements across the
>>>> > > board.
>>>> >
>>>> > Sure, we can maybe make the language softer on having a target
>>>> > percentage be a goal, but there can be trade-offs.
>>>> >
>>>> > I actually think having some sort of baseline makes things easier in
>>>> > some ways, as long as other requirements are met, because it removes
>>>> > some amount of subjectivity.
>>>> >
>>>> > Cheers,
>>>> > Micah
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <[email protected]> wrote:
>>>> >
>>>> > > I agree that the goal is to make contributions easier and not a
>>>> > > daunting process.
>>>> > > We could start the process by separating bigger projects that are
>>>> > > impacting the format in a non backward compatible way (new encodings,
>>>> > > new footer, etc), versus things that are not as impacting (for example
>>>> > > adding metadata that can be ignored by older readers).
>>>> > > The goal of the "proposals" list I'm outlining above is really only
>>>> > > for bigger projects where we need collaboration across the ecosystem
>>>> > > (like we just did for Variant).
>>>> > > I'm taking inspiration from other projects here: Airflow Improvement
>>>> > > Proposals
>>>> > > <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals>
>>>> > > or Flink Improvement Proposals
>>>> > > <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>
>>>> > > I think it's also useful to have a central place to find those.
>>>> > >
>>>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <[email protected]> wrote:
>>>> > >
>>>> > > > I like many things about the write up, but I want to call out one
>>>> > > > potential pitfall.
>>>> > > >
>>>> > > > I think that this is needed so that we can evolve the project and so
>>>> > > > we have a well-understood path for adding new encodings and
>>>> > > > improvements. If we can't add new things, then the project will
>>>> > > > become outdated and irrelevant.
>>>> > > >
>>>> > > > I'd like to keep that goal in mind when discussing the path that we
>>>> > > > are documenting because there is a risk of having the opposite
>>>> > > > effect: by adding so much process or so many requirements to satisfy
>>>> > > > that people choose not to contribute or can't make it through to the
>>>> > > > end.
>>>> > > >
>>>> > > > You can see this risk at play with many ASF projects that have a
>>>> > > > well-defined "path to committer". Often these docs start with
>>>> > > > guidelines that say something like "you'll generally need to
>>>> > > > contribute for about a year" to give context, but those things turn
>>>> > > > into rules and the community doesn't add anyone that hasn't been
>>>> > > > around for a year.
>>>> > > >
>>>> > > > In this situation, it's great to say that we want people to run
>>>> > > > benchmarks on some representative datasets and I agree that we
>>>> > > > probably want a substantial performance improvement to justify the
>>>> > > > cost of support. But I think we need to see these things as
>>>> > > > guidelines and not require running 20 benchmarks or not considering
>>>> > > > features with 9% improvements across the board.
>>>> > > >
>>>> > > > Ryan
>>>> > > >
>>>> > > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <[email protected]>
>>>> > > > wrote:
>>>> > > >
>>>> > > > > I opened a Draft PR to illustrate what this could look like.
>>>> > > > > https://github.com/apache/parquet-format/pull/513
>>>> > > > > See in tree here:
>>>> > > > > https://github.com/apache/parquet-format/tree/proposals/proposals
>>>> > > > >
>>>> > > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <[email protected]>
>>>> > > > > wrote:
>>>> > > > >
>>>> > > > > > IMO, this doc is pretty close to being ready to be published. We
>>>> > > > > > can always improve it as we go.
>>>> > > > > >
>>>> > > > > > I think that one important part of the whole process is to make
>>>> > > > > > it easy for everyone to see what proposals are ongoing and their
>>>> > > > > > status and a clear step to move from proposal/evaluation to
>>>> > > > > > implementation.
>>>> > > > > >
>>>> > > > > > Once we agree the doc is close enough, I would propose to publish
>>>> > > > > > it in markdown on the parquet-format repo, organized as follows:
>>>> > > > > > - The section "Baseline Requirements for new additions" as its
>>>> > > > > > own page, documenting how to approach the design of a parquet
>>>> > > > > > change and the underlying constraints.
>>>> > > > > > - We add a physical process to list proposals in the
>>>> > > > > > parquet-format github Repo as follows.
>>>> > > > > > - The steps described in the section "Incorporating
>>>> > > > > > encoding/compression improvements" become the process on how
>>>> > > > > > someone creates a proposal and starts a POC.
>>>> > > > > > - I would complement it by the following steps for people to
>>>> > > > > > publish their proposals:
>>>> > > > > >    - We create a folder in the parquet-format repo to hold the
>>>> > > > > > proposals.
>>>> > > > > >    - a Readme in the folder tracks the ongoing POCs and status.
>>>> > > > > >    - Initiating a proposal starts with a github issue. We create
>>>> > > > > > a template for it based on what's outlined in that section of the
>>>> > > > > > doc.
>>>> > > > > >    - If the discussion concludes that the proposal is worth a
>>>> > > > > > POC, the author opens a PR to add the proposal in markdown in the
>>>> > > > > > proposals folder. It links to the Github issue where the
>>>> > > > > > discussion preceding the proposal occurred. More people can
>>>> > > > > > contribute to the POC as needed.
>>>> > > > > >    - POC and perf evaluation are implemented as part of the
>>>> > > > > > proposal.
>>>> > > > > >    - a vote by the PMC moves the proposal to actual feature in
>>>> > > > > > the format (based on the criteria outlined in this doc).
>>>> > > > > >    - As part of the implementation step, we make sure we have
>>>> > > > > > cross compatible implementations as we did for Variant.
>>>> > > > > > - The section "Measuring improvements" becomes part of that
>>>> > > > > > process section to explain how we'll decide if the addition is
>>>> > > > > > worth adding to the spec for the complexity it is adding.
>>>> > > > > >
>>>> > > > > > If that makes sense to you all, I can draft a PR to make this
>>>> > > > > > proposal a little more concrete.
>>>> > > > > >
>>>> > > > > >
>>>> > > > > >
>>>> > > > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <[email protected]>
>>>> > > > > > wrote:
>>>> > > > > >
>>>> > > > > >> I would like to bump this thread as it came up again on the
>>>> > > > > >> parquet sync call today.
>>>> > > > > >>
>>>> > > > > >> Specifically, it seems like there is increasing interest in
>>>> > > > > >> adding new encodings to Parquet, so getting consensus on what
>>>> > > > > >> that process looks like and what is required is more important.
>>>> > > > > >>
>>>> > > > > >> If you are interested in this topic, please leave comments on
>>>> > > > > >> the Google Doc[1] or reply to this email chain.
>>>> > > > > >>
>>>> > > > > >> Thank you,
>>>> > > > > >> Andrew
>>>> > > > > >>
>>>> > > > > >> [1]
>>>> > > > > >> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>>> > > > > >>
>>>> > > > > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <[email protected]>
>>>> > > > > >> wrote:
>>>> > > > > >>
>>>> > > > > >> > I wrote up a long overdue draft [1] on how we can move
>>>> > > > > >> > forward with additional features (it provides some proposed
>>>> > > > > >> > requirements on both consuming third-party code, as well as
>>>> > > > > >> > some more specific guidance on new encodings, and some
>>>> > > > > >> > orthogonal work that would be nice to see).
>>>> > > > > >> >
>>>> > > > > >> > The doc still lacks some details, and might be too
>>>> > > > > >> > opinionated in places but I think it serves as a good basis
>>>> > > > > >> > for conversation (and at least gets me out of the critical
>>>> > > > > >> > path for evolving Parquet).
>>>> > > > > >> >
>>>> > > > > >> > I'm very excited to start moving forward with improvements.
>>>> > > > > >> >
>>>> > > > > >> > Thanks,
>>>> > > > > >> > Micah
>>>> > > > > >> >
>>>> > > > > >> > [1]
>>>> > > > > >> > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>>>> > > > > >> >
>>>> > > > > >>
>>>> > > > > >
>>>> > > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
