Hi Curt,

> As part of the process of amending the Parquet format, perhaps it would be
> a good idea for early implementations to generate sample files and commit
> them to apache/parquet-testing: Apache Parquet Testing
> <https://github.com/apache/parquet-testing> for other implementations to
> leverage?


It got dropped in the thread but does
https://github.com/apache/parquet-testing/pull/100 address your concerns?

Thanks,
Micah

On Wed, Apr 29, 2026 at 4:20 PM Curt Hagenlocher <[email protected]>
wrote:

> As part of the process of amending the Parquet format, perhaps it would be
> a good idea for early implementations to generate sample files and commit
> them to apache/parquet-testing: Apache Parquet Testing
> <https://github.com/apache/parquet-testing> for other implementations to
> leverage?
>
> -Curt
>
> On Wed, Apr 29, 2026 at 4:11 PM PRATEEK GAUR <[email protected]> wrote:
>
>> Thanks Andrew and Micah for review feedback on the two PRs
>> 1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
>> 2) (parquet-format repo)
>> https://github.com/apache/parquet-format/pull/557
>>
>> I have addressed all (unless I missed something) comments on the two PRs.
>>
>> Best
>> Prateek
>>
>> On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
>>
>> > Thanks Andrew and Micah.
>> >
>> > `fair amount of feedback on at least the implementations`
>> > For the c++ I have already started addressing the feedback, I should be
>> > done with that Monday/Tuesday.
>> > I think Vinoo too has been making good progress on the Java
>> implementation.
>> >
>> > Best
>> > Prateek
>> >
>> > On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
>> > wrote:
>> >
>> >> Got it. Thank you for the clarification -- I will try and look into the
>> >> spec and the Rust implementation[1] in this next week
>> >>
>> >> [1]: https://github.com/apache/arrow-rs/pull/9372
>> >>
>> >> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <
>> [email protected]>
>> >> wrote:
>> >>
>> >>> Hi Andrew,
>> >>> I think there is a fair amount of feedback on at least the
>> >>> implementations, typically I think we've waited till they are close to
>> >>> mergeable before a final vote.  Otherwise I agree we are very close.
>> >>>
>> >>> -Micah
>> >>>
>> >>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]>
>> wrote:
>> >>>
>> >>>> Thanks Prateek,
>> >>>>
>> >>>> I think from this content it looks to me like we are ready to start a
>> >>>> vote to explicitly accept ALP into Parquet
>> >>>>
>> >>>> Does anyone know of a reason we should postpone it for longer?
>> >>>> Perhaps someone needs some more time to review?
>> >>>>
>> >>>> Andrew
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi team,
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Hope everyone is doing well. I got a chance to work through all the
>> >>>>> remaining feedback and update the spec doc. Here are the new
>> artifacts
>> >>>>>
>> >>>>> 1) Spec document :
>> >>>>>
>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>> >>>>>
>> >>>>> 2) Spec document in parquet format repo :
>> >>>>> https://github.com/apache/parquet-format/pull/557
>> >>>>>
>> >>>>> 3) ALP implementation in arrow c++ repo :
>> >>>>> https://github.com/apache/arrow/pull/48345/changes
>> >>>>>
>> >>>>> 4) ALP implementation in parquet-java repo : Work for Vinoo and
>> Julien
>> >>>>>  https://github.com/apache/parquet-java/pull/3397
>> >>>>>
>> >>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
>> >>>>> https://github.com/apache/parquet-testing/pull/100
>> >>>>>
>> >>>>>
>> >>>>> And
>> >>>>>
>> >>>>>
>> >>>>>    - Go : Arnav just submitted an in progress implementation in Go.
>> >>>>>    https://github.com/apache/arrow-go/pull/704 (I haven't started
>> >>>>>    looking at it yet)
>> >>>>>    - Rust : I remember Andrew mentioned that this work is also in
>> >>>>>    progress (So 4 languages!)
>> >>>>>
>> >>>>>
>> >>>>> *Arrow C++ implementation *
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> The PR is out and was also used by Antoine to report the numbers as
>> >>>>> reported here. Micah and Konstantin have given 1 round of feedback
>> >>>>> and I'm addressing them today. Please note that the default
>> >>>>> optimization flag for compiling is O2 and not O3. I got around 70%
>> >>>>> performance improvement in the decoding speed when using the O3
>> flag.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
>> **Cross
>> >>>>> Language testing*
>> >>>>>
>> >>>>>
>> >>>>>    Let me know if you have any questions or feedback.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Now pasting some performance numbers
>> >>>>>
>> >>>>>
>> >>>>>   Table 1: C++ ALP Double Decode — Spotify Columns (Graviton 3, ARM
>> >>>>> Neoverse V1)
>> >>>>>
>> >>>>>   ┌──────────────────┬──────────────┬──────────────┬─────────┐
>> >>>>>   │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
>> >>>>>   ├──────────────────┼──────────────┼──────────────┼─────────┤
>> >>>>>   │ valence          │     3,155    │     5,523    │  1.75x  │
>> >>>>>   │ danceability     │     3,233    │     5,685    │  1.76x  │
>> >>>>>   │ energy           │     3,197    │     5,652    │  1.77x  │
>> >>>>>   │ loudness         │     3,186    │     5,473    │  1.72x  │
>> >>>>>   └──────────────────┴──────────────┴──────────────┴─────────┘
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> @Micah Kornfield <[email protected]> : Got it.
>> >>>>>>
>> >>>>>> @Andrew Lamb <[email protected]>
>> >>>>>>
>> >>>>>>
>> >>>>>>> Do you think it would be good to start moving the spec development
>> >>>>>>> into
>> >>>>>>> markdown format, in preparation for finalizing it?
>> >>>>>>>
>> >>>>>>
>> >>>>>> Yes I'll update the numbers for some of the examples I have in the
>> >>>>>> spec based
>> >>>>>> on the updated header size. Then we should be good to go for the
>> >>>>>> markdown format.
>> >>>>>>
>> >>>>>> Thanks everyone!
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Andrew
>> >>>>>>>
>> >>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>> > Hi team,
>> >>>>>>> >
>> >>>>>>> > 1) Andrew
>> >>>>>>> >
>> >>>>>>> >    - Thanks for working on test files. My PR did add all the
>> test
>> >>>>>>> files I
>> >>>>>>> >    used to benchmark on datasets. Maybe we can club it together.
>> >>>>>>> Will also
>> >>>>>>> > aid
>> >>>>>>> >    cross language testing
>> >>>>>>> >    -  Kosta Tarasov working on Rust implementation. This is
>> great.
>> >>>>>>> Thanks
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > 2) Antoine
>> >>>>>>> >
>> >>>>>>> >    - Thanks a lot for reporting the numbers on AMD. Looks like
>> you
>> >>>>>>> are
>> >>>>>>> >    getting 8X the decoding performance of BSS. This is
>> amazing!!.
>> >>>>>>> >    - Thanks for acknowledging the sampling design.
>> >>>>>>> >    - I agree with you on Fastlanes. In some crude experiments I
>> >>>>>>> didn't get
>> >>>>>>> >    a good perf benefit from it on Graviton3 (but maybe there was
>> >>>>>>> something
>> >>>>>>> >    wrong with my implementation).
>> >>>>>>> >    - Locking the 16bit exception encoding for the spec in this
>> >>>>>>> case.
>> >>>>>>> >    - Awesome I think we have solved for all open questions minus
>> >>>>>>> the
>> >>>>>>> >    version byte :). (will get back on this soon)
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > 3) Micah
>> >>>>>>> >
>> >>>>>>> >    - FastLanes : The current spec does allow for using FastLanes
>> >>>>>>> with the
>> >>>>>>> >    configurable enum value for layout. We should be able to
>> inject
>> >>>>>>> any
>> >>>>>>> > layout
>> >>>>>>> >    in the current design.
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > Working on resolving all remaining open comments on the spec
>> this
>> >>>>>>> week.
>> >>>>>>> >
>> >>>>>>> > Best
>> >>>>>>> > Prateek
>> >>>>>>> >
>> >>>>>>> >
>> >>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <
>> >>>>>>> [email protected]>
>> >>>>>>> > wrote:
>> >>>>>>> >
>> >>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <
>> >>>>>>> [email protected]>
>> >>>>>>> > > wrote:
>> >>>>>>> > >
>> >>>>>>> > > >
>> >>>>>>> > > >
>> >>>>>>> > > > It looks like the actual issue described for ORC in the
>> paper
>> >>>>>>> is that
>> >>>>>>> > it
>> >>>>>>> > > > has multiple sub-encodings in a batch.  This is different
>> than
>> >>>>>>> the
>> >>>>>>> > design
>> >>>>>>> > > > proposed here where there is still fixed encoding per page
>> in
>> >>>>>>> parquet.
>> >>>>>>> > > > Given reasonably sized pages I don't think branch
>> >>>>>>> misprediction should
>> >>>>>>> > > be a
>> >>>>>>> > > > big issue for new encodings.  I agree that we should be
>> >>>>>>> conservative in
>> >>>>>>> > > > general for adding new encodings.
>> >>>>>>> > > >
>> >>>>>>> > > >
>> >>>>>>> > > +1
>> >>>>>>> > >
>> >>>>>>> >
>> >>>>>>>
>> >>>>>>
>>
>
