I also just filed a ticket[1] to track adding the example files and linked it around to try and give it a bit more visibility.
[1]: https://github.com/apache/parquet-testing/issues/105

On Wed, Apr 29, 2026 at 7:25 PM Curt Hagenlocher <[email protected]> wrote:

> Ah, thanks! I missed that.
>
> On Wed, Apr 29, 2026 at 4:24 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Curt,
>>
>>> As part of the process of amending the Parquet format, perhaps it would
>>> be a good idea for early implementations to generate sample files and
>>> commit them to apache/parquet-testing: Apache Parquet Testing
>>> <https://github.com/apache/parquet-testing> for other implementations
>>> to leverage?
>>
>> It got dropped in the thread but does
>> https://github.com/apache/parquet-testing/pull/100 address your concerns?
>>
>> Thanks,
>> Micah
>>
>> On Wed, Apr 29, 2026 at 4:20 PM Curt Hagenlocher <[email protected]> wrote:
>>
>>> As part of the process of amending the Parquet format, perhaps it would
>>> be a good idea for early implementations to generate sample files and
>>> commit them to apache/parquet-testing: Apache Parquet Testing
>>> <https://github.com/apache/parquet-testing> for other implementations
>>> to leverage?
>>>
>>> -Curt
>>>
>>> On Wed, Apr 29, 2026 at 4:11 PM PRATEEK GAUR <[email protected]> wrote:
>>>
>>>> Thanks Andrew and Micah for review feedback on the two PR's
>>>> 1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
>>>> 2) (parquet-format repo)
>>>> https://github.com/apache/parquet-format/pull/557
>>>>
>>>> I have addressed all (unless I missed something) comments on the two
>>>> PR's.
>>>>
>>>> Best
>>>> Prateek
>>>>
>>>> On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
>>>>
>>>> > Thanks Andrew and Micah.
>>>> >
>>>> > `fair amount of feedback on at least the implementations`
>>>> > For the c++ I have already started addressing the feedback, I should be
>>>> > done with that Monday/Tuesday.
>>>> > I think Vinoo too has been making good progress on the Java implementation.
>>>> >
>>>> > Best
>>>> > Prateek
>>>> >
>>>> > On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]> wrote:
>>>> >
>>>> >> Got it. Thank you for the clarification -- I will try and look into the
>>>> >> spec and the Rust implementation[1] in this next week
>>>> >>
>>>> >> [1]: https://github.com/apache/arrow-rs/pull/9372
>>>> >>
>>>> >> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]> wrote:
>>>> >>
>>>> >>> Hi Andrew,
>>>> >>> I think there is a fair amount of feedback on at least the
>>>> >>> implementations, typically I think we've waited till they are close to
>>>> >>> mergeable before a final vote. Otherwise I agree we are very close.
>>>> >>>
>>>> >>> -Micah
>>>> >>>
>>>> >>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
>>>> >>>
>>>> >>>> Thanks Prateek,
>>>> >>>>
>>>> >>>> I think from this content it looks to me like we are ready to start a
>>>> >>>> vote to explicitly accept ALP into Parquet
>>>> >>>>
>>>> >>>> Does anyone know of a reason we should postpone it for longer?
>>>> >>>> Perhaps someone needs some more time to review?
>>>> >>>>
>>>> >>>> Andrew
>>>> >>>>
>>>> >>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]> wrote:
>>>> >>>>
>>>> >>>>> Hi team,
>>>> >>>>>
>>>> >>>>> Hope everyone is doing well. I got a chance to work through all the
>>>> >>>>> remaining feedback and update the spec doc.
>>>> >>>>> Here are the new artifacts
>>>> >>>>>
>>>> >>>>> 1) Spec document :
>>>> >>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>>>> >>>>>
>>>> >>>>> 2) Spec document in parquet format repo :
>>>> >>>>> https://github.com/apache/parquet-format/pull/557
>>>> >>>>>
>>>> >>>>> 3) ALP implementation in arrow c++ repo :
>>>> >>>>> https://github.com/apache/arrow/pull/48345/changes
>>>> >>>>>
>>>> >>>>> 4) ALP implementation in parquet-java repo : Work for Vinoo and Julien
>>>> >>>>> https://github.com/apache/parquet-java/pull/3397
>>>> >>>>>
>>>> >>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
>>>> >>>>> https://github.com/apache/parquet-testing/pull/100
>>>> >>>>>
>>>> >>>>> And
>>>> >>>>>
>>>> >>>>> - Go : Arnav just submitted an in-progress implementation in Go.
>>>> >>>>>   https://github.com/apache/arrow-go/pull/704 (I haven't started
>>>> >>>>>   looking at it yet)
>>>> >>>>> - Rust : I remember Andrew mentioned that this work is also in
>>>> >>>>>   progress (So 4 languages!)
>>>> >>>>>
>>>> >>>>> *Arrow C++ implementation*
>>>> >>>>>
>>>> >>>>> The PR is out and was also used by Antoine to report the numbers as
>>>> >>>>> reported here. Micah and Konstantin have given 1 round of feedback
>>>> >>>>> and I'm addressing it today. Please note that the default
>>>> >>>>> optimization flag for compiling is O2 and not O3. I got around 70%
>>>> >>>>> performance improvement in the decoding speed when using the O3 flag.
>>>> >>>>>
>>>> >>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
>>>> >>>>> Cross Language testing*
>>>> >>>>>
>>>> >>>>> Let me know if you have any questions or feedback.
>>>> >>>>>
>>>> >>>>> Now pasting some performance numbers
>>>> >>>>>
>>>> >>>>> Table 1: C++ ALP Double Decode - Spotify Columns (Graviton 3, ARM
>>>> >>>>> Neoverse V1)
>>>> >>>>>
>>>> >>>>> ┌──────────────────┬──────────────┬──────────────┬─────────┐
>>>> >>>>> │ Column           │ -O2 (MB/s)   │ -O3 (MB/s)   │ Speedup │
>>>> >>>>> ├──────────────────┼──────────────┼──────────────┼─────────┤
>>>> >>>>> │ valence          │ 3,155        │ 5,523        │ 1.75x   │
>>>> >>>>> │ danceability     │ 3,233        │ 5,685        │ 1.76x   │
>>>> >>>>> │ energy           │ 3,197        │ 5,652        │ 1.77x   │
>>>> >>>>> │ loudness         │ 3,186        │ 5,473        │ 1.72x   │
>>>> >>>>> └──────────────────┴──────────────┴──────────────┴─────────┘
>>>> >>>>>
>>>> >>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>>> @Micah Kornfield <[email protected]> : Got it.
>>>> >>>>>>
>>>> >>>>>> @Andrew Lamb <[email protected]>
>>>> >>>>>>
>>>> >>>>>>> Do you think it would be good to start moving the spec development
>>>> >>>>>>> into markdown format, in preparation for finalizing it?
>>>> >>>>>>
>>>> >>>>>> Yes I'll update the numbers for some of the examples I have in the
>>>> >>>>>> spec based on the updated header size. Then we should be good to go
>>>> >>>>>> for the markdown format.
>>>> >>>>>>
>>>> >>>>>> Thanks everyone!
>>>> >>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Andrew
>>>> >>>>>>>
>>>> >>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]> wrote:
>>>> >>>>>>>
>>>> >>>>>>> > Hi team,
>>>> >>>>>>> >
>>>> >>>>>>> > 1) Andrew
>>>> >>>>>>> >
>>>> >>>>>>> > - Thanks for working on test files. My PR did add all the test files I
>>>> >>>>>>> >   used to benchmark on datasets. Maybe we can club it together.
>>>> >>>>>>> >   Will also aid cross-language testing
>>>> >>>>>>> > - Kosta Tarasov working on Rust implementation. This is great. Thanks
>>>> >>>>>>> >
>>>> >>>>>>> > 2) Antoine
>>>> >>>>>>> >
>>>> >>>>>>> > - Thanks a lot for reporting the numbers on AMD. Looks like you are
>>>> >>>>>>> >   getting 8X the decoding performance of BSS. This is amazing!!
>>>> >>>>>>> > - Thanks for acknowledging the sampling design.
>>>> >>>>>>> > - I agree with you on FastLanes. In some crude experiments I didn't get
>>>> >>>>>>> >   a good perf benefit from it on Graviton3 (but maybe there was something
>>>> >>>>>>> >   wrong with my implementation).
>>>> >>>>>>> > - Locking the 16-bit exception encoding for the spec in this case.
>>>> >>>>>>> > - Awesome I think we have solved for all open questions minus the
>>>> >>>>>>> >   version byte :). (will get back on this soon)
>>>> >>>>>>> >
>>>> >>>>>>> > 3) Micah
>>>> >>>>>>> >
>>>> >>>>>>> > - FastLanes : The current spec does allow for using FastLanes with the
>>>> >>>>>>> >   configurable enum value for layout. We should be able to inject any
>>>> >>>>>>> >   layout in the current design.
>>>> >>>>>>> >
>>>> >>>>>>> > Working on resolving all remaining open comments on the spec this week.
>>>> >>>>>>> >
>>>> >>>>>>> > Best
>>>> >>>>>>> > Prateek
>>>> >>>>>>> >
>>>> >>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]> wrote:
>>>> >>>>>>> >
>>>> >>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]> wrote:
>>>> >>>>>>> > >
>>>> >>>>>>> > > > It looks like the actual issue described for ORC in the paper is that
>>>> >>>>>>> > > > it has multiple sub-encodings in a batch. This is different than the
>>>> >>>>>>> > > > design proposed here, where there is still a fixed encoding per page
>>>> >>>>>>> > > > in Parquet. Given reasonably sized pages I don't think branch
>>>> >>>>>>> > > > misprediction should be a big issue for new encodings. I agree that
>>>> >>>>>>> > > > we should be conservative in general for adding new encodings.
>>>> >>>>>>> > > >
>>>> >>>>>>> > > +1
