As part of the process of amending the Parquet format, perhaps it would be a good idea for early implementations to generate sample files and commit them to apache/parquet-testing: Apache Parquet Testing <https://github.com/apache/parquet-testing> for other implementations to leverage?
-Curt

On Wed, Apr 29, 2026 at 4:11 PM PRATEEK GAUR <[email protected]> wrote:

> Thanks Andrew and Micah for review feedback on the two PRs
> 1) (C++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
> 2) (parquet-format repo) https://github.com/apache/parquet-format/pull/557
>
> I have addressed all (unless I missed something) comments on the two PRs.
>
> Best
> Prateek
>
> On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
>
> > Thanks Andrew and Micah.
> >
> > `fair amount of feedback on at least the implementations`
> > For the C++ I have already started addressing the feedback, I should be
> > done with that Monday/Tuesday.
> > I think Vinoo too has been making good progress on the Java implementation.
> >
> > Best
> > Prateek
> >
> > On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
> > wrote:
> >
> >> Got it. Thank you for the clarification -- I will try and look into the
> >> spec and the Rust implementation[1] in this next week
> >>
> >> [1]: https://github.com/apache/arrow-rs/pull/9372
> >>
> >> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
> >> wrote:
> >>
> >>> Hi Andrew,
> >>> I think there is a fair amount of feedback on at least the
> >>> implementations, typically I think we've waited till they are close to
> >>> mergeable before a final vote. Otherwise I agree we are very close.
> >>>
> >>> -Micah
> >>>
> >>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
> >>>
> >>>> Thanks Prateek,
> >>>>
> >>>> I think from this content it looks to me like we are ready to start a
> >>>> vote to explicitly accept ALP into Parquet
> >>>>
> >>>> Does anyone know of a reason we should postpone it for longer?
> >>>> Perhaps someone needs some more time to review?
> >>>>
> >>>> Andrew
> >>>>
> >>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi team,
> >>>>>
> >>>>> Hope everyone is doing well. I got a chance to work through all the
> >>>>> remaining feedback and update the spec doc. Here are the new artifacts
> >>>>>
> >>>>> 1) Spec document :
> >>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
> >>>>>
> >>>>> 2) Spec document in parquet format repo :
> >>>>> https://github.com/apache/parquet-format/pull/557
> >>>>>
> >>>>> 3) ALP implementation in arrow C++ repo :
> >>>>> https://github.com/apache/arrow/pull/48345/changes
> >>>>>
> >>>>> 4) ALP implementation in parquet-java repo : Work for Vinoo and Julien
> >>>>> https://github.com/apache/parquet-java/pull/3397
> >>>>>
> >>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
> >>>>> https://github.com/apache/parquet-testing/pull/100
> >>>>>
> >>>>> And
> >>>>>
> >>>>>    - Go : Arnav just submitted an in progress implementation in Go.
> >>>>>    https://github.com/apache/arrow-go/pull/704 (I haven't started
> >>>>>    looking at it yet)
> >>>>>    - Rust : I remember Andrew mentioned that this work is also in
> >>>>>    progress (So 4 languages!)
> >>>>>
> >>>>> *Arrow C++ implementation*
> >>>>>
> >>>>> The PR is out and was also used by Antoine to report the numbers as
> >>>>> reported here. Micah and Konstantin have given 1 round of feedback
> >>>>> and I'm addressing them today. Please note that the default
> >>>>> optimization flag for compiling is O2 and not O3. I got around 70%
> >>>>> performance improvement in the decoding speed when using the O3 flag.
> >>>>>
> >>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
> >>>>> Cross Language testing*
> >>>>>
> >>>>> Let me know if you have any questions or feedback.
> >>>>>
> >>>>> Now pasting some performance numbers
> >>>>>
> >>>>> Table 1: C++ ALP Double Decode - Spotify Columns (Graviton 3, ARM
> >>>>> Neoverse V1)
> >>>>>
> >>>>> ┌──────────────────┬──────────────┬──────────────┬─────────┐
> >>>>> │ Column           │ -O2 (MB/s)   │ -O3 (MB/s)   │ Speedup │
> >>>>> ├──────────────────┼──────────────┼──────────────┼─────────┤
> >>>>> │ valence          │ 3,155        │ 5,523        │ 1.75x   │
> >>>>> │ danceability     │ 3,233        │ 5,685        │ 1.76x   │
> >>>>> │ energy           │ 3,197        │ 5,652        │ 1.77x   │
> >>>>> │ loudness         │ 3,186        │ 5,473        │ 1.72x   │
> >>>>> └──────────────────┴──────────────┴──────────────┴─────────┘
> >>>>>
> >>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> @Micah Kornfield <[email protected]> : Got it.
> >>>>>>
> >>>>>> @Andrew Lamb <[email protected]>
> >>>>>>
> >>>>>>> Do you think it would be good to start moving the spec development into
> >>>>>>> markdown format, in preparation for finalizing it?
> >>>>>>>
> >>>>>>
> >>>>>> Yes I'll update the numbers for some of the examples I have in the
> >>>>>> spec based on the updated header size. Then we should be good to go
> >>>>>> for the markdown format.
> >>>>>>
> >>>>>> Thanks everyone!
> >>>>>>
> >>>>>>>
> >>>>>>> Andrew
> >>>>>>>
> >>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> > Hi team,
> >>>>>>> >
> >>>>>>> > 1) Andrew
> >>>>>>> >
> >>>>>>> >    - Thanks for working on test files. My PR did add all the test files I
> >>>>>>> >    used to benchmark on datasets. Maybe we can club it together. Will also
> >>>>>>> >    aid cross language testing
> >>>>>>> >    - Kosta Tarasov working on Rust implementation. This is great.
> >>>>>>> >    Thanks
> >>>>>>> >
> >>>>>>> > 2) Antoine
> >>>>>>> >
> >>>>>>> >    - Thanks a lot for reporting the numbers on AMD. Looks like you are
> >>>>>>> >    getting 8X the decoding performance of BSS. This is amazing!!
> >>>>>>> >    - Thanks for acknowledging the sampling design.
> >>>>>>> >    - I agree with you on FastLanes. In some crude experiments I didn't get
> >>>>>>> >    a good perf benefit from it on Graviton3 (but maybe there was something
> >>>>>>> >    wrong with my implementation).
> >>>>>>> >    - Locking the 16-bit exception encoding for the spec in this case.
> >>>>>>> >    - Awesome I think we have solved for all open questions minus the
> >>>>>>> >    version byte :). (will get back on this soon)
> >>>>>>> >
> >>>>>>> > 3) Micah
> >>>>>>> >
> >>>>>>> >    - FastLanes : The current spec does allow for using FastLanes with the
> >>>>>>> >    configurable enum value for layout. We should be able to inject any
> >>>>>>> >    layout in the current design.
> >>>>>>> >
> >>>>>>> > Working on resolving all remaining open comments on the spec this week.
> >>>>>>> >
> >>>>>>> > Best
> >>>>>>> > Prateek
> >>>>>>> >
> >>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]>
> >>>>>>> > wrote:
> >>>>>>> >
> >>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]>
> >>>>>>> > > wrote:
> >>>>>>> > >
> >>>>>>> > > > It looks like the actual issue described for ORC in the paper is that it
> >>>>>>> > > > has multiple sub-encodings in a batch. This is different than the design
> >>>>>>> > > > proposed here where there is still fixed encoding per page in parquet.
> >>>>>>> > > > Given reasonably sized pages I don't think branch misprediction should be a
> >>>>>>> > > > big issue for new encodings. I agree that we should be conservative in
> >>>>>>> > > > general for adding new encodings.
> >>>>>>> > > >
> >>>>>>> > > +1
