Ah, thanks! I missed that.

On Wed, Apr 29, 2026 at 4:24 PM Micah Kornfield <[email protected]> wrote:
> Hi Curt,
>
>> As part of the process of amending the Parquet format, perhaps it would
>> be a good idea for early implementations to generate sample files and
>> commit them to apache/parquet-testing: Apache Parquet Testing
>> <https://github.com/apache/parquet-testing> for other implementations to
>> leverage?
>
> It got dropped in the thread but does
> https://github.com/apache/parquet-testing/pull/100 address your concerns?
>
> Thanks,
> Micah
>
> On Wed, Apr 29, 2026 at 4:20 PM Curt Hagenlocher <[email protected]> wrote:
>
>> As part of the process of amending the Parquet format, perhaps it would
>> be a good idea for early implementations to generate sample files and
>> commit them to apache/parquet-testing: Apache Parquet Testing
>> <https://github.com/apache/parquet-testing> for other implementations to
>> leverage?
>>
>> -Curt
>>
>> On Wed, Apr 29, 2026 at 4:11 PM PRATEEK GAUR <[email protected]> wrote:
>>
>>> Thanks Andrew and Micah for the review feedback on the two PRs:
>>> 1) (C++ Arrow repo) https://github.com/apache/arrow/pull/48345/changes
>>> 2) (parquet-format repo) https://github.com/apache/parquet-format/pull/557
>>>
>>> I have addressed all (unless I missed something) comments on the two PRs.
>>>
>>> Best
>>> Prateek
>>>
>>> On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
>>>
>>> > Thanks Andrew and Micah.
>>> >
>>> > `fair amount of feedback on at least the implementations`
>>> > For the C++ implementation I have already started addressing the
>>> > feedback; I should be done with that Monday/Tuesday.
>>> > I think Vinoo too has been making good progress on the Java implementation.
>>> >
>>> > Best
>>> > Prateek
>>> >
>>> > On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]> wrote:
>>> >
>>> >> Got it.
>>> >> Thank you for the clarification -- I will try and look into the
>>> >> spec and the Rust implementation[1] in the next week
>>> >>
>>> >> [1]: https://github.com/apache/arrow-rs/pull/9372
>>> >>
>>> >> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]> wrote:
>>> >>
>>> >>> Hi Andrew,
>>> >>> I think there is a fair amount of feedback on at least the
>>> >>> implementations; typically I think we've waited till they are close to
>>> >>> mergeable before a final vote. Otherwise I agree we are very close.
>>> >>>
>>> >>> -Micah
>>> >>>
>>> >>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
>>> >>>
>>> >>>> Thanks Prateek,
>>> >>>>
>>> >>>> I think from this content it looks to me like we are ready to start a
>>> >>>> vote to explicitly accept ALP into Parquet
>>> >>>>
>>> >>>> Does anyone know of a reason we should postpone it for longer?
>>> >>>> Perhaps someone needs some more time to review?
>>> >>>>
>>> >>>> Andrew
>>> >>>>
>>> >>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]> wrote:
>>> >>>>
>>> >>>>> Hi team,
>>> >>>>>
>>> >>>>> Hope everyone is doing well. I got a chance to work through all the
>>> >>>>> remaining feedback and update the spec doc.
>>> >>>>> Here are the new artifacts
>>> >>>>>
>>> >>>>> 1) Spec document:
>>> >>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>>> >>>>>
>>> >>>>> 2) Spec document in parquet-format repo:
>>> >>>>> https://github.com/apache/parquet-format/pull/557
>>> >>>>>
>>> >>>>> 3) ALP implementation in Arrow C++ repo:
>>> >>>>> https://github.com/apache/arrow/pull/48345/changes
>>> >>>>>
>>> >>>>> 4) ALP implementation in parquet-java repo (work by Vinoo and Julien):
>>> >>>>> https://github.com/apache/parquet-java/pull/3397
>>> >>>>>
>>> >>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo:
>>> >>>>> https://github.com/apache/parquet-testing/pull/100
>>> >>>>>
>>> >>>>> And
>>> >>>>>
>>> >>>>> - Go: Arnav just submitted an in-progress implementation in Go.
>>> >>>>>   https://github.com/apache/arrow-go/pull/704 (I haven't started
>>> >>>>>   looking at it yet)
>>> >>>>> - Rust: I remember Andrew mentioned that this work is also in
>>> >>>>>   progress (So 4 languages!)
>>> >>>>>
>>> >>>>> *Arrow C++ implementation*
>>> >>>>>
>>> >>>>> The PR is out and was also used by Antoine to report the numbers as
>>> >>>>> reported here. Micah and Konstantin have given one round of feedback
>>> >>>>> and I'm addressing it today. Please note that the default
>>> >>>>> optimization flag for compiling is O2 and not O3. I got around 70%
>>> >>>>> performance improvement in the decoding speed when using the O3 flag.
>>> >>>>>
>>> >>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
>>> >>>>> cross-language testing*
>>> >>>>>
>>> >>>>> Let me know if you have any questions or feedback.
>>> >>>>>
>>> >>>>> Now pasting some performance numbers
>>> >>>>>
>>> >>>>> Table 1: C++ ALP Double Decode - Spotify Columns (Graviton 3, ARM
>>> >>>>> Neoverse V1)
>>> >>>>>
>>> >>>>> ┌──────────────────┬──────────────┬──────────────┬─────────┐
>>> >>>>> │ Column           │ -O2 (MB/s)   │ -O3 (MB/s)   │ Speedup │
>>> >>>>> ├──────────────────┼──────────────┼──────────────┼─────────┤
>>> >>>>> │ valence          │        3,155 │        5,523 │   1.75x │
>>> >>>>> │ danceability     │        3,233 │        5,685 │   1.76x │
>>> >>>>> │ energy           │        3,197 │        5,652 │   1.77x │
>>> >>>>> │ loudness         │        3,186 │        5,473 │   1.72x │
>>> >>>>> └──────────────────┴──────────────┴──────────────┴─────────┘
>>> >>>>>
>>> >>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> @Micah Kornfield <[email protected]> : Got it.
>>> >>>>>>
>>> >>>>>> @Andrew Lamb <[email protected]>
>>> >>>>>>
>>> >>>>>>> Do you think it would be good to start moving the spec development
>>> >>>>>>> into markdown format, in preparation for finalizing it?
>>> >>>>>>
>>> >>>>>> Yes I'll update the numbers for some of the examples I have in the
>>> >>>>>> spec based on the updated header size. Then we should be good to go
>>> >>>>>> for the markdown format.
>>> >>>>>>
>>> >>>>>> Thanks everyone!
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>> Andrew
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>> > Hi team,
>>> >>>>>>> >
>>> >>>>>>> > 1) Andrew
>>> >>>>>>> >
>>> >>>>>>> >    - Thanks for working on test files. My PR did add all the test
>>> >>>>>>> >      files I used to benchmark on datasets. Maybe we can club it
>>> >>>>>>> >      together.
>>> >>>>>>> >      Will also aid cross-language testing.
>>> >>>>>>> >    - Kosta Tarasov is working on the Rust implementation. This is
>>> >>>>>>> >      great. Thanks
>>> >>>>>>> >
>>> >>>>>>> > 2) Antoine
>>> >>>>>>> >
>>> >>>>>>> >    - Thanks a lot for reporting the numbers on AMD. Looks like you
>>> >>>>>>> >      are getting 8X the decoding performance of BSS. This is
>>> >>>>>>> >      amazing!!
>>> >>>>>>> >    - Thanks for acknowledging the sampling design.
>>> >>>>>>> >    - I agree with you on FastLanes. In some crude experiments I
>>> >>>>>>> >      didn't get a good perf benefit from it on Graviton3 (but maybe
>>> >>>>>>> >      there was something wrong with my implementation).
>>> >>>>>>> >    - Locking the 16-bit exception encoding for the spec in this
>>> >>>>>>> >      case.
>>> >>>>>>> >    - Awesome, I think we have solved for all open questions minus
>>> >>>>>>> >      the version byte :). (will get back on this soon)
>>> >>>>>>> >
>>> >>>>>>> > 3) Micah
>>> >>>>>>> >
>>> >>>>>>> >    - FastLanes: The current spec does allow for using FastLanes
>>> >>>>>>> >      with the configurable enum value for layout. We should be able
>>> >>>>>>> >      to inject any layout in the current design.
>>> >>>>>>> >
>>> >>>>>>> > Working on resolving all remaining open comments on the spec this
>>> >>>>>>> > week.
>>> >>>>>>> >
>>> >>>>>>> > Best
>>> >>>>>>> > Prateek
>>> >>>>>>> >
>>> >>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]> wrote:
>>> >>>>>>> >
>>> >>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]> wrote:
>>> >>>>>>> > >
>>> >>>>>>> > > >
>>> >>>>>>> > > > It looks like the actual issue described for ORC in the paper
>>> >>>>>>> > > > is that it has multiple sub-encodings in a batch.
>>> >>>>>>> > > > This is different than the design proposed here, where there
>>> >>>>>>> > > > is still a fixed encoding per page in Parquet.
>>> >>>>>>> > > > Given reasonably sized pages I don't think branch
>>> >>>>>>> > > > misprediction should be a big issue for new encodings. I agree
>>> >>>>>>> > > > that we should be conservative in general for adding new
>>> >>>>>>> > > > encodings.
>>> >>>>>>> > > >
>>> >>>>>>> > > +1
