Hi Curt,

> As part of the process of amending the Parquet format, perhaps it would be
> a good idea for early implementations to generate sample files and commit
> them to apache/parquet-testing: Apache Parquet Testing
> <https://github.com/apache/parquet-testing> for other implementations to
> leverage?

It got dropped in the thread but does
https://github.com/apache/parquet-testing/pull/100 address your concerns?

Thanks,
Micah

On Wed, Apr 29, 2026 at 4:20 PM Curt Hagenlocher <[email protected]> wrote:

> As part of the process of amending the Parquet format, perhaps it would be
> a good idea for early implementations to generate sample files and commit
> them to apache/parquet-testing: Apache Parquet Testing
> <https://github.com/apache/parquet-testing> for other implementations to
> leverage?
>
> -Curt
>
> On Wed, Apr 29, 2026 at 4:11 PM PRATEEK GAUR <[email protected]> wrote:
>
>> Thanks Andrew and Micah for review feedback on the two PR's
>> 1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
>> 2) (parquet-format repo)
>> https://github.com/apache/parquet-format/pull/557
>>
>> I have addressed all (unless I missed something) comments on the two PR's.
>>
>> Best
>> Prateek
>>
>> On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
>>
>> > Thanks Andrew and Micah.
>> >
>> > `fair amount of feedback on at least the implementations`
>> > For the c++ I have already started addressing the feedback, I should be
>> > done with that Monday/Tuesday.
>> > I think Vinoo too has been making good progress on the Java
>> > implementation.
>> >
>> > Best
>> > Prateek
>> >
>> > On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
>> > wrote:
>> >
>> >> Got it. Thank you for the clarification -- I will try and look into the
>> >> spec and the Rust implementation[1] in this next week
>> >>
>> >> [1]: https://github.com/apache/arrow-rs/pull/9372
>> >>
>> >> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
>> >> wrote:
>> >>
>> >>> Hi Andrew,
>> >>> I think there is a fair amount of feedback on at least the
>> >>> implementations, typically I think we've waited till they are close to
>> >>> mergeable before a final vote. Otherwise I agree we are very close.
>> >>>
>> >>> -Micah
>> >>>
>> >>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
>> >>>
>> >>>> Thanks Prateek,
>> >>>>
>> >>>> I think from this content it looks to me like we are ready to start a
>> >>>> vote to explicitly accept ALP into Parquet
>> >>>>
>> >>>> Does anyone know of a reason we should postpone it for longer?
>> >>>> Perhaps someone needs some more time to review?
>> >>>>
>> >>>> Andrew
>> >>>>
>> >>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
>> >>>> wrote:
>> >>>>
>> >>>>> Hi team,
>> >>>>>
>> >>>>> Hope everyone is doing well. I got a chance to work through all the
>> >>>>> remaining feedback and update the spec doc. Here are the new artifacts
>> >>>>>
>> >>>>> 1) Spec document :
>> >>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>> >>>>>
>> >>>>> 2) Spec document in parquet format repo :
>> >>>>> https://github.com/apache/parquet-format/pull/557
>> >>>>>
>> >>>>> 3) Alp implementation in arrow c++ repo :
>> >>>>> https://github.com/apache/arrow/pull/48345/changes
>> >>>>>
>> >>>>> 4) Alp implementation in parquet-java repo : Work for Vinoo and Julien
>> >>>>> https://github.com/apache/parquet-java/pull/3397
>> >>>>>
>> >>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
>> >>>>> https://github.com/apache/parquet-testing/pull/100
>> >>>>>
>> >>>>> And
>> >>>>>
>> >>>>> - Go : Arnav just submitted an in progress implementation in Go.
>> >>>>>   https://github.com/apache/arrow-go/pull/704 (I haven't started
>> >>>>>   looking at it yet)
>> >>>>> - Rust : I remember Andrew mentioned that this work is also in
>> >>>>>   progress (So 4 languages!)
>> >>>>>
>> >>>>> *Arrow C++ implementation*
>> >>>>>
>> >>>>> The PR is out and was also used by Antoine to report the numbers as
>> >>>>> reported here.
>> >>>>> Micah and Konstantin have given 1 round of feedback
>> >>>>> and I'm addressing them today. Please note that the default
>> >>>>> optimization flag for compiling is O2 and not O3. I got around 70%
>> >>>>> performance improvement in the decoding speed when using the O3 flag.
>> >>>>>
>> >>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
>> >>>>> Cross Language testing*
>> >>>>>
>> >>>>> Let me know if you have any questions or feedback.
>> >>>>>
>> >>>>> Now pasting some performance numbers
>> >>>>>
>> >>>>> Table 1: C++ ALP Double Decode - Spotify Columns (Graviton 3, ARM
>> >>>>> Neoverse V1)
>> >>>>>
>> >>>>> ┌──────────────────┬──────────────┬──────────────┬─────────┐
>> >>>>> │ Column           │ -O2 (MB/s)   │ -O3 (MB/s)   │ Speedup │
>> >>>>> ├──────────────────┼──────────────┼──────────────┼─────────┤
>> >>>>> │ valence          │ 3,155        │ 5,523        │ 1.75x   │
>> >>>>> │ danceability     │ 3,233        │ 5,685        │ 1.76x   │
>> >>>>> │ energy           │ 3,197        │ 5,652        │ 1.77x   │
>> >>>>> │ loudness         │ 3,186        │ 5,473        │ 1.72x   │
>> >>>>> └──────────────────┴──────────────┴──────────────┴─────────┘
>> >>>>>
>> >>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> @Micah Kornfield <[email protected]> : Got it.
>> >>>>>>
>> >>>>>> @Andrew Lamb <[email protected]>
>> >>>>>>
>> >>>>>>> Do you think it would be good to start moving the spec development into
>> >>>>>>> markdown format, in preparation for finalizing it?
>> >>>>>>>
>> >>>>>>
>> >>>>>> Yes I'll update the numbers for some of the examples I have in the
>> >>>>>> spec based on the updated header size. Then we should be good to go
>> >>>>>> for the markdown format.
>> >>>>>>
>> >>>>>> Thanks everyone!
>> >>>>>>
>> >>>>>>>
>> >>>>>>> Andrew
>> >>>>>>>
>> >>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>> > Hi team,
>> >>>>>>> >
>> >>>>>>> > 1) Andrew
>> >>>>>>> >
>> >>>>>>> > - Thanks for working on test files. My PR did add all the test files I
>> >>>>>>> >   used to benchmark on datasets. Maybe we can club it together. Will also
>> >>>>>>> >   aid cross language testing
>> >>>>>>> > - Kosta Tarasov working on Rust implementation. This is great. Thanks
>> >>>>>>> >
>> >>>>>>> > 2) Antoine
>> >>>>>>> >
>> >>>>>>> > - Thanks a lot for reporting the numbers on AMD. Looks like you are
>> >>>>>>> >   getting 8X the decoding performance of BSS. This is amazing!!
>> >>>>>>> > - Thanks for acknowledging the sampling design.
>> >>>>>>> > - I agree with you on FastLanes. In some crude experiments I didn't get
>> >>>>>>> >   a good perf benefit from it on Graviton3 (but maybe there was something
>> >>>>>>> >   wrong with my implementation).
>> >>>>>>> > - Locking the 16-bit exception encoding for the spec in this case.
>> >>>>>>> > - Awesome I think we have solved for all open questions minus the
>> >>>>>>> >   version byte :). (will get back on this soon)
>> >>>>>>> >
>> >>>>>>> > 3) Micah
>> >>>>>>> >
>> >>>>>>> > - FastLanes : The current spec does allow for using FastLanes with the
>> >>>>>>> >   configurable enum value for layout. We should be able to inject any
>> >>>>>>> >   layout in the current design.
>> >>>>>>> >
>> >>>>>>> > Working on resolving all remaining open comments on the spec this week.
>> >>>>>>> >
>> >>>>>>> > Best
>> >>>>>>> > Prateek
>> >>>>>>> >
>> >>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]>
>> >>>>>>> > wrote:
>> >>>>>>> >
>> >>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]>
>> >>>>>>> > > wrote:
>> >>>>>>> > >
>> >>>>>>> > > > It looks like the actual issue described for ORC in the paper is that it
>> >>>>>>> > > > has multiple sub-encodings in a batch. This is different than the design
>> >>>>>>> > > > proposed here where there is still fixed encoding per page in parquet.
>> >>>>>>> > > > Given reasonably sized pages I don't think branch misprediction should be a
>> >>>>>>> > > > big issue for new encodings. I agree that we should be conservative in
>> >>>>>>> > > > general for adding new encodings.
>> >>>>>>> > > >
>> >>>>>>> > > +1
>> >>>>>>> > >
>> >>>>>>> >
>> >>>>>>>
>> >>>>>>
>> >
