On Sat, Nov 22, 2025 at 4:49 AM Steve Loughran <[email protected]> wrote:
> First, sorry: I think I accidentally marked as done the comment in the doc
> about x86 performance.
>

No worries, I restored the thread :). Those x86 numbers are critical,
especially AVX512 on a recent Intel part.

> There's a notorious feature in the early ones where the cores would reduce
> frequency after you used the opcodes, as a way of managing die temperature
> (https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency);
> the later ones and AMD models are the ones to worry about.
>

We did collect performance numbers in our early prototype and they looked
good on x86 hardware, though I didn't check the processor family. In our
Arrow implementation we are also working on a comprehensive benchmarking
script, which will help everyone run it on different CPU families to get a
good idea of performance.

Best
Prateek

> On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev <[email protected]>
> wrote:
>
>> Hi team,
>>
>> *ALP ---> ALP PseudoDecimal*
>>
>> As the numbers above show, and as the paper itself states, it is very
>> difficult to get a good compression ratio on real double values, i.e.
>> values with many digits of precision.
>>
>> That, combined with the fact that we want to keep the spec/implementation
>> simpler, quoting Antoine directly here:
>>
>> `2. Do not include the ALPrd fallback which is a homegrown dictionary
>> encoding without dictionary reuse across pages, and instead rely on a
>> well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)`
>>
>> and with discussions I had with Julien in person and with a number of you
>> in the biweekly meeting, leads us here:
>>
>> We'll be going with ALPpd (pseudo decimal) as the first implementation,
>> relying on the query engine and its own heuristics to decide on the right
>> fallback to BYTE_STREAM_SPLIT or ZSTD.
>>
>> Best
>> Prateek
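For readers following along, here is a minimal sketch of the ALPpd
(pseudo-decimal) idea being adopted above. It is illustrative Rust, not the
proposed spec nor any of the implementations linked later in the thread: it
assumes a single caller-supplied exponent `e`, whereas the paper picks an
exponent/factor pair per vector by sampling. Values that do not round-trip
exactly (including NaN and +/-Inf) fall out as (position, value) exceptions,
which is the "Exception" mechanism discussed further down.

    fn alppd_encode(values: &[f64], e: i32) -> (Vec<i64>, Vec<(usize, f64)>) {
        let scale = 10f64.powi(e);
        let mut encoded = Vec::with_capacity(values.len());
        let mut exceptions = Vec::new();
        for (i, &v) in values.iter().enumerate() {
            let d = (v * scale).round();
            // Exact round-trip test; NaN, +/-Inf and out-of-range values fail it.
            if d.is_finite() && d.abs() < 9.0e18 && (d as i64) as f64 / scale == v {
                encoded.push(d as i64);
            } else {
                encoded.push(0); // placeholder, patched from the exception list
                exceptions.push((i, v));
            }
        }
        (encoded, exceptions)
    }

    fn alppd_decode(encoded: &[i64], exceptions: &[(usize, f64)], e: i32) -> Vec<f64> {
        let scale = 10f64.powi(e);
        let mut out: Vec<f64> = encoded.iter().map(|&d| d as f64 / scale).collect();
        for &(i, v) in exceptions {
            out[i] = v; // restore values the pseudo-decimal step could not represent
        }
        out
    }

The real encoders additionally bit-pack the resulting integers and choose the
exponent so the exception rate stays low; the fallback discussed above is only
needed when no exponent gives an acceptable exception rate.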
>> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <[email protected]> wrote:
>>
>> > Sheet with numbers
>> > <https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517>.
>> >
>> > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR <[email protected]> wrote:
>> >
>> >> Hi team,
>> >>
>> >> There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
>> >> remember correctly, to run the experiment, for completeness, on the
>> >> datasets from the papers that discussed BYTE_STREAM_SPLIT.
>> >> I wanted to share the numbers for the same in this sheet. At this point
>> >> we have numbers on a wide variety of data.
>> >> (I will have to share the sheet from my Snowflake account, as our
>> >> laptops have a fair bit of restriction with respect to copy/paste
>> >> permissions :))
>> >>
>> >> Best
>> >> Prateek
>> >>
>> >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]> wrote:
>> >>
>> >>> Hi Julien,
>> >>>
>> >>> Yes, based on
>> >>>
>> >>>    - the numbers presented,
>> >>>    - the discussions over the doc, and
>> >>>    - multiple discussions in the biweekly meeting,
>> >>>
>> >>> we are at a stage where we agree this is the right encoding to add,
>> >>> and we can move from the DISCUSS stage to the DRAFT/POC stage.
>> >>> Will start working on the PR for the same.
>> >>>
>> >>> Thanks for bringing this up.
>> >>> Prateek
>> >>>
>> >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]> wrote:
>> >>>
>> >>>> @PRATEEK GAUR <[email protected]>: Would you agree that we are past
>> >>>> the DISCUSS step and into the DRAFT/POC phase according to the
>> >>>> proposals process
>> >>>> <https://github.com/apache/parquet-format/tree/master/proposals>?
>> >>>> If yes, could you open a PR on this page to add this proposal to the
>> >>>> list? https://github.com/apache/parquet-format/tree/master/proposals
>> >>>> Thank you!
>> >>>>
>> >>>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <[email protected]> wrote:
>> >>>>
>> >>>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
>> >>>> > Rust Parquet reader, if anyone is interested.
>> >>>> >
>> >>>> > Andrew
>> >>>> >
>> >>>> > [1]: https://github.com/apache/arrow-rs/issues/8748
>> >>>> >
>> >>>> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <[email protected]> wrote:
>> >>>> >
>> >>>> > > > C++, Java and Rust support them for sure. I feel like we should
>> >>>> > > > probably default to V2 at some point.
>> >>>> > >
>> >>>> > > I seem to recall some of the vectorized Java readers (Iceberg,
>> >>>> > > Spark) might not support V2 data pages (but I might be confusing
>> >>>> > > this with encodings). This is only a vague recollection, though.
>> >>>> > >
>> >>>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]> wrote:
>> >>>> > >
>> >>>> > > > > Someone has to add V2 data pages to
>> >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>> >>>> > > > > :)
>> >>>> > > >
>> >>>> > > > Your wish is my command:
>> >>>> > > > https://github.com/apache/parquet-site/pull/124
>> >>>> > > >
>> >>>> > > > As the format grows in popularity and momentum builds to evolve,
>> >>>> > > > I feel the content on the parquet.apache.org site could use
>> >>>> > > > refreshing / updating. So, while I had the site open, I made
>> >>>> > > > some other PRs to scratch various itches
>> >>>> > > >
>> >>>> > > > (I am absolutely 🎣 for someone to please review 🙏):
>> >>>> > > >
>> >>>> > > > 1. Add Variant/Geometry/Geography types to implementation status
>> >>>> > > > matrix: https://github.com/apache/parquet-site/pull/123
>> >>>> > > > 2. Improve introduction / overview, add more links to spec and
>> >>>> > > > implementation status: https://github.com/apache/parquet-site/pull/125
>> >>>> > > >
>> >>>> > > > Thanks,
>> >>>> > > > Andrew
>> >>>> > > >
>> >>>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]> wrote:
>> >>>> > > >
>> >>>> > > > > Hi Julien, hi all,
>> >>>> > > > >
>> >>>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
>> >>>> > > > > Julien Le Dem <[email protected]> wrote:
>> >>>> > > > > >
>> >>>> > > > > > Another question from me:
>> >>>> > > > > >
>> >>>> > > > > > Since the goal is to not use compression at all in this case
>> >>>> > > > > > (no ZSTD), I'm assuming we would be using either:
>> >>>> > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetaData.codec
>> >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
>> >>>> > > > > > field, or
>> >>>> > > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
>> >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
>> >>>> > > > > > field.
>> >>>> > > > > > The second would help decide if we can selectively compress
>> >>>> > > > > > some pages if they are less compressed by the encoding.
>> >>>> > > > > > A few years ago there was a question on the support of
>> >>>> > > > > > DATA_PAGE_V2, and I was curious to hear a refresh on how
>> >>>> > > > > > that's generally supported in Parquet implementations. The
>> >>>> > > > > > is_compressed field was intended exactly to avoid block
>> >>>> > > > > > compression when the encoding itself is good enough.
>> >>>> > > > >
>> >>>> > > > > Someone has to add V2 data pages to
>> >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>> >>>> > > > > :)
>> >>>> > > > >
>> >>>> > > > > C++, Java and Rust support them for sure. I feel like we should
>> >>>> > > > > probably default to V2 at some point.
>> >>>> > > > >
>> >>>> > > > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
>> >>>> > > > >
>> >>>> > > > > Regards
>> >>>> > > > >
>> >>>> > > > > Antoine.
>> >>>> > > > >
>> >>>> > > > > > Julien
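Julien's point above can be made mechanical at write time. Here is a minimal
sketch of the per-page decision that DataPageHeaderV2.is_compressed enables;
the PageV2 struct and finish_page function are hypothetical names for
illustration, not the API of any existing Parquet implementation.

    // Decide per page whether block compression is worth it. The column-level
    // codec stays set; is_compressed records the choice made for this page.
    struct PageV2 {
        data: Vec<u8>,
        is_compressed: bool, // written into DataPageHeaderV2.is_compressed
    }

    fn finish_page(encoded: Vec<u8>, codec: impl Fn(&[u8]) -> Vec<u8>) -> PageV2 {
        let compressed = codec(encoded.as_slice());
        if compressed.len() < encoded.len() {
            PageV2 { data: compressed, is_compressed: true }
        } else {
            // The encoding alone is good enough; skip block compression
            // for this page without touching the column-level codec.
            PageV2 { data: encoded, is_compressed: false }
        }
    }

With Data Page V1 the same choice can only be made for a whole column chunk,
via UNCOMPRESSED in ColumnMetaData.codec.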
>> >>>> > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <[email protected]> wrote:
>> >>>> > > > > >
>> >>>> > > > > > > Thanks again Prateek and co for pushing this along!
>> >>>> > > > > > >
>> >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
>> >>>> > > > > > > > implementations know exactly how to encode and represent data
>> >>>> > > > > > >
>> >>>> > > > > > > 100% agree with this (similar to what was done for ParquetVariant).
>> >>>> > > > > > >
>> >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to
>> >>>> > > > > > > > mention non-finite values (such as +/-Inf and NaNs).
>> >>>> > > > > > >
>> >>>> > > > > > > I think they are handled via the "Exception" mechanism.
>> >>>> > > > > > > Vortex's ALP implementation (below) does appear to handle
>> >>>> > > > > > > non-finite values[2].
>> >>>> > > > > > >
>> >>>> > > > > > > > 3. It seems there is a single implementation, which is the one
>> >>>> > > > > > > > published together with the paper. It is not obvious that it
>> >>>> > > > > > > > will be maintained in the future, and reusing it is probably
>> >>>> > > > > > > > not an option for non-C++ Parquet implementations
>> >>>> > > > > > >
>> >>>> > > > > > > My understanding from the call was that Prateek and team
>> >>>> > > > > > > re-implemented ALP (did not use the implementation from
>> >>>> > > > > > > CWI[3]), but that would be good to confirm.
>> >>>> > > > > > >
>> >>>> > > > > > > There is also a Rust implementation of ALP[1] that is part of
>> >>>> > > > > > > the Vortex file format implementation. I have not reviewed it
>> >>>> > > > > > > to see if it deviates from the algorithm presented in the paper.
>> >>>> > > > > > >
>> >>>> > > > > > > Andrew
>> >>>> > > > > > >
>> >>>> > > > > > > [1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
>> >>>> > > > > > > [2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
>> >>>> > > > > > > [3]: https://github.com/cwida/ALP
>> >>>> > > > > > >
>> >>>> > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:
>> >>>> > > > > > >
>> >>>> > > > > > > > Hello,
>> >>>> > > > > > > >
>> >>>> > > > > > > > Thanks for doing this, and I agree the numbers look impressive.
>> >>>> > > > > > > >
>> >>>> > > > > > > > I would ask, if possible, for more data points:
>> >>>> > > > > > > >
>> >>>> > > > > > > > 1. More datasets: you could for example look at the datasets
>> >>>> > > > > > > > that were used to originally evaluate BYTE_STREAM_SPLIT (see
>> >>>> > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and
>> >>>> > > > > > > > specifically the Google Doc linked there)
>> >>>> > > > > > > >
>> >>>> > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
>> >>>> > > > > > > >
>> >>>> > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect that
>> >>>> > > > > > > > ALP will remain very good there as well
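Since BYTE_STREAM_SPLIT is both the comparison baseline requested here and the
fallback chosen at the top of the thread, a quick sketch of what it does may
help: it is a pure byte transposition with no compression of its own, so the
gains come entirely from the general-purpose codec applied afterwards. This is
illustrative Rust that follows the spirit of the existing Parquet encoding;
the format spec defines the exact layout.

    // BYTE_STREAM_SPLIT for f64: gather byte k of every value into stream k,
    // so the often low-entropy sign/exponent bytes end up adjacent and
    // LZ4/ZSTD can find longer matches.
    fn byte_stream_split(values: &[f64]) -> Vec<u8> {
        let n = values.len();
        let mut out = vec![0u8; n * 8];
        for (i, v) in values.iter().enumerate() {
            for (k, b) in v.to_le_bytes().iter().enumerate() {
                out[k * n + i] = *b; // stream k holds byte k of every value
            }
        }
        out
    }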
>> >>>> > > > > > > > I also have the following reservations towards ALP:
>> >>>> > > > > > > >
>> >>>> > > > > > > > 1. There is no published official spec AFAICT, just a research paper.
>> >>>> > > > > > > >
>> >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to
>> >>>> > > > > > > > mention non-finite values (such as +/-Inf and NaNs).
>> >>>> > > > > > > >
>> >>>> > > > > > > > 3. It seems there is a single implementation, which is the one
>> >>>> > > > > > > > published together with the paper. It is not obvious that it
>> >>>> > > > > > > > will be maintained in the future, and reusing it is probably
>> >>>> > > > > > > > not an option for non-C++ Parquet implementations.
>> >>>> > > > > > > >
>> >>>> > > > > > > > 4. The encoding itself is complex, since it involves a fallback
>> >>>> > > > > > > > on another encoding if the primary encoding (which constitutes
>> >>>> > > > > > > > the real innovation) doesn't work out on a piece of data.
>> >>>> > > > > > > >
>> >>>> > > > > > > > Based on this, I would say that if we think ALP is attractive
>> >>>> > > > > > > > for us, we may want to incorporate our own version of ALP with
>> >>>> > > > > > > > the following changes:
>> >>>> > > > > > > >
>> >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
>> >>>> > > > > > > > implementations know exactly how to encode and represent data.
>> >>>> > > > > > > >
>> >>>> > > > > > > > 2. Do not include the ALPrd fallback, which is a homegrown
>> >>>> > > > > > > > dictionary encoding without dictionary reuse across pages, and
>> >>>> > > > > > > > instead rely on a well-known Parquet encoding (such as
>> >>>> > > > > > > > BYTE_STREAM_SPLIT?).
>> >>>> > > > > > > >
>> >>>> > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at compressing
>> >>>> > > > > > > > integers efficiently, with our own DELTA_BINARY_PACKED (which has
>> >>>> > > > > > > > the same qualities and is already available in Parquet
>> >>>> > > > > > > > implementations).
>> >>>> > > > > > > >
>> >>>> > > > > > > > Regards
>> >>>> > > > > > > >
>> >>>> > > > > > > > Antoine.
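To make point 3 above concrete: after the pseudo-decimal step the page body is
just a vector of integers. The ALP paper packs them with frame-of-reference
(FOR): subtract the block minimum, then bit-pack the small residuals at a
fixed width. Below is a minimal sketch of that step in illustrative Rust (a
real encoder would actually pack the residuals to `width` bits, while
DELTA_BINARY_PACKED achieves a similar effect with zigzag-encoded deltas plus
miniblock bit-packing, and is already specified and widely implemented).

    // Frame-of-reference over one block of ALP-encoded integers.
    fn for_encode(block: &[i64]) -> (i64, u32, Vec<u64>) {
        let min = *block.iter().min().expect("non-empty block");
        // Residuals are non-negative and typically small.
        let residuals: Vec<u64> =
            block.iter().map(|&v| v.wrapping_sub(min) as u64).collect();
        // Bit width needed to represent the largest residual.
        let width = residuals
            .iter()
            .map(|r| 64 - r.leading_zeros())
            .max()
            .unwrap_or(0);
        (min, width, residuals)
    }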
>> >>>> > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
>> >>>> > > > > > > > PRATEEK GAUR <[email protected]> wrote:
>> >>>> > > > > > > > > Hi team,
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > We spent some time evaluating ALP compression and decompression
>> >>>> > > > > > > > > compared to other encoding alternatives like CHIMP/GORILLA and
>> >>>> > > > > > > > > compression techniques like SNAPPY/LZ4/ZSTD. We presented these
>> >>>> > > > > > > > > numbers to the community members on October 15th in the biweekly
>> >>>> > > > > > > > > Parquet meeting. (I can't seem to access the recording, so please
>> >>>> > > > > > > > > let me know what access rules I need in order to view it.)
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > We did this evaluation over some datasets pointed to by the ALP
>> >>>> > > > > > > > > paper and some pointed to by the Parquet community.
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > The results are available in the following document:
>> >>>> > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > Based on the numbers we see:
>> >>>> > > > > > > > >
>> >>>> > > > > > > > >    - ALP is comparable to ZSTD (level=1) in terms of compression
>> >>>> > > > > > > > >    ratio and much better than the other schemes (the numbers in
>> >>>> > > > > > > > >    the sheet are bytes needed to encode each value).
>> >>>> > > > > > > > >    - ALP does quite well in terms of decompression speed (the
>> >>>> > > > > > > > >    numbers in the sheet are bytes decompressed per second).
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > As next steps we will:
>> >>>> > > > > > > > >
>> >>>> > > > > > > > >    - get the numbers for compression on top of byte stream split,
>> >>>> > > > > > > > >    - evaluate the algorithm over a few more datasets, and
>> >>>> > > > > > > > >    - have an implementation in the arrow-parquet repo.
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > Looking forward to feedback from the community.
>> >>>> > > > > > > > >
>> >>>> > > > > > > > > Best
>> >>>> > > > > > > > > Prateek and Dhirhan
