Sheet with numbers: <https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517>
On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR <[email protected]> wrote:

Hi team,

There was a request from a few folks, Antoine Pitrou and Adam Reeve if I remember correctly, to run the experiment on some of the papers that discussed BYTE_STREAM_SPLIT, for completeness. I wanted to share the numbers for that in this sheet. At this point we have numbers on a wide variety of data. (I'll have to share the sheet from my Snowflake account, as our laptops have a fair bit of restriction on copy/paste permissions :) )

Best,
Prateek

On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]> wrote:

Hi Julien,

Yes. Based on

- the numbers presented,
- the discussions on the doc, and
- multiple discussions in the biweekly meeting,

we are at a stage where we agree this is the right encoding to add, and we can move from the DISCUSS stage to the DRAFT/POC stage. I will start working on the PR for this.

Thanks for bringing this up.
Prateek

On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]> wrote:

@PRATEEK GAUR <[email protected]>: Would you agree that we are past the DISCUSS step and into the DRAFT/POC phase according to the proposals process <https://github.com/apache/parquet-format/tree/master/proposals>? If yes, could you open a PR on this page to add this proposal to the list? https://github.com/apache/parquet-format/tree/master/proposals
Thank you!

On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <[email protected]> wrote:

I have filed a ticket[1] in arrow-rs to track prototyping ALP in the Rust Parquet reader, if anyone is interested.

Andrew

[1]: https://github.com/apache/arrow-rs/issues/8748

On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <[email protected]> wrote:

> C++, Java and Rust support them for sure. I feel like we should
> probably default to V2 at some point.

I seem to recall that some of the vectorized Java readers (Iceberg, Spark) might not support V2 data pages (but I might be confusing this with encodings). This is only a vague recollection, though.

On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]> wrote:

> Someone has to add V2 data pages to
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> :)

Your wish is my command: https://github.com/apache/parquet-site/pull/124

As the format grows in popularity and momentum builds to evolve it, I feel the content on the parquet.apache.org site could use refreshing / updating. So, while I had the site open, I made some other PRs to scratch various itches (I am absolutely 🎣 for someone to please review 🙏):

1. Add Variant/Geometry/Geography types to the implementation status matrix: https://github.com/apache/parquet-site/pull/123
2. Improve the introduction / overview, and add more links to the spec and implementation status: https://github.com/apache/parquet-site/pull/125

Thanks,
Andrew
On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]> wrote:

Hi Julien, hi all,

On Mon, 20 Oct 2025 15:14:58 -0700, Julien Le Dem <[email protected]> wrote:

> Another question from me:
>
> Since the goal is to not use compression at all in this case (no ZSTD),
> I'm assuming we would be using either:
> - the Data Page V1 with UNCOMPRESSED in the ColumnMetaData.codec field
>   <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>, or
> - the Data Page V2 with false in the DataPageHeaderV2.is_compressed field
>   <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>.
> The second helps decide whether we can selectively compress some pages
> if they are less well compressed by the encoding.
> A few years ago there was a question on the support of DATA_PAGE_V2, and
> I was curious to hear a refresh on how that's generally supported in
> Parquet implementations. The is_compressed field was intended exactly to
> avoid block compression when the encoding itself is good enough.

Someone has to add V2 data pages to https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md :)

C++, Java and Rust support them for sure. I feel like we should probably default to V2 at some point.

Also see https://github.com/apache/parquet-java/issues/3344 for Java.

Regards,
Antoine.
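To make the is_compressed mechanics concrete, here is a minimal sketch of the per-page decision that Data Page V2 enables. It is not any particular implementation's API; zlib stands in for whatever codec the writer is actually configured with (e.g. ZSTD):

    import zlib  # stand-in codec; a real writer would use its configured codec

    def finish_v2_page(encoded: bytes) -> tuple[bytes, bool]:
        # Try block compression on the already-encoded page bytes.
        compressed = zlib.compress(encoded)
        # If the encoding (e.g. ALP or BYTE_STREAM_SPLIT) already squeezed
        # the data so that the codec gains nothing, store the page as-is
        # and set DataPageHeaderV2.is_compressed = false.
        if len(compressed) < len(encoded):
            return compressed, True
        return encoded, False

With V1 pages the choice is all-or-nothing per column chunk (codec = UNCOMPRESSED), whereas this check can be made page by page.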
On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <[email protected]> wrote:

Thanks again Prateek and co for pushing this along!

> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data

100% agree with this (similar to what was done for ParquetVariant).

> 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).

I think they are handled via the "Exception" mechanism. Vortex's ALP implementation (below) does appear to handle non-finite numbers[2].

> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be maintained in
> the future, and reusing it is probably not an option for non-C++ Parquet
> implementations.

My understanding from the call was that Prateek and team re-implemented ALP (they did not use the implementation from CWI[3]), but that would be good to confirm.

There is also a Rust implementation of ALP[1] that is part of the Vortex file format implementation. I have not reviewed it to see whether it deviates from the algorithm presented in the paper.

Andrew

[1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
[2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
[3]: https://github.com/cwida/ALP
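On the "Exception" mechanism: as described in the ALP paper, each double is encoded as round(v * 10^e / 10^f) for a per-vector (e, f) pair chosen by sampling, and any value that does not round-trip exactly, which includes +/-Inf and NaN, is stored verbatim in a small per-vector exception list and patched back on decode. A minimal sketch, with a hypothetical fixed (E, F) and without the FOR bit-packing that a real encoder applies to the integers:

    import math

    E, F = 14, 2  # hypothetical per-vector exponent/factor chosen by sampling

    def alp_encode(values):
        ints, exceptions = [], []
        for i, v in enumerate(values):
            scaled = v * 10.0**E / 10.0**F
            if math.isfinite(scaled) and abs(scaled) < 2**51:
                d = int(round(scaled))
                if d * 10.0**F / 10.0**E == v:  # exact round-trip?
                    ints.append(d)
                    continue
            ints.append(0)             # placeholder for the bit-packer
            exceptions.append((i, v))  # inf/NaN/non-round-tripping values
        return ints, exceptions

    def alp_decode(ints, exceptions):
        out = [d * 10.0**F / 10.0**E for d in ints]
        for i, v in exceptions:
            out[i] = v
        return out

A production encoder compares bit patterns rather than using ==, so that e.g. -0.0 survives the round trip.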
On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:

Hello,

Thanks for doing this, and I agree the numbers look impressive.

I would ask, if possible, for more data points:

1. More datasets: you could for example look at the datasets that were used to originally evaluate BYTE_STREAM_SPLIT (see https://issues.apache.org/jira/browse/PARQUET-1622 and specifically the Google Doc linked there).

2. A comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD.

3. Optionally, some perf numbers on x86 too, though I expect that ALP will remain very good there as well.

I also have the following reservations towards ALP:

1. There is no published official spec AFAICT, just a research paper.

2. I may be missing something, but the paper doesn't seem to mention non-finite values (such as +/-Inf and NaNs).

3. It seems there is a single implementation, which is the one published together with the paper. It is not obvious that it will be maintained in the future, and reusing it is probably not an option for non-C++ Parquet implementations.

4. The encoding itself is complex, since it involves a fallback on another encoding if the primary encoding (which constitutes the real innovation) doesn't work out on a piece of data.

Based on this, I would say that if we think ALP is attractive for us, we may want to incorporate our own version of ALP with the following changes:

1. Design and write our own Parquet-ALP spec so that implementations know exactly how to encode and represent data.

2. Do not include the ALPrd fallback, which is a homegrown dictionary encoding without dictionary reuse across pages; instead rely on a well-known Parquet encoding (such as BYTE_STREAM_SPLIT? — sketched below).

3. Replace the FOR encoding inside ALP, which aims at compressing integers efficiently, with our own DELTA_BINARY_PACKED, which has the same qualities and is already available in Parquet implementations (also sketched below).

Regards,
Antoine.
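For reference on point 2, BYTE_STREAM_SPLIT is simple enough to sketch in a few lines: it scatters the k-th byte of every value into the k-th of (for doubles) eight streams, so that similar bytes, notably the sign/exponent bytes, end up contiguous and a trailing general-purpose codec compresses them well:

    import struct

    def byte_stream_split_encode(values):
        # FLOAT64: one stream per byte position, streams concatenated.
        raw = [struct.pack("<d", v) for v in values]
        return b"".join(bytes(b[k] for b in raw) for k in range(8))

    def byte_stream_split_decode(data, n):
        streams = [data[k * n:(k + 1) * n] for k in range(8)]
        return [struct.unpack("<d", bytes(s[i] for s in streams))[0]
                for i in range(n)]

And on point 3, the FOR ("frame of reference") step inside ALP just subtracts a per-vector base before bit-packing the small offsets; DELTA_BINARY_PACKED applies the same idea to consecutive deltas, subtracting the per-miniblock minimum before bit-packing. A conceptual sketch of FOR, without the actual bit-packing format:

    def for_encode(ints):
        # Store the base once; the non-negative offsets need few bits each.
        base = min(ints)
        return base, [i - base for i in ints]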
On Thu, 16 Oct 2025 14:47:33 -0700, PRATEEK GAUR <[email protected]> wrote:

Hi team,

We spent some time evaluating ALP compression and decompression compared to other encoding alternatives like CHIMP/GORILLA and compression techniques like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members on October 15th in the biweekly Parquet meeting. (I can't seem to access the recording, so please let me know what access rules I need in order to view it.)

We did this evaluation over some datasets pointed to by the ALP paper and some pointed to by the Parquet community.

The results are available in the following document: https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers we see:

- ALP is comparable to ZSTD(level=1) in terms of compression ratio, and much better than the other schemes (the numbers in the sheet are bytes needed to encode each value).
- ALP does quite well in terms of decompression speed (the numbers in the sheet are bytes decompressed per second).

As next steps we will:

- Get the numbers for compression on top of byte stream split.
- Evaluate the algorithm over a few more datasets.
- Provide an implementation in the arrow-parquet repo.

Looking forward to feedback from the community.

Best,
Prateek and Dhirhan
