[forking into a new thread] Should we add columns for BigQuery, Databricks, Snowflake, Dremio, ...? I feel that this page is even more important for proprietary engines, whose implementations we cannot inspect to check. (But they are important members of the ecosystem.)
On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]> wrote:
>
> > Someone has to add V2 data pages to
> > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > :)
>
> Your wish is my command: https://github.com/apache/parquet-site/pull/124
>
> As the format grows in popularity and momentum builds to evolve it, I
> feel the content on the parquet.apache.org site could use refreshing /
> updating. So, while I had the site open, I made some other PRs to
> scratch various itches (I am absolutely 🎣 for someone to please
> review 🙏):
>
> 1. Add Variant/Geometry/Geography types to the implementation status
> matrix: https://github.com/apache/parquet-site/pull/123
> 2. Improve the introduction / overview, and add more links to the spec
> and implementation status: https://github.com/apache/parquet-site/pull/125
>
> Thanks,
> Andrew
>
> On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]> wrote:
> >
> > Hi Julien, hi all,
> >
> > On Mon, 20 Oct 2025 15:14:58 -0700
> > Julien Le Dem <[email protected]> wrote:
> > >
> > > Another question from me:
> > >
> > > Since the goal is to not use compression at all in this case (no
> > > ZSTD), I'm assuming we would be using either:
> > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetaData.codec
> > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
> > > field, or
> > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
> > > field.
> > > The second option helps decide whether we can selectively compress
> > > some pages when they are not compressed well by the encoding.
> > > A few years ago there was a question about support for DATA_PAGE_V2,
> > > and I was curious to hear a refresh on how it is generally supported
> > > in Parquet implementations. The is_compressed field was intended
> > > exactly to avoid block compression when the encoding itself is good
> > > enough.
> >
> > Someone has to add V2 data pages to
> > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > :)
> >
> > C++, Java and Rust support them for sure. I feel like we should
> > probably default to V2 at some point.
> >
> > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> >
> > Regards
> >
> > Antoine.
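For concreteness, here is a minimal sketch of the two options above using
pyarrow (via the data_page_version and compression parameters of
pyarrow.parquet.write_table; file names are made up). Note that pyarrow
does not expose a per-page is_compressed knob, so the second call only
selects the V2 page format:

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": np.random.default_rng(0).normal(size=100_000)})

    # Option 1: V1 data pages, codec UNCOMPRESSED in ColumnMetaData,
    # so no page is block-compressed
    pq.write_table(table, "v1_uncompressed.parquet",
                   data_page_version="1.0", compression="none")

    # Option 2: V2 data pages; the V2 page header carries the
    # is_compressed flag that would let a writer leave individual pages
    # uncompressed even when a codec is declared for the column
    pq.write_table(table, "v2_pages.parquet",
                   data_page_version="2.0", compression="zstd")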
> > > Julien
> > >
> > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <[email protected]> wrote:
> > > >
> > > > Thanks again Prateek and co for pushing this along!
> > > >
> > > > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > > > know exactly how to encode and represent data
> > > >
> > > > 100% agree with this (similar to what was done for ParquetVariant)
> > > >
> > > > > 2. I may be missing something, but the paper doesn't seem to
> > > > > mention non-finite values (such as +/-Inf and NaNs).
> > > >
> > > > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > > > implementation (below) does appear to handle non-finite values[2].
> > > >
> > > > > 3. It seems there is a single implementation, which is the one
> > > > > published together with the paper. It is not obvious that it will
> > > > > be maintained in the future, and reusing it is probably not an
> > > > > option for non-C++ Parquet implementations
> > > >
> > > > My understanding from the call was that Prateek and team
> > > > re-implemented ALP (did not use the implementation from CWI[3]), but
> > > > that would be good to confirm.
> > > >
> > > > There is also a Rust implementation of ALP[1] that is part of the
> > > > Vortex file format implementation. I have not reviewed it to see if
> > > > it deviates from the algorithm presented in the paper.
> > > >
> > > > Andrew
> > > >
> > > > [1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > > > [2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > > > [3]: https://github.com/cwida/ALP
> > > >
> > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Thanks for doing this, and I agree the numbers look impressive.
> > > > >
> > > > > I would ask, if possible, for more data points:
> > > > >
> > > > > 1. More datasets: you could for example look at the datasets that
> > > > > were used to originally evaluate BYTE_STREAM_SPLIT (see
> > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > > > > the Google Doc linked there)
> > > > >
> > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> > > > >
> > > > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP
> > > > > will remain very good there as well
> > > > >
> > > > > I also have the following reservations towards ALP:
> > > > >
> > > > > 1. There is no published official spec AFAICT, just a research paper.
> > > > >
> > > > > 2. I may be missing something, but the paper doesn't seem to
> > > > > mention non-finite values (such as +/-Inf and NaNs).
> > > > >
> > > > > 3. It seems there is a single implementation, which is the one
> > > > > published together with the paper. It is not obvious that it will
> > > > > be maintained in the future, and reusing it is probably not an
> > > > > option for non-C++ Parquet implementations.
> > > > >
> > > > > 4. The encoding itself is complex, since it involves a fallback on
> > > > > another encoding if the primary encoding (which constitutes the
> > > > > real innovation) doesn't work out on a piece of data.
> > > > >
> > > > > Based on this, I would say that if we think ALP is attractive for
> > > > > us, we may want to incorporate our own version of ALP with the
> > > > > following changes:
> > > > >
> > > > > 1. Design and write our own Parquet-ALP spec so that
> > > > > implementations know exactly how to encode and represent data
> > > > >
> > > > > 2. Do not include the ALPrd fallback, which is a homegrown
> > > > > dictionary encoding without dictionary reuse across pages, and
> > > > > instead rely on a well-known Parquet encoding (such as
> > > > > BYTE_STREAM_SPLIT?)
> > > > >
> > > > > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > > > > integers efficiently, with our own DELTA_BINARY_PACKED (which has
> > > > > the same qualities and is already available in Parquet
> > > > > implementations)
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
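To make the "primary encoding plus exceptions" structure concrete, here is
a minimal sketch of ALP's decimal scheme as I read the paper (in the real
algorithm the exponents e and f are chosen per vector by sampling, the
rounding uses a fast-rounding trick, the integers must fit in int64, and
the results then go through FOR and bit-packing; all of that is simplified
or omitted here). Non-finite values never pass the round-trip test, so
they naturally fall out as exceptions:

    import math

    def alp_encode(values, e, f):
        """Encode doubles as round(v * 10**e * 10**-f); values that do
        not round-trip exactly are recorded as (position, value)
        exceptions and stored verbatim."""
        ints, exceptions = [], []
        for i, v in enumerate(values):
            if math.isfinite(v):
                scaled = v * 10.0**e * 10.0**-f
                if math.isfinite(scaled):
                    # round half away from zero
                    n = int(scaled + 0.5) if scaled >= 0 else -int(0.5 - scaled)
                    if n * 10.0**f * 10.0**-e == v:   # exact round-trip?
                        ints.append(n)                # -> FOR + bit-packing in real ALP
                        continue
            ints.append(0)                 # placeholder, patched on decode
            exceptions.append((i, v))
        return ints, exceptions

    def alp_decode(ints, exceptions, e, f):
        out = [n * 10.0**f * 10.0**-e for n in ints]
        for i, v in exceptions:
            out[i] = v
        return out

    vals = [1.25, 3.5, float("nan"), math.pi]
    ints, exc = alp_encode(vals, e=2, f=0)
    print(exc)   # nan and pi fail the round-trip and become exceptions
    assert alp_decode(ints, exc, e=2, f=0)[:2] == [1.25, 3.5]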
> > > > > On Thu, 16 Oct 2025 14:47:33 -0700
> > > > > PRATEEK GAUR <[email protected]> wrote:
> > > > > >
> > > > > > Hi team,
> > > > > >
> > > > > > We spent some time evaluating ALP compression and decompression
> > > > > > compared to other encoding alternatives like CHIMP/GORILLA and
> > > > > > compression techniques like SNAPPY/LZ4/ZSTD. We presented these
> > > > > > numbers to the community members on October 15th in the biweekly
> > > > > > Parquet meeting. (I can't seem to access the recording, so
> > > > > > please let me know what access rules I need in order to view it.)
> > > > > >
> > > > > > We did this evaluation over some datasets pointed to by the ALP
> > > > > > paper and some pointed to by the Parquet community.
> > > > > >
> > > > > > The results are available in the following document:
> > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > > > > >
> > > > > > Based on the numbers we see:
> > > > > >
> > > > > > - ALP is comparable to ZSTD (level=1) in terms of compression
> > > > > > ratio and much better than the other schemes (the numbers in
> > > > > > the sheet are bytes needed to encode each value).
> > > > > > - ALP does quite well in terms of decompression speed (the
> > > > > > numbers in the sheet are bytes decompressed per second).
> > > > > >
> > > > > > As next steps we will:
> > > > > >
> > > > > > - Get the numbers for compression on top of byte stream split
> > > > > > (the transform is sketched below).
> > > > > > - Evaluate the algorithm over a few more datasets.
> > > > > > - Have an implementation in the arrow-parquet repo.
> > > > > >
> > > > > > Looking forward to feedback from the community.
> > > > > >
> > > > > > Best,
> > > > > > Prateek and Dhirhan
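For readers unfamiliar with the baseline mentioned in the first next step:
BYTE_STREAM_SPLIT does not compress by itself; it only rearranges bytes so
that a general-purpose codec on top (LZ4, ZSTD) compresses better. Here is
a minimal sketch of the transform for float64 values (in pyarrow the real
encoding can be enabled per column via the use_byte_stream_split argument
of pyarrow.parquet.write_table):

    import struct

    def byte_stream_split(floats):
        """Gather the k-th byte of every float64 into the k-th stream, so
        bytes playing the same role (sign/exponent vs. mantissa) sit
        together and compress better."""
        raw = struct.pack(f"<{len(floats)}d", *floats)
        return bytes(raw[j] for k in range(8) for j in range(k, len(raw), 8))

    def byte_stream_merge(data, n):
        """Inverse transform: scatter the streams back into float64 values."""
        raw = bytearray(len(data))
        it = iter(data)
        for k in range(8):
            for j in range(k, 8 * n, 8):
                raw[j] = next(it)
        return list(struct.unpack(f"<{n}d", bytes(raw)))

    vals = [1.0, 1.5, 2.0, 2.5]
    assert byte_stream_merge(byte_stream_split(vals), len(vals)) == vals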
