We'll change the format from Markdown to flatbuffer. (this joke has a very small TAM)

On Fri, Oct 24, 2025 at 5:17 AM Antoine Pitrou wrote:
>
> Ok, but let's keep in mind that parsing the Thrift footer for all those
> columns will become expensive.
>
> Regards
>
> Antoine.
>
>
> On Thu, 23 Oct 2025 11:13:53 -0700
> Julien Le Dem wrote:
> > Do relevant people on that list who work at said vendors feel like adding
> > their respective columns?
> > BigQuery, Databricks, Dremio, Snowflake, ... ?
> >
> > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> >
> > On Wed, Oct 22, 2025 at 10:05 PM Arnav Balyan wrote:
> >
> > > +1, I can try to help crowdsource the list. Maybe we could use the public
> > > Slack channel (seems to have 100+ people)
> > >
> > > On Thu, Oct 23, 2025 at 5:57 AM Andrew Lamb wrote:
> > >
> > > > I think it is a great idea -- I can certainly add the columns, but as you
> > > > say only people from those companies would be able to fill them out.
> > > >
> > > > Maybe if we added some columns that would add some (positive) pressure to
> > > > provide the information
> > > >
> > > > Andrew
> > > >
> > > > On Wed, Oct 22, 2025 at 2:50 PM Julien Le Dem wrote:
> > > >
> > > > > [forking into a new thread]
> > > > > Should we add columns for BigQuery, Databricks, Snowflake, Dremio, ...?
> > > > > I feel that this page is even more important for proprietary engines that
> > > > > we can not look at the implementation to check.
> > > > > (But they are important members of the ecosystem)
> > > > >
> > > > >
> > > > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb wrote:
> > > > >
> > > > > > > Someone has to add V2 data pages to
> > > > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > > > > :)
> > > > > >
> > > > > > Your wish is my command:
> > > > > > https://github.com/apache/parquet-site/pull/124
> > > > > >
> > > > > > As the format grows in popularity and momentum builds to evolve, I feel
> > > > > > the content on the parquet.apache.org site could use refreshing / updating.
> > > > > > So, while I had the site open, I made some other PRs to scratch various
> > > > > > itches
> > > > > >
> > > > > > (I am absolutely 🎣 for someone to please review 🙏):
> > > > > >
> > > > > > 1. Add Variant/Geometry/Geography types to implementation status matrix:
> > > > > > https://github.com/apache/parquet-site/pull/123
> > > > > > 2. Improve introduction / overview, add more links to spec and
> > > > > > implementation status: https://github.com/apache/parquet-site/pull/125
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Andrew
> > > > > >
> > > > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou wrote:
> > > > > >
> > > > > > >
> > > > > > > Hi Julien, hi all,
> > > > > > >
> > > > > > > On Mon, 20 Oct 2025 15:14:58 -0700
> > > > > > > Julien Le Dem wrote:
> > > > > > > >
> > > > > > > > Another question from me:
> > > > > > > >
> > > > > > > > Since the goal is to not use compression at all in this case (no ZSTD)
> > > > > > > > I'm assuming we would be using either:
> > > > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > > > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
> > > > > > > > field.
> > > > > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > > > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
> > > > > > > > field
> > > > > > > > The second helping decide if we can selectively compress some pages if
> > > > > > > > they are less compressed by the
> > > > > > > > A few years ago there was a question on the support of the DATA_PAGE_V2
> > > > > > > > and I was curious to hear a refresh on how that's generally supported in
> > > > > > > > Parquet implementations. The is_compressed field was exactly intended to
> > > > > > > > avoid block compression when the encoding itself is good enough.
> > > > > > >
> > > > > > > Someone has to add V2 data pages to
> > > > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > > > > :)
> > > > > > >
> > > > > > > C++, Java and Rust support them for sure. I feel like we should
> > > > > > > probably default to V2 at some point.
> > > > > > >
> > > > > > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > > >
> > > > > > > > Julien
> > > > > > > >
> > > > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb wrote:
> > > > > > > >
> > > > > > > > > Thanks again Prateek and co for pushing this along!
> > > > > > > > >
> > > > > > > > > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > > > > > > > > know exactly how to encode and represent data
> > > > > > > > >
> > > > > > > > > 100% agree with this (similar to what was done for ParquetVariant)
> > > > > > > > >
> > > > > > > > > > 2. I may be missing something, but the paper doesn't seem to mention
> > > > > > > > > > non-finite values (such as +/-Inf and NaNs).
> > > > > > > > >
> > > > > > > > > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > > > > > > > > implementation (below) does appear to handle finite numbers[2]
> > > > > > > > >
> > > > > > > > > > 3. It seems there is a single implementation, which is the one published
> > > > > > > > > > together with the paper. It is not obvious that it will be
> > > > > > > > > > maintained in the future, and reusing it is probably not an option for
> > > > > > > > > > non-C++ Parquet implementations
> > > > > > > > >
> > > > > > > > > My understanding from the call was that Prateek and team re-implemented
> > > > > > > > > ALP (did not use the implementation from CWI[3]) but that would be good to
> > > > > > > > > confirm.
> > > > > > > > >
> > > > > > > > > There is also a Rust implementation of ALP[1] that is part of the Vortex
> > > > > > > > > file format implementation. I have not reviewed it to see if it deviates
> > > > > > > > > from the algorithm presented in the paper.
> > > > > > > > >
> > > > > > > > > Andrew
> > > > > > > > >
> > > > > > > > > [1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > > > > > > > > [2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > > > > > > > > [3]: https://github.com/cwida/ALP
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > Thanks for doing this and I agree the numbers look impressive.
> > > > > > > > > >
> > > > > > > > > > I would ask if possible for more data points:
> > > > > > > > > >
> > > > > > > > > > 1. More datasets: you could for example look at the datasets that were
> > > > > > > > > > used to originally evaluate BYTE_STREAM_SPLIT (see
> > > > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > > > > > > > > > the Google Doc linked there)
> > > > > > > > > >
> > > > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> > > > > > > > > >
> > > > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > > > > > > > > > remain very good there as well
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I also have the following reservations towards ALP:
> > > > > > > > > >
> > > > > > > > > > 1. There is no published official spec AFAICT, just a research paper.
> > > > > > > > > >
> > > > > > > > > > 2. I may be missing something, but the paper doesn't seem to mention
> > > > > > > > > > non-finite values (such as +/-Inf and NaNs).
> > > > > > > > > >
> > > > > > > > > > 3. It seems there is a single implementation, which is the one published
> > > > > > > > > > together with the paper. It is not obvious that it will be
> > > > > > > > > > maintained in the future, and reusing it is probably not an option for
> > > > > > > > > > non-C++ Parquet implementations
> > > > > > > > > >
> > > > > > > > > > 4. The encoding itself is complex, since it involves a fallback on
> > > > > > > > > > another encoding if the primary encoding (which constitutes the real
> > > > > > > > > > innovation) doesn't work out on a piece of data.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Based on this, I would say that if we think ALP is attractive for us,
> > > > > > > > > > we may want to incorporate our own version of ALP with the following
> > > > > > > > > > changes:
> > > > > > > > > >
> > > > > > > > > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > > > > > > > > know exactly how to encode and represent data
> > > > > > > > > >
> > > > > > > > > > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> > > > > > > > > > encoding without dictionary reuse across pages, and instead rely on a
> > > > > > > > > > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> > > > > > > > > >
> > > > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > > > > > > > > > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> > > > > > > > > > same qualities and is already available in Parquet implementations)
> > > > > > > > > >
> > > > > > > > > > Regards
> > > > > > > > > >
> > > > > > > > > > Antoine.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
> > > > > > > > > > PRATEEK GAUR wrote:
> > > > > > > > > > > Hi team,
> > > > > > > > > > >
> > > > > > > > > > > We spent some time evaluating ALP compression and decompression compared to
> > > > > > > > > > > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > > > > > > > > > > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > > > > > > > > > > on October 15th in the biweekly parquet meeting. (I can't seem to access
> > > > > > > > > > > the recording, so please let me know what access rules I need to get to be
> > > > > > > > > > > able to view it)
> > > > > > > > > > >
> > > > > > > > > > > We did this evaluation over some datasets pointed to by the ALP paper and
> > > > > > > > > > > some pointed to by the parquet community.
> > > > > > > > > > >
> > > > > > > > > > > The results are available in the following document
> > > > > > > > > > > <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
> > > > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > > > > > > > > > >
> > > > > > > > > > > Based on the numbers we see
> > > > > > > > > > >
> > > > > > > > > > >    - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
> > > > > > > > > > >    much better compared to other schemes. (numbers in the sheet are bytes
> > > > > > > > > > >    needed to encode each value)
> > > > > > > > > > >    - ALP is doing quite well in terms of decompression speed (numbers in the
> > > > > > > > > > >    sheet are bytes decompressed per second)
> > > > > > > > > > >
> > > > > > > > > > > As next steps we will
> > > > > > > > > > >
> > > > > > > > > > >    - Get the numbers for compression on top of byte stream split.
> > > > > > > > > > >    - Evaluate the algorithm over a few more datasets.
> > > > > > > > > > >    - Have an implementation in the arrow-parquet repo.
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to feedback from the community.
> > > > > > > > > > >
> > > > > > > > > > > Best
> > > > > > > > > > > Prateek and Dhirhan
