Re: Implementation status

Antoine Pitrou Fri, 24 Oct 2025 05:19:36 -0700


Ok, but let's keep in mind that parsing the Thrift footer for all those
columns will become expensive.


Regards

Antoine.


On Thu, 23 Oct 2025 11:13:53 -0700
Julien Le Dem <[email protected]> wrote:
> Do relevant people on that list who work at said vendors feel like adding
> their respective columns?
> BigQuery, Databricks, Dremio, Snowflake, ... ?
> https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> 
> On Wed, Oct 22, 2025 at 10:05 PM Arnav Balyan <[email protected]>
> wrote:
> 
> > +1, I can try to help crowdsource the list. Maybe we could use the public
> > Slack channel (seems to have 100+ people)
> >
> > On Thu, Oct 23, 2025 at 5:57 AM Andrew Lamb <[email protected]>
> > wrote:
> >  
> > > I think it is a great idea -- I can certainly add the columns, but as you
> > > say only people from those companies would be able to fill them out.
> > >
> > > Maybe if we added some columns that would add some (positive) pressure to
> > > provide the information
> > >
> > > Andrew
> > >
> > > On Wed, Oct 22, 2025 at 2:50 PM Julien Le Dem 
> > > <[email protected]> wrote:
> > >  
> > > > [forking into a new thread]
> > > > > Should we add columns for BigQuery, Databricks, Snowflake, Dremio, ...?
> > > > > I feel that this page is even more important for proprietary engines,
> > > > > whose implementation we cannot look at to check. (But they are important
> > > > > members of the ecosystem.)
> > > >
> > > >
> > > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]>
> > > > wrote:
> > > >  
> > > > > > > Someone has to add V2 data pages to
> > > > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > > > > :)
> > > > >
> > > > > > Your wish is my command: https://github.com/apache/parquet-site/pull/124
> > > > >
> > > > > > As the format grows in popularity and momentum builds to evolve, I feel the
> > > > > > content on the parquet.apache.org site could use refreshing / updating.
> > > > > > So, while I had the site open, I made some other PRs to scratch various
> > > > > > itches
> > > > >
> > > > > (I am absolutely 🎣 for someone to please review 🙏):
> > > > >
> > > > > > 1. Add Variant/Geometry/Geography types to implementation status matrix:
> > > > > > https://github.com/apache/parquet-site/pull/123
> > > > > > 2. Improve introduction / overview, add more links to spec and
> > > > > > implementation status: https://github.com/apache/parquet-site/pull/125
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Andrew
> > > > >
> > > > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]> wrote:
> > > > >  
> > > > > >
> > > > > > Hi Julien, hi all,
> > > > > >
> > > > > > On Mon, 20 Oct 2025 15:14:58 -0700
> > > > > > Julien Le Dem <[email protected]> wrote:  
> > > > > > >
> > > > > > > Another question from me:
> > > > > > >
> > > > > > > > Since the goal is to not use compression at all in this case (no ZSTD)
> > > > > > > > I'm assuming we would be using either:
> > > > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > > > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
> > > > > > > > field.
> > > > > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > > > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
> > > > > > > > field
> > > > > > > > The second helping decide if we can selectively compress some pages if
> > > > > > > > they are less compressed by the
> > > > > > > > A few years ago there was a question on the support of the DATA_PAGE_V2
> > > > > > > > and I was curious to hear a refresh on how that's generally supported in
> > > > > > > > Parquet implementations. The is_compressed field was exactly intended to
> > > > > > > > avoid block compression when the encoding itself is good enough.
> > > > > >
> > > > > > > Someone has to add V2 data pages to
> > > > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > > > > > > :)
> > > > > >
> > > > > > C++, Java and Rust support them for sure. I feel like we should
> > > > > > probably default to V2 at some point.
> > > > > >
> > > > > > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >  
> > > > > > >
> > > > > > > Julien
> > > > > > >
> > > > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <andrewlamb11-Re5JQEeQqe8-XMD5yJDbdMReXY1tMh2IBgC/[email protected]> wrote:
> > > > > > >  
> > > > > > > > Thanks again Prateek and co for pushing this along!
> > > > > > > >
> > > > > > > >  
> > > > > > > > > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > > > > > > > > know exactly how to encode and represent data
> > > > > > > > >
> > > > > > > > > 100% agree with this (similar to what was done for ParquetVariant)
> > > > > > > > >
> > > > > > > > > > 2. I may be missing something, but the paper doesn't seem to mention
> > > > > > > > > > non-finite values (such as +/-Inf and NaNs).
> > > > > > > > >
> > > > > > > > > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > > > > > > > > implementation (below) does appear to handle finite numbers[2]
> > > > > > > > >
> > > > > > > > > > 3. It seems there is a single implementation, which is the one published
> > > > > > > > > > together with the paper. It is not obvious that it will be
> > > > > > > > > > maintained in the future, and reusing it is probably not an option for
> > > > > > > > > > non-C++ Parquet implementations
> > > > > > > > >
> > > > > > > > > My understanding from the call was that Prateek and team re-implemented
> > > > > > > > > ALP (did not use the implementation from CWI[3]) but that would be good to
> > > > > > > > > confirm.
> > > > > > > > >
> > > > > > > > > There is also a Rust implementation of ALP[1] that is part of the Vortex
> > > > > > > > > file format implementation. I have not reviewed it to see if it deviates
> > > > > > > > > from the algorithm presented in the paper.
> > > > > > > >
> > > > > > > > Andrew
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > >
> > > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> >   
> > > > > > > > [2]:
> > > > > > > >
> > > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> >   
> > > > > > > > [3]: https://github.com/cwida/ALP
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <antoine-+ZN9ApsXKcFQFI55V6+gNQ-XMD5yJDbdMReXY1tMh2IBti2O/[email protected]> wrote:
> > > > > > > >  
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > > Thanks for doing this and I agree the numbers look impressive.
> > > > > > > > >
> > > > > > > > > I would ask if possible for more data points:
> > > > > > > > >
> > > > > > > > > > 1. More datasets: you could for example look at the datasets that were
> > > > > > > > > > used to originally evaluate BYTE_STREAM_SPLIT (see
> > > > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > > > > > > > > > the Google Doc linked there)
> > > > > > > > > >
> > > > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> > > > > > > > > >
> > > > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > > > > > > > > > remain very good there as well
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I also have the following reservations towards ALP:
> > > > > > > > >
> > > > > > > > > > 1. There is no published official spec AFAICT, just a research paper.
> > > > > > > > > >
> > > > > > > > > > 2. I may be missing something, but the paper doesn't seem to mention
> > > > > > > > > > non-finite values (such as +/-Inf and NaNs).
> > > > > > > > > >
> > > > > > > > > > 3. It seems there is a single implementation, which is the one published
> > > > > > > > > > together with the paper. It is not obvious that it will be
> > > > > > > > > > maintained in the future, and reusing it is probably not an option for
> > > > > > > > > > non-C++ Parquet implementations
> > > > > > > > > >
> > > > > > > > > > 4. The encoding itself is complex, since it involves a fallback on
> > > > > > > > > > another encoding if the primary encoding (which constitutes the real
> > > > > > > > > > innovation) doesn't work out on a piece of data.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Based on this, I would say that if we think ALP is attractive for us,
> > > > > > > > > > we may want to incorporate our own version of ALP with the following
> > > > > > > > > > changes:
> > > > > > > > > >
> > > > > > > > > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > > > > > > > > know exactly how to encode and represent data
> > > > > > > > > >
> > > > > > > > > > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> > > > > > > > > > encoding without dictionary reuse across pages, and instead rely on a
> > > > > > > > > > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> > > > > > > > > >
> > > > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > > > > > > > > > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> > > > > > > > > > same qualities and is already available in Parquet implementations)
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Antoine.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
> > > > > > > > > > PRATEEK GAUR <[email protected]> wrote:
> > > > > > > > > > Hi team,
> > > > > > > > > >
> > > > > > > > > > > We spent some time evaluating ALP compression and decompression compared to
> > > > > > > > > > > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > > > > > > > > > > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > > > > > > > > > > on October 15th in the biweekly parquet meeting. (I can't seem to
> > > > > > > > > > > access the recording, so please let me know what access rules I need to get to
> > > > > > > > > > > be able to view it)
> > > > > > > > > > >
> > > > > > > > > > > We did this evaluation over some datasets pointed by the ALP paper and some
> > > > > > > > > > > pointed by the parquet community.
> > > > > > > > > >
> > > > > > > > > > > The results are available in the following document
> > > > > > > > > > > <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
> > > > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > > > > > > > > >
> > > > > > > > > > > Based on the numbers we see
> > > > > > > > > > >
> > > > > > > > > > >    - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
> > > > > > > > > > >    much better compared to other schemes. (numbers in the sheet are bytes
> > > > > > > > > > >    needed to encode each value)
> > > > > > > > > > >    - ALP is doing quite well in terms of decompression speed (numbers in the
> > > > > > > > > > >    sheet are bytes decompressed per second)
> > > > > > > > > > >
> > > > > > > > > > > As next steps we will
> > > > > > > > > > >
> > > > > > > > > > >    - Get the numbers for compression on top of byte stream split.
> > > > > > > > > >    - Evaluate the algorithm over a few more datasets.
> > > > > > > > > >    - Have an implementation in the arrow-parquet repo.
> > > > > > > > > >
> > > > > > > > > > Looking forward to feedback from the community.
> > > > > > > > > >
> > > > > > > > > > Best
> > > > > > > > > > Prateek and Dhirhan
> > > > > > > > > >  
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > >
> > > > > >
> > > > > >
> > > > > >  
> > > > >  
> > > >  
> > >  
> >  
>
