Speaking of which and responding to my own question, parquet-java also
defaults to 1 MiB:
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49

Regards

Antoine.



On Thu, 23 May 2024 01:39:58 -1000
Jacques Nadeau <jacq...@apache.org> wrote:
> I've found that a variable page size, based on the number of columns
> expected to be read back, is necessary, since you need read-back memory
> equal to the number of columns times the page size times the number of
> files being read concurrently. So if one is reading back 1000 columns, one
> may need 1 GB+ of memory per file for reads. This resulted in sizing pages
> down as width went up, to avoid spending an excessive budget on read
> memory. This often resulted in pages closer to 64k - 128k. (In the work I
> did, we typically expected many files to be read concurrently across many
> requested ops.)
> 
> On Wed, May 22, 2024, 11:50 PM Andrew Lamb 
> <andrewlamb11-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> 
> > The Rust implementation uses 1MB pages by default[1]
> >
> > Andrew
> >
> > [1]:
> >
> > https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
> >
> > On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong 
> > <fokko-1odqgaof3llqfi55v6+...@public.gmane.org> wrote:
> >  
> > > Hey Antoine,
> > >
> > > Thanks for raising this. In Iceberg we also use the 1 MiB page size:
> > >
> > > https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
> > >
> > > Kind regards,
> > > Fokko
> > >
> > > Op do 23 mei 2024 om 10:06 schreef Antoine Pitrou 
> > > <antoine-+zn9apsxkcednm+yrof...@public.gmane.org>:
> > >  
> > > >
> > > > Hello,
> > > >
> > > > The Parquet format itself (or at least the README) recommends an 8 KiB
> > > > page size, suggesting that data pages are the unit of computation.
> > > >
> > > > However, Parquet C++ has long chosen a 1 MiB page size by default (*),
> > > > suggesting that data pages are considered as the unit of IO there.
> > > >
> > > > (*) even bumping it to 64 MiB at some point, perhaps by mistake:
> > > >
> > > > https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
> > > >
> > > > What are the typical choices in other writers?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >  
> > >  
> >  
> 
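
For reference, the read-memory estimate in Jacques's reply quoted above
(number of columns times page size times number of files read concurrently)
can be sketched in a few lines of Python. This is only a rough illustration,
not any implementation's actual accounting: the function names are
hypothetical, and it assumes one page buffered per column per open file, with
the 1 MiB page size that several writers in this thread use by default.

```python
KIB = 1024
MIB = 1024 * KIB


def read_buffer_bytes(num_columns, page_size_bytes, concurrent_files=1):
    """Rough estimate: one page held in memory per column, per open file."""
    return num_columns * page_size_bytes * concurrent_files


def page_size_for_budget(budget_bytes, num_columns, concurrent_files=1):
    """Largest page size (in bytes) that keeps the estimate within a budget."""
    if num_columns <= 0 or concurrent_files <= 0:
        raise ValueError("num_columns and concurrent_files must be positive")
    return budget_bytes // (num_columns * concurrent_files)


if __name__ == "__main__":
    # Compare the 1 MiB default with the 64k - 128k range mentioned above,
    # for a single file with 1000 columns read back.
    for page_size in (1 * MIB, 128 * KIB, 64 * KIB):
        total = read_buffer_bytes(1000, page_size)
        print(f"{page_size // KIB} KiB pages: {total / MIB:.1f} MiB per file")
```

With 1000 columns and 1 MiB pages the estimate is 1000 MiB per file, which
matches the "1 GB+" figure; at 64 KiB it drops to 62.5 MiB per file.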


