I've found that a variable page size, based on the expected number of
columns to be read back, is necessary, since read-back memory scales as
the number of columns times the page size times the number of files
being read concurrently. Reading back 1000 columns at a 1 MiB page
size, for example, can need 1 GB+ of memory per file for reads. This
led to sizing pages down as file width went up, to avoid spending an
excessive share of the memory budget on reads, which often meant pages
closer to 64 KB - 128 KB. (In the work I did, we typically expected
many files to be read concurrently across many requested ops.)
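
As a rough illustration of that sizing logic, here's a minimal Java
sketch (the budget, clamp bounds, and names are my own assumptions for
the example, not code from that system):

    class PageSizing {
        // Hypothetical: derive a per-column page size from a fixed
        // per-file read-memory budget, then clamp it to a sane range.
        static long choosePageSize(int columnCount, long readBudgetBytes) {
            long minPage = 64L * 1024;    // 64 KB floor
            long maxPage = 1024L * 1024;  // 1 MiB ceiling (common default)
            long perColumn = readBudgetBytes / Math.max(1, columnCount);
            return Math.max(minPage, Math.min(maxPage, perColumn));
        }
    }

With an assumed 256 MiB per-file budget, 1000 columns works out to
roughly 262 KB pages, while a narrow file stays at the 1 MiB ceiling;
dividing the budget further by the expected number of concurrently read
files is what pushes results toward the 64 KB - 128 KB range.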

On Wed, May 22, 2024, 11:50 PM Andrew Lamb <[email protected]> wrote:

> The Rust implementation uses 1MB pages by default[1]
>
> Andrew
>
> [1]:
>
> https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
>
> On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong <[email protected]> wrote:
>
> > Hey Antoine,
> >
> > Thanks for raising this. In Iceberg we also use the 1 MiB page size:
> >
> >
> >
> https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
> >
> > Kind regards,
> > Fokko
> >
> > On Thu, May 23, 2024 at 10:06 AM, Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > Hello,
> > >
> > > The Parquet format itself (or at least the README) recommends an 8 kiB
> > > page size, suggesting that data pages are the unit of computation.
> > >
> > > However, Parquet C++ has long chosen a 1 MiB page size by default (*),
> > > suggesting that data pages are treated as the unit of IO there.
> > >
> > > (*) even bumping it to 64 MiB at some point, perhaps by mistake:
> > >
> > >
> >
> https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
> > >
> > > What are the typical choices in other writers?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
