Likewise, Rust does the same thing (it finishes a page based on byte size or row count, whichever limit is hit first), though its default row limit is 1,000,000[1] rather than 20,000.
[1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L45

On Thu, May 23, 2024 at 9:40 AM Jan Finis <[email protected]> wrote:

> Addendum, since Fokko mentioned Iceberg.
>
> Iceberg does the same, also applying a 20000-row limit by default
> (https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67)
>
> On Thu, May 23, 2024 at 3:38 PM Jan Finis <[email protected]> wrote:
>
> > The 1 MiB page size limit of parquet-mr is a red herring. Parquet-mr (now
> > parquet-java) actually writes *way smaller* pages by default. parquet-mr
> > actually has *three limits* for deciding when to finish a page:
> >
> > - The size limit, which is 1 MiB by default, as you mention.
> >   (DEFAULT_PAGE_SIZE)
> > - A value limit, which is INT_MAX / 2 by default (so not really a
> >   limit, if the default is used). (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD)
> > - A row count limit, which is 20000 by default.
> >   (DEFAULT_PAGE_ROW_COUNT_LIMIT)
> >   In practice, this limit will hit *way* before the page size limit of
> >   1 MiB is reached.
> >
> > (See
> > https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238
> > for the code that checks all three limits.)
> >
> > Thus, the page size limit is rather an upper bound for very large values
> > (e.g., long strings) or very many values in the case of nested columns.
> > It will usually not be reached at all for a normal non-nested,
> > non-long-string column.
> >
> > Rather, the pages will actually be quite small due to the 20000-row
> > limit; e.g., in PLAIN encoding, a page without any R and D levels would
> > be 80 kB for 4-byte values and 160 kB for 8-byte values. And this is
> > *before* applying compression.
> > If your values compress very well, or if you use an encoding that is way
> > smaller (e.g., dict), pages will be way smaller. E.g., say you only have
> > 16 distinct values in the page; then dictionary encoding with 4-bit keys
> > will be used, leading to a page of only 10 kB, even if there are not any
> > runs in it. As some data types compress very well (either due to RLE
> > (dict keys), DELTA_*, or due to black-box compression applied on top),
> > I have seen many pages < 1 kB in practice.
> >
> > Cheers,
> > Jan
> >
> > On Thu, May 23, 2024 at 3:05 PM Antoine Pitrou <[email protected]> wrote:
> >
> > > Speaking of which, and responding to my own question, parquet-java
> > > also defaults to 1 MiB:
> > >
> > > https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On Thu, 23 May 2024 01:39:58 -1000
> > > Jacques Nadeau <[email protected]> wrote:
> > > > I've found that a variable page size based on the expected read-back
> > > > number of columns is necessary, since you'll need read-back memory
> > > > equal to the number of columns times the page size times the number
> > > > of files being read concurrently. So if one is reading back 1000
> > > > columns, one may need 1 GB+ of memory per file for reads. This
> > > > resulted in sizing things down as width went up, to avoid spending
> > > > excessive budget on read memory. This often resulted in pages closer
> > > > to 64 kB - 128 kB. (In the work I did, we typically expected many
> > > > files to be read concurrently across many requested ops.)
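Jacques' read-memory arithmetic above can be sketched as follows; the figures are illustrative (one buffered page per column per open file), not measurements from any particular reader:

```python
def read_memory_bytes(num_columns: int, page_size_bytes: int,
                      concurrent_files: int) -> int:
    """Rough decode-side buffer requirement: one page buffered per
    column, per concurrently open file."""
    return num_columns * page_size_bytes * concurrent_files

# 1000 columns at the common 1 MiB default page size, one file open:
print(read_memory_bytes(1000, 1024 * 1024, 1))  # -> 1048576000, i.e. ~1 GB

# Shrinking pages to 64 kiB cuts that to ~64 MiB per file:
print(read_memory_bytes(1000, 64 * 1024, 1))    # -> 65536000
```

This is why writers targeting very wide tables sometimes scale the page size down as the column count goes up.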
> > > > On Wed, May 22, 2024, 11:50 PM Andrew Lamb <[email protected]> wrote:
> > > >
> > > > > The Rust implementation uses 1 MB pages by default[1]
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
> > > > >
> > > > > On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong <[email protected]> wrote:
> > > > >
> > > > > > Hey Antoine,
> > > > > >
> > > > > > Thanks for raising this. In Iceberg we also use the 1 MiB page size:
> > > > > >
> > > > > > https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
> > > > > >
> > > > > > Kind regards,
> > > > > > Fokko
> > > > > >
> > > > > > On Thu, May 23, 2024 at 10:06 AM Antoine Pitrou <[email protected]> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > The Parquet format itself (or at least the README) recommends an
> > > > > > > 8 kiB page size, suggesting that data pages are the unit of
> > > > > > > computation.
> > > > > > >
> > > > > > > However, Parquet C++ has long chosen a 1 MiB page size by
> > > > > > > default (*), suggesting that data pages are considered as the
> > > > > > > unit of IO there.
> > > > > > >
> > > > > > > (*) even bumping it to 64 MiB at some point, perhaps by mistake:
> > > > > > > https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
> > > > > > >
> > > > > > > What are the typical choices in other writers?
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
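The page-size figures quoted in the thread (80 kB, 160 kB, 10 kB) follow directly from the 20,000-row default. A quick back-of-the-envelope sketch, ignoring page headers and the dictionary page itself:

```python
ROW_LIMIT = 20_000  # parquet-java's DEFAULT_PAGE_ROW_COUNT_LIMIT

def plain_page_bytes(row_limit: int, value_size_bytes: int) -> int:
    """PLAIN encoding on a flat, required column: no repetition or
    definition levels, so the page is just the raw values."""
    return row_limit * value_size_bytes

def dict_page_bytes(row_limit: int, distinct_values: int) -> float:
    """Dictionary encoding stores one key per value; the key width is
    the number of bits needed to address the dictionary (simplified:
    no RLE runs assumed)."""
    bits_per_key = max(1, (distinct_values - 1).bit_length())
    return row_limit * bits_per_key / 8

print(plain_page_bytes(ROW_LIMIT, 4))   # -> 80000,  ~80 kB for 4-byte values
print(plain_page_bytes(ROW_LIMIT, 8))   # -> 160000, ~160 kB for 8-byte values
print(dict_page_bytes(ROW_LIMIT, 16))   # -> 10000.0, ~10 kB with 4-bit keys
```

All of these are well under the 1 MiB size limit, which is why the row-count limit, not the byte-size limit, is what actually finishes most pages.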
