The Rust implementation supports limiting the number of rows in a page, although this limit is disabled by default. If there is consensus that 20,000 rows is the recommended limit, I don't see any issue with changing this default.

On 23/05/2024 14:39, Jan Finis wrote:
Addendum, since Fokko mentioned Iceberg.

Iceberg does the same, also applying a 20,000-row limit by default
(
https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67
)

Am Do., 23. Mai 2024 um 15:38 Uhr schrieb Jan Finis <jpfi...@gmail.com>:

The 1 MiB page size limit of parquet-mr is a red herring. Parquet-mr (now
parquet-java) actually writes *way smaller* pages by default, because it
applies *three limits* when deciding whether to finish a page:

    - A size limit, which is 1 MiB by default, as you mention
    (DEFAULT_PAGE_SIZE).
    - A value count limit, which is INT_MAX / 2 by default, so not really a
    limit if the default is used (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD).
    - A row count limit, which is 20,000 by default
    (DEFAULT_PAGE_ROW_COUNT_LIMIT). In practice, this limit is hit *way*
    before the 1 MiB page size limit is reached.

(See
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238
for the code that checks all three limits)
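The check in ColumnWriteStoreBase can be sketched roughly as follows (a minimal Python sketch, not the actual Java code; the function name `should_flush_page` is hypothetical, while the constant names and defaults mirror the ones quoted above):

```python
# Sketch of parquet-java's page-flush decision: a page is finished
# as soon as ANY of the three limits is reached.

INT_MAX = 2**31 - 1

DEFAULT_PAGE_SIZE = 1024 * 1024                    # 1 MiB size limit
DEFAULT_PAGE_VALUE_COUNT_THRESHOLD = INT_MAX // 2  # effectively unlimited
DEFAULT_PAGE_ROW_COUNT_LIMIT = 20_000              # 20,000 rows

def should_flush_page(buffered_bytes: int, value_count: int, row_count: int,
                      size_limit: int = DEFAULT_PAGE_SIZE,
                      value_limit: int = DEFAULT_PAGE_VALUE_COUNT_THRESHOLD,
                      row_limit: int = DEFAULT_PAGE_ROW_COUNT_LIMIT) -> bool:
    return (buffered_bytes >= size_limit
            or value_count >= value_limit
            or row_count >= row_limit)

# For a flat column of 4-byte PLAIN values, the row limit fires first:
# 20,000 rows buffer only ~80 kB, far below the 1 MiB size limit.
```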

Thus, the page size limit is rather an upper bound for very large values
(e.g., long strings) or for very many values per row in nested columns. It
will usually not be reached at all for a normal non-nested, non-long-string
column.

Rather, the pages will actually be quite small due to the 20,000-row limit.
For example, in PLAIN encoding, a page without any R and D levels would be
80 kB for 4-byte values and 160 kB for 8-byte values, and this is *before*
applying compression. If your values compress very well, or if you use an
encoding that is much smaller (e.g., dictionary), pages will be smaller
still. Say you have only 16 distinct values in the page: then dictionary
encoding with 4-bit keys will be used, leading to a page of only 10 kB,
even without any runs in it. Since some data types compress very well
(due to RLE on dictionary keys, DELTA_* encodings, or black-box compression
applied on top), I have seen many pages < 1 kB in practice.
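The back-of-the-envelope numbers above can be checked with plain arithmetic (a sketch assuming a flat column with no R/D levels and no compression; the helper names are hypothetical):

```python
# Uncompressed data-page sizes at the 20,000-row page limit.
ROWS = 20_000

def plain_page_bytes(rows: int, value_width_bytes: int) -> int:
    """PLAIN encoding: fixed-width values, no repetition/definition levels."""
    return rows * value_width_bytes

def dict_page_bytes(rows: int, key_bits: int) -> int:
    """Dictionary-encoded keys only (the dictionary page itself not counted)."""
    return rows * key_bits // 8

print(plain_page_bytes(ROWS, 4))  # 80000  -> ~80 kB for 4-byte values
print(plain_page_bytes(ROWS, 8))  # 160000 -> ~160 kB for 8-byte values
print(dict_page_bytes(ROWS, 4))   # 10000  -> ~10 kB with 4-bit dict keys
```

With 16 distinct values, 4-bit keys suffice (2^4 = 16), which is where the 10 kB figure comes from.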

Cheers,
Jan




Am Do., 23. Mai 2024 um 15:05 Uhr schrieb Antoine Pitrou <
anto...@python.org>:

Speaking of which and responding to my own question, parquet-java also
defaults to 1 MiB:

https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49

Regards

Antoine.



On Thu, 23 May 2024 01:39:58 -1000
Jacques Nadeau <jacq...@apache.org> wrote:
I've found that a variable page size based on the expected number of
columns read back is necessary, since read-back memory equals the number of
columns times the page size times the number of files being read
concurrently. So if one is reading back 1000 columns, one may need 1 GB+ of
memory per file for reads. This resulted in sizing things down as width
went up, to avoid spending an excessive budget on read memory, and often
resulted in pages closer to 64k - 128k. (In the work I did, we typically
expected many files to be concurrently read across many requested ops.)
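This sizing rule reduces to simple arithmetic (a sketch with hypothetical helper names; the budget figure is illustrative, not from the original message):

```python
# Read-back memory ~= columns * page_size * concurrently open files,
# assuming one buffered page per column per file.

def readback_memory(columns: int, page_size: int, concurrent_files: int) -> int:
    return columns * page_size * concurrent_files

def page_size_for_budget(budget: int, columns: int, concurrent_files: int) -> int:
    """Largest page size that keeps read-back memory within the budget."""
    return budget // (columns * concurrent_files)

MiB = 1024 * 1024
# 1000 columns at 1 MiB pages in a single file already needs ~1 GiB:
print(readback_memory(1000, 1 * MiB, 1) // MiB)   # 1000 (MiB)
# An illustrative 128 MiB budget for 1000 columns pushes pages down
# to roughly 128 KiB, in line with the 64k-128k range mentioned above:
print(page_size_for_budget(128 * MiB, 1000, 1))   # 134217 bytes
```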

On Wed, May 22, 2024, 11:50 PM Andrew Lamb <
andrewlamb11-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
The Rust implementation uses 1MB pages by default[1]

Andrew

[1]:
https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong
<fokko-1odqgaof3llqfi55v6+...@public.gmane.org> wrote:
Hey Antoine,

Thanks for raising this. In Iceberg we also use the 1 MiB page size:

https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133

Kind regards,
Fokko

Op do 23 mei 2024 om 10:06 schreef Antoine Pitrou <
antoine-+zn9apsxkcednm+yrof...@public.gmane.org>:
Hello,

The Parquet format itself (or at least the README) recommends an 8 kiB
page size, suggesting that data pages are the unit of computation.

However, Parquet C++ has long chosen a 1 MiB page size by default (*),
suggesting that data pages are considered the unit of IO there.

(*) even bumping it to 64 MiB at some point, perhaps by mistake:


https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3

What are the typical choices in other writers?

Regards

Antoine.





