I would argue a slightly different point: the page size is not the unit of compute but the unit of compression.

Small page size = more metadata, better compression ratios
Large page size = less metadata, worse compression ratios

The unit of compute should be decided by the reader, not the writer. The reader is in a better position to determine this.
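To make the point concrete, here is a minimal sketch of how a writer fixes that unit, using the Rust arrow-rs parquet crate (the property setters exist in recent arrow-rs releases; the file name, column, and the particular limits are placeholders of my choosing, not something proposed in this thread):

    use std::fs::File;
    use std::sync::Arc;

    use arrow_array::{ArrayRef, Int64Array, RecordBatch};
    use parquet::arrow::ArrowWriter;
    use parquet::file::properties::WriterProperties;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // The writer decides the compression unit up front: pages are
        // capped at 1 MiB of encoded data and at 20,000 rows, whichever
        // limit is hit first.
        let props = WriterProperties::builder()
            .set_data_page_size_limit(1024 * 1024)
            .set_data_page_row_count_limit(20_000)
            .build();

        let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0..1_000_000));
        let batch = RecordBatch::try_from_iter([("x", col)])?;

        let mut writer = ArrowWriter::try_new(
            File::create("example.parquet")?,
            batch.schema(),
            Some(props),
        )?;
        writer.write(&batch)?;
        writer.close()?;
        Ok(())
    }

Whatever the writer picks here becomes the smallest unit a reader can decompress independently; the reader cannot subdivide it afterwards.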
On Thu, May 23, 2024 at 6:50 AM Raphael Taylor-Davies <[email protected]> wrote:

> The Rust implementation supports limiting the number of rows in a page,
> although this is disabled by default. If there is consensus that 20,000
> is the recommended limit, I don't see any issue with changing this default.
>
> On 23/05/2024 14:39, Jan Finis wrote:
> > Addendum, since Fokko mentioned Iceberg.
> >
> > Iceberg does the same, also applying a 20000-row limit by default:
> > https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67
> >
> > On Thu, May 23, 2024 at 3:38 PM Jan Finis <[email protected]> wrote:
> >
> >> The 1 MiB page size limit of parquet-mr is a red herring. parquet-mr
> >> (now parquet-java) actually writes *way smaller* pages by default.
> >> parquet-mr actually has *three limits* for deciding when to finish a page:
> >>
> >> - The size limit, which is 1 MiB by default, as you mention
> >>   (DEFAULT_PAGE_SIZE).
> >> - A value limit, which is INT_MAX / 2 by default, so not really a
> >>   limit if the default is used (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD).
> >> - A row count limit, which is 20000 by default
> >>   (DEFAULT_PAGE_ROW_COUNT_LIMIT). This limit will, in practice, be hit
> >>   *way* before the 1 MiB page size limit is reached.
> >>
> >> (See
> >> https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238
> >> for the code that checks all three limits.)
> >>
> >> Thus, the page size limit is rather an upper bound for very large values
> >> (e.g., long strings) or very many values in the case of nested columns.
> >> It will usually not be reached at all for a normal non-nested,
> >> non-long-string column.
> >>
> >> Rather, the pages will actually be quite small due to the 20000-row
> >> limit. E.g., in PLAIN encoding, a page without any R and D levels would
> >> be 80 kB for 4-byte values and 160 kB for 8-byte values. And this is
> >> *before* applying compression. If your values compress very well, or if
> >> you use an encoding that is way smaller (e.g., dict), pages will be way
> >> smaller. E.g., say you only have 16 distinct values in the page; then
> >> dictionary encoding with 4-bit keys will be used, leading to a page of
> >> only 10 kB, even if there aren't any runs in it. As some data types
> >> compress very well (either due to RLE (dict keys), DELTA_*, or due to
> >> black-box compression applied on top), I have seen many pages < 1 kB in
> >> practice.
> >>
> >> Cheers,
> >> Jan
> >>
> >> On Thu, May 23, 2024 at 3:05 PM Antoine Pitrou <[email protected]> wrote:
> >>
> >>> Speaking of which, and responding to my own question: parquet-java
> >>> also defaults to 1 MiB:
> >>>
> >>> https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>> On Thu, 23 May 2024 01:39:58 -1000
> >>> Jacques Nadeau <[email protected]> wrote:
> >>>
> >>>> I've found that a variable page size, based on the expected number
> >>>> of columns read back, is necessary, since you'll need read memory
> >>>> equal to the number of columns times the page size times the number
> >>>> of files being read concurrently. So if one is reading back 1000
> >>>> columns, one may need 1 GB+ of memory per file for reads. This
> >>>> resulted in sizing things down as width went up, to avoid spending
> >>>> an excessive budget on read memory. This often resulted in pages
> >>>> closer to 64 kB - 128 kB. (In the work I did, we typically expected
> >>>> many files to be read concurrently across many requested ops.)
> >>>>
> >>>> On Wed, May 22, 2024, 11:50 PM Andrew Lamb <[email protected]> wrote:
> >>>>
> >>>>> The Rust implementation uses 1 MB pages by default [1].
> >>>>>
> >>>>> Andrew
> >>>>>
> >>>>> [1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
> >>>>>
> >>>>> On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong <[email protected]> wrote:
> >>>>>
> >>>>>> Hey Antoine,
> >>>>>>
> >>>>>> Thanks for raising this. In Iceberg we also use the 1 MiB page size:
> >>>>>>
> >>>>>> https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Fokko
> >>>>>>
> >>>>>> On Thu, May 23, 2024 at 10:06 AM Antoine Pitrou <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> The Parquet format itself (or at least the README) recommends an
> >>>>>>> 8 kiB page size, suggesting that data pages are the unit of
> >>>>>>> computation.
> >>>>>>>
> >>>>>>> However, Parquet C++ has long chosen a 1 MiB page size by default
> >>>>>>> (*), suggesting that data pages are considered the unit of IO there.
> >>>>>>>
> >>>>>>> (*) even bumping it to 64 MiB at some point, perhaps by mistake:
> >>>>>>> https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
> >>>>>>>
> >>>>>>> What are the typical choices in other writers?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> Antoine.
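Jan's arithmetic above is easy to verify. A back-of-the-envelope sketch, ignoring page headers and R/D levels as his examples do:

    // Rough encoded sizes for a 20,000-row page, before compression.
    fn main() {
        let rows: u64 = 20_000;

        // PLAIN encoding, fixed-width values.
        println!("PLAIN 4-byte: {} bytes", rows * 4); // 80,000  ~ 80 kB
        println!("PLAIN 8-byte: {} bytes", rows * 8); // 160,000 ~ 160 kB

        // Dictionary encoding: 16 distinct values fit in 4-bit keys, so
        // the data page is 20,000 packed 4-bit indices.
        println!("DICT 4-bit:   {} bytes", rows * 4 / 8); // 10,000 ~ 10 kB
    }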

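Jacques's read-memory budget works the same way; a sketch of the estimate he describes (assuming roughly one buffered page per column per open file, which is my reading of his formula):

    // Approximate decode-side buffering for wide scans.
    fn read_buffer_bytes(columns: u64, page_bytes: u64, open_files: u64) -> u64 {
        columns * page_bytes * open_files
    }

    fn main() {
        const MIB: u64 = 1 << 20;
        // 1000 columns at 1 MiB pages, one open file: ~1 GiB.
        println!("{} MiB", read_buffer_bytes(1000, MIB, 1) / MIB);
        // Shrinking pages to 128 kiB cuts that to ~125 MiB.
        println!("{} MiB", read_buffer_bytes(1000, 128 * 1024, 1) / MIB);
    }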