Thanks for the thorough analysis.

I would suggest not worrying too much about point lookups when choosing the
page size.  If point lookups are very important then I hope users will be
able to use just BtrBlocks / Procella style encodings (e.g. bit packing,
frame of reference, FSST, dictionary, etc.) instead of block compression.
These encodings support "random access scheduling", so the page size is
irrelevant and can be based on whatever size is good for statistics.  I hope
to have an apples-to-apples comparison of some of this in Lance by the end
of the summer and will share what I learn.
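
To make the "random access scheduling" point concrete, here is a minimal
Python sketch (illustrative only, not Lance or Parquet code) of why frame of
reference plus fixed-width packing lets you fetch one value without decoding
the rest of the page, whereas a block-compressed page has to be decompressed
wholesale:

    # Frame of reference: store a base plus fixed-width deltas.
    def for_encode(values):
        base = min(values)
        deltas = [v - base for v in values]
        width = max(deltas).bit_length() or 1   # bits per packed value
        return base, width, deltas              # deltas stand in for the packed bits

    def for_get(encoded, i):
        """Fetch value i without touching any other value."""
        base, width, deltas = encoded
        # With real bit packing this is a fixed-offset read of `width` bits
        # starting at bit i * width.
        return base + deltas[i]

    encoded = for_encode([1000, 1003, 1001, 1042, 1007])
    assert for_get(encoded, 3) == 1042   # one value, no full-page decode

A snappy/zstd-style compressed page, by contrast, only gives you the whole
page back, so its size directly determines the cost of a single lookup.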

If point lookups are somewhat important to a user then I think 20000 rows
generally works OK, especially if the point lookups can be batched in some
way, because then you're probably going to be coalescing reads anyway (this
is less true when storing "big" types like vector embeddings, but that can
be tuned on a per-column basis).


On Thu, May 23, 2024 at 9:30 AM Ed Seidl <etse...@live.com> wrote:

> Great analysis Jan, thanks. As to your last question, I'd suggest that a
> row limit was chosen to match the row-based nature of the page indexes. I
> agree that a value limit would be nice too, but setting a low page size
> limit can handle that as well (just more coarsely).
>
> FWIW, in some of our early testing of a system where we wanted good scan
> and point lookup performance, we settled on a 64KiB page size limit to
> go with the 20K row count. As with all things, however, YMMV :)
>
> Ed
>
> On 5/23/24 8:50 AM, Jan Finis wrote:
> > Now to the implied question: Is the default good? Or if not, what would
> > be a good default?
> >
> > Weston gave the very high level argument. Let me try to tease it apart a
> > bit more to get a good common understanding of the trade-offs. I'll also
> > try to estimate which point is relevant and which is rather academic, and
> > what a resulting good default would be.
> >
> > The considerations that benefit from a *bigger* page size:
> >
> >     - *[Most relevant] Per-page decoding overhead:* When scanning, the
> >     page should be big enough to make the constant work that has to be
> >     performed per page negligible in contrast to the work required to
> >     read the actual values. This work is, among other things:
> >        - Decoding the page header (basically reading some thrift varsize
> >        integers)
> >        - Choosing and instantiating the decoders for data, R, and D
> >        levels
> >        - If applicable, checking the statistics (min/max) to apply scan
> >        restrictions
> >        - "Lost opportunity" cost for leaving the hot decoding loop of the
> >        previous page, i.e., the vectorized, unrolled loop ends, etc.
> >
> >        This is the most important consideration of all, I would say, but
> >        it is only relevant once the page size gets quite small. Once your
> >        page reaches a certain number of values, the handling of the
> >        actual values will dominate the decoding. I would argue (from
> >        doing a lot of experiments on Parquet and our own, comparable
> >        format) that 20000 is in the realm that should make all these
> >        costs mostly negligible. Maybe cranking it up a bit further (e.g.,
> >        to 100000) would give you some more percent of performance, but
> >        you will not see much difference.
> >
> >     - *[Somewhat relevant] Per-page size overhead:* The page should be
> >     big enough so that the space overhead of the page header is not too
> >     high. A page header is in the ballpark of around 10 bytes without
> >     statistics and maybe 20-30 with statistics (a very rough
> >     back-of-the-envelope guesstimate; see the sketch after this list).
> >     Thus, a page in the kilobytes should be big enough (~1% header
> >     overhead), so for uncompressed PLAIN pages, we should be good with
> >     20000 values. If the page compresses heavily (say, super long RLE
> >     runs), then the page header might end up being a more significant
> >     percentage of all data, say 10%. But note that we then already
> >     compress very well, so it's 10% of very little memory to begin with.
> >     You can construct a worst case in which we compress basically
> >     infinitely (say, one long RLE run) and then the header size becomes
> >     very significant. But I would argue that this case isn't too
> >     relevant, as this column's memory usage will be super low, even if
> >     you pay a lot of overhead for the header in this case.
> >
> >     - *[Somewhat relevant] Compression performance:* Most black-box
> >     compression schemes are better when they can compress more memory at
> >     once. I don't have numbers here, but I guess we will compress some
> >     percent worse (say 5-20%) if we compress a few kilobytes instead of a
> >     megabyte at once.
> >
> >     - *[Mostly irrelevant] Encoding disadvantages:* Certain encodings are
> >     "broken up" by beginning a new page, reducing their efficiency. RLE
> >     needs to prematurely end its current run or bit-packed group,
> >     DELTA_BINARY_PACKED needs to prematurely end its current mini block,
> >     etc. Here again, 20000 is big enough that this doesn't matter at all.
> >     This would matter if we just had 10 values in the page, but with
> >     20000, this effect should mostly be nil.
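> >
> >     A quick back-of-the-envelope Python sketch of the header-overhead
> >     argument above (the 30-byte header is the rough guess from that
> >     bullet, not a measured number):
> >
> >         values_per_page = 20_000
> >         payload = values_per_page * 4        # PLAIN, 4-byte values: 80_000 bytes
> >         header = 30                          # guesstimated header with statistics
> >         print(header / (payload + header))   # ~0.0004 -> negligible
> >         # Even if the page compresses 100x, the header is ~3.6% of what is
> >         # stored, but 3.6% of ~830 bytes is still almost nothing in
> >         # absolute terms.
> >         print(header / (payload / 100 + header))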
> >
> > Considerations for a *smaller* page size:
> >
> >     - *[Relevant, if you have point lookups] Point lookups:* By using
> >     the offset index, we can load and decode single pages. For a point
> >     lookup that's searching for a single row, a smaller page is therefore
> >     better. On the I/O side, we're mostly optimal once we're smaller than
> >     the block size of our medium (say 4 KiB for SSD). On the decoding
> >     side, the smaller the better. Some encodings allow random access but
> >     others do not (RLE (and therefore DICT), DELTA_*). Compression is
> >     also all or nothing, so once you compress your pages, you never have
> >     random access into a page, so smaller is better.
> >
> >     If you have a scenario that is point-lookup heavy, then 20000 is
> >     almost too much and you would rather have something like 1000 or even
> >     less. But I guess we should not tune the defaults for point lookups,
> >     but rather for mostly scans while keeping point lookups in mind. For
> >     this, 20000 seems to be a good trade-off, as scans do not suffer,
> >     while it's still okay for point access. 1 million values would be way
> >     worse here, speaking against the current default of parquet-rs.
> >
> >     - *[Relevant, if you have scan restrictions] More fine-grained
> >     min/max bounds:* The point lookups above are an extreme case. Much
> >     more common is the case that you have a scan with scan restrictions.
> >     The min/max synopses of the pages / the page index can be used to
> >     skip pages that cannot have matches. Here, finer-grained pages allow
> >     for tighter min/max bounds, so they are at an advantage. E.g., if you
> >     store all values x from 1 to 1 million (in ascending order) and now
> >     have a query WHERE x BETWEEN 100000 AND 199999, we could skip all but
> >     five pages if we had a page size of 20000, but we would need to read
> >     everything if our page size was 1 million. (A small sketch of this
> >     follows this list.)
> >
> >
> >     - *[Somewhat relevant] Memory usage while decoding:* A reader will
> >     usually need at least memory proportional to the page size (e.g., for
> >     decompressing a compressed page). You could argue that 1 MiB is still
> >     not too much, and I agree. A naive reader could, however, also
> >     require memory proportional to the number of values in the page
> >     (e.g., it could decode the whole page into a buffer). So you want to
> >     have an upper bound on the number of values, to avoid cases where an
> >     infinite compression (e.g., a single RLE run) makes a page contain an
> >     unbounded number of values.
> >
> >
> >     - *[Somewhat relevant] Memory usage and complexity while encoding:*
> >     Encoders also have to work harder for larger pages. E.g., a
> >     dictionary encoder has to find out which dictionary ids the values
> >     would have if the page were encoded by it. This either requires a
> >     reverse dictionary or sorting the values. This is faster for fewer
> >     values (the reverse dictionary fits into cache, sorting is
> >     O(n log n)).
> >
> >     - *[Could be relevant, but only with an optimized implementation]
> >     Fine-grained encoding choice:* As each page may use a different
> >     encoding, we can react better to skew in the data distribution by
> >     being able to switch the encoding more often. E.g., if we first only
> >     have unique values and later a lot of repeated values, we could first
> >     use PLAIN and later DICTIONARY. This is mostly irrelevant for now, as
> >     the library (at least parquet-mr, I haven't checked the others) isn't
> >     this clever. It has very simple fallback rules (first dict, and if
> >     that doesn't work, one fallback encoding) which are unlikely to take
> >     advantage of a more fine-grained encoding choice. But in a perfect
> >     world where the writer chooses encodings very carefully, more
> >     fine-grained encoding is valuable. E.g., the BtrBlocks paper chooses
> >     the encoding per page by sampling the page contents. With this
> >     strategy, smaller pages could make a difference.
> >
> >     - *[Quite relevant if the reader supports it] Intra-row-group
> >     parallelism:* See our v3 discussion thread. If you want to
> >     parallelize inside a row group, then you will have concurrent readers
> >     that have to read *parts* of a page. And since compression and some
> >     encodings (e.g., RLE and thus DICT, DELTA_*) do not allow you to have
> >     O(1) random access into a page, scanning parts of a page is easier if
> >     the page is small.
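> >
> >     To make the min/max example above concrete, a small Python sketch
> >     (hypothetical page index; values 0..999,999 stored in ascending
> >     order, 20000 rows per page, so page boundaries align with the
> >     predicate):
> >
> >         # Skip pages whose [min, max] range cannot match the predicate.
> >         page_rows, rows = 20_000, 1_000_000
> >         pages = [(k * page_rows, k * page_rows + page_rows - 1)  # (min, max)
> >                  for k in range(rows // page_rows)]
> >         lo, hi = 100_000, 199_999          # WHERE x BETWEEN lo AND hi
> >         hits = [k for k, (mn, mx) in enumerate(pages) if mx >= lo and mn <= hi]
> >         print(len(hits), "of", len(pages), "pages read")   # 5 of 50
> >
> >     With a single 1-million-value page there is only one (min, max) entry
> >     covering everything, so nothing can be skipped.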
> >
> > These are all the arguments I have considered so far. All in all, the
> > conclusion I draw from this is that the current defaults of parquet-mr
> > are fine. 20000 seems like a sane in-between value, with the 1 MiB page
> > size limit ensuring that long strings do not create huge pages. 1
> > million values would be too much IMHO. It's okay if you only do scans,
> > do not use min/max values often, and do not use intra-row-group
> > parallelism, but I would say it's not a good default, as some readers
> > will have some of these use cases.
> >
> > One final questionable point to me is that parquet-mr applies 20000 as a
> > *row limit*, not as a *value limit*. Consequently, pages of nested
> > columns will not abide by the 20000 limit, but can have many more
> > values. Whether this is a good idea is questionable to me. I think I
> > myself would rather apply a 20000 value limit, so that pages in nested
> > columns also abide by the same rules. I don't see a big argument why you
> > would not want these limits for nested columns.
> >
> > Cheers,
> > Jan
> >
> > On Thu, May 23, 2024 at 5:28 PM Ed Seidl <etse...@live.com> wrote:
> >
> >> I haven't seen it mentioned in this thread, but for the curious the
> >> 20000 row limit appears to come from a 2020 blog post by Cloudera [1]
> >> (in the section "Testing with Parquet-MR").
> >>
> >> Cheers,
> >> Ed
> >>
> >> [1]
> >> https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
> >>
> >> On 5/23/24 6:50 AM, Raphael Taylor-Davies wrote:
> >>> The rust implementation supports limiting the number of rows in a
> >>> page, although this is disabled by default. If there is consensus that
> >>> 20,000 is the recommended limit, I don't see any issue with changing
> >>> this default.
> >>>
> >>> On 23/05/2024 14:39, Jan Finis wrote:
> >>>> Addendum, since Fokko mentioned Iceberg.
> >>>>
> >>>> Iceberg does the same, also applying a 20000 row limit by default
> >>>>
> >>>> (https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67)
> >>>>
> >>>> On Thu, May 23, 2024 at 3:38 PM Jan Finis <jpfi...@gmail.com> wrote:
> >>>>> The 1 MiB page size limit of parquet-mr is a red herring. Parquet-mr
> >>>>> (now parquet-java) actually writes *way smaller* pages by default.
> >>>>> parquet-mr actually has *three limits* for deciding when to finish a
> >>>>> page:
> >>>>>
> >>>>>      - The size limit, which is 1 MiB by default, as you mention
> >>>>>      (DEFAULT_PAGE_SIZE).
> >>>>>      - A value limit, which is INT_MAX / 2 by default, so not really a
> >>>>>      limit if the default is used
> >>>>>      (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD).
> >>>>>      - A row count limit, which is 20000 by default
> >>>>>      (DEFAULT_PAGE_ROW_COUNT_LIMIT).
> >>>>>      This limit will, in practice, be hit *way* before the page size
> >>>>>      limit of 1 MiB is reached.
> >>>>>
> >>>>> (See
> >>>>> https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238
> >>>>> for the code that checks all three limits.)
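> >>>>>
> >>>>> In Python-flavored pseudocode, the flush decision is roughly the
> >>>>> following (a sketch of the logic only, not the actual Java code;
> >>>>> whichever limit trips first closes the page):
> >>>>>
> >>>>>     DEFAULT_PAGE_SIZE = 1024 * 1024                          # 1 MiB
> >>>>>     DEFAULT_PAGE_VALUE_COUNT_THRESHOLD = (2**31 - 1) // 2    # INT_MAX / 2
> >>>>>     DEFAULT_PAGE_ROW_COUNT_LIMIT = 20_000
> >>>>>
> >>>>>     def should_flush_page(buffered_bytes, value_count, row_count):
> >>>>>         return (buffered_bytes >= DEFAULT_PAGE_SIZE
> >>>>>                 or value_count >= DEFAULT_PAGE_VALUE_COUNT_THRESHOLD
> >>>>>                 or row_count >= DEFAULT_PAGE_ROW_COUNT_LIMIT)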
> >>>>>
> >>>>> Thus, the page size limit is rather an upper bound for very large
> >>>>> values (e.g., long strings) or very many values in case of nested
> >>>>> columns. It will usually not be reached at all for a normal
> >>>>> non-nested, non-long-string column.
> >>>>>
> >>>>> Rather, the pages will actually be quite small due to the 20000 row
> >>>>> limit. E.g., in PLAIN encoding, a page without any R and D levels
> >>>>> would be 80 kB for 4-byte values and 160 kB for 8-byte values. And
> >>>>> this is *before* applying compression. If your values compress very
> >>>>> well, or if you use an encoding that is way smaller (e.g., dict), the
> >>>>> pages will be way smaller. E.g., say you only have 16 distinct values
> >>>>> in the page; then dictionary encoding with 4-bit keys will be used,
> >>>>> leading to a page of only 10 kB, even if there aren't any runs in
> >>>>> there. As some data types compress very well (either due to RLE (dict
> >>>>> keys) or DELTA_*, or due to black-box compression applied on top), I
> >>>>> have seen many pages < 1 kB in practice.
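> >>>>>
> >>>>> For reference, the arithmetic above as a quick Python sketch
> >>>>> (uncompressed data sizes only; the dictionary case assumes bit-packed
> >>>>> 4-bit keys and ignores the dictionary page itself and any RLE/header
> >>>>> bytes):
> >>>>>
> >>>>>     rows = 20_000
> >>>>>     print(rows * 4)       # PLAIN, 4-byte values:  80_000 bytes (~80 kB)
> >>>>>     print(rows * 8)       # PLAIN, 8-byte values: 160_000 bytes (~160 kB)
> >>>>>     print(rows * 4 // 8)  # 16 distinct values, 4-bit keys: 10_000 bytes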
> >>>>>
> >>>>> Cheers,
> >>>>> Jan
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Thu, May 23, 2024 at 3:05 PM Antoine Pitrou <anto...@python.org> wrote:
> >>>>>
> >>>>>> Speaking of which and responding to my own question, parquet-java
> also
> >>>>>> defaults to 1 MiB:
> >>>>>>
> >>>>>>
> >>>>>> https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49
> >>>>>>
> >>>>>> Regards
> >>>>>>
> >>>>>> Antoine.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, 23 May 2024 01:39:58 -1000
> >>>>>> Jacques Nadeau <jacq...@apache.org> wrote:
> >>>>>>> I've found that a variable page size based on the expected number of
> >>>>>>> columns read back is necessary, since you'll need read-back memory
> >>>>>>> equal to the number of columns times the page size times the number
> >>>>>>> of files being read concurrently. So if one is reading back 1000
> >>>>>>> columns, one may need 1 GB+ of memory per file for reads. This
> >>>>>>> resulted in sizing things down as width went up, to avoid spending
> >>>>>>> an excessive budget on read memory. This often resulted in pages
> >>>>>>> closer to 64k - 128k. (In the work I did, we typically expected many
> >>>>>>> files to be concurrently read across many requested ops.)
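> >>>>>>>
> >>>>>>> As a rough worked example of that budget (illustrative numbers; a
> >>>>>>> single file being read), in Python:
> >>>>>>>
> >>>>>>>     columns, page_size, files = 1000, 1024 * 1024, 1
> >>>>>>>     print(columns * page_size * files / 2**30)  # ~0.98 GiB per file read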
> >>>>>>>
> >>>>>>> On Wed, May 22, 2024, 11:50 PM Andrew Lamb <andrewlamb11-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >>>>>>>> The Rust implementation uses 1MB pages by default[1]
> >>>>>>>>
> >>>>>>>> Andrew
> >>>>>>>>
> >>>>>>>> [1]:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
> >>>>>>>> On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong <fokko-1odqgaof3llqfi55v6+...@public.gmane.org> wrote:
> >>>>>>>>> Hey Antoine,
> >>>>>>>>>
> >>>>>>>>> Thanks for raising this. In Iceberg we also use the 1 MiB page
> >>>>>>>>> size:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
> >>>>>>
> >>>>>>>>> Kind regards,
> >>>>>>>>> Fokko
> >>>>>>>>>
> >>>>>>>>> On Thu, May 23, 2024 at 10:06 AM Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> The Parquet format itself (or at least the README) recommends an
> >>>>>>>>>> 8 kiB page size, suggesting that data pages are the unit of
> >>>>>>>>>> computation.
> >>>>>>>>>>
> >>>>>>>>>> However, Parquet C++ has long chosen a 1 MiB page size by default
> >>>>>>>>>> (*), suggesting that data pages are considered the unit of I/O
> >>>>>>>>>> there.
> >>>>>>>>>>
> >>>>>>>>>> (*) even bumping it to 64 MiB at some point, perhaps by mistake:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
> >>>>>>
> >>>>>>>>>> What are the typical choices in other writers?
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>> Antoine.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>
> >>>>>>
> >>
>
>
