Re: Typical data page size

2024-05-24 Thread Weston Pace
> I also want to argue that a lot more workloads actually have point(ish) accesses than one would think. Any database scanning Parquet with scan filters can have point(ish) accesses:
I agree. Efficient point accesses are, from one point of view, the top reason we built the lance format in the

Re: Typical data page size

2024-05-24 Thread Jan Finis
> If point lookups are very important then I hope users will be able to use just btr-blocks / procella style encodings (e.g. bit packing, frame of reference, FSST, dictionary, etc.) instead of compression. These encodings support "random access scheduling" and so the page size is irreleva

Re: Typical data page size

2024-05-23 Thread Weston Pace
Thanks for the thorough analysis. I would suggest not worrying too much about point lookups when choosing the page size. If point lookups are very important then I hope users will be able to use just btr-blocks / procella style encodings (e.g. bit packing, frame of reference, FSST, dictionary, etc.

Re: Typical data page size

2024-05-23 Thread Ed Seidl
Great analysis Jan, thanks. As to your last question, I'd suggest that a row limit was chosen to match the row-based nature of the page indexes. I agree that a value limit would be nice too, but setting a low page size limit can handle that as well (just more coarsely). FWIW, in some of our early

Re: Typical data page size

2024-05-23 Thread Raphael Taylor-Davies
One argument for it being a row limit and not a value limit is that records/rows are typically the fundamental unit of most Parquet implementations. As such, providing indexing granularity finer than this is of limited practical benefit, with zone maps, parallelism, etc. oriented around row b

Re: Typical data page size

2024-05-23 Thread Jan Finis
Now to the implied question: Is the default good? Or if not, what would be a good default? Weston gave the very high level argument. Let me try to tease it apart a bit more to get a good common understanding of the trade-offs. I'll also try to estimate which point is relevant and which is rather a

Re: Typical data page size

2024-05-23 Thread Ed Seidl
I haven't seen it mentioned in this thread, but for the curious the 20,000 row limit appears to come from a 2020 blog post by Cloudera [1] (in the section "Testing with Parquet-MR").
Cheers, Ed
[1] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
On 5/23/24 6:

Re: Typical data page size

2024-05-23 Thread Weston Pace
I would argue a slightly different point, which is that the page size is not the unit of compute but the unit of compression.
Small page size = more metadata, better compression ratios
Large page size = less metadata, worse compression ratios
The unit of compute should be decided by the reader, no
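To put rough numbers on the metadata side of this trade-off, here is a minimal sketch that counts how many data pages (and hence page headers and page-index entries) one column chunk would need at the two page sizes mentioned elsewhere in this thread, 8 KiB and 1 MiB. The 128 MiB column chunk size and the function name are assumptions made only for illustration, and the sketch ignores how compression changes the on-disk page size.

```python
KIB = 1024
MIB = 1024 * KIB


def pages_per_chunk(chunk_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to hold chunk_bytes, rounding up."""
    if page_bytes <= 0:
        raise ValueError("page_bytes must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-chunk_bytes // page_bytes)


# Hypothetical 128 MiB uncompressed column chunk.
chunk = 128 * MIB
small_pages = pages_per_chunk(chunk, 8 * KIB)
large_pages = pages_per_chunk(chunk, 1 * MIB)
print(small_pages, large_pages)
```

Each page carries its own header and, if page indexes are written, its own index entry, so under these assumptions the smaller page size multiplies the per-page metadata by the ratio of the two counts.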

Re: Typical data page size

2024-05-23 Thread Raphael Taylor-Davies
The Rust implementation supports limiting the number of rows in a page, although this is disabled by default. If there is consensus that 20,000 is the recommended limit, I don't see any issue with changing this default.
On 23/05/2024 14:39, Jan Finis wrote:
> Addendum, since Fokko mentioned Iceb

Re: Typical data page size

2024-05-23 Thread Andrew Lamb
Likewise, Rust does the same thing (limits pages by page size or row count, whichever is hit first), though the default row limit is 1M [1] (rather than 20,000).
[1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L45
On Thu, May 2
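The "whichever is hit first" rule can be sketched as follows. This is not the code of any Parquet writer: `PageBuffer` and `count_pages` are hypothetical names, and the two limits are simply the 1 MiB and 20,000-row figures quoted in this thread.

```python
from dataclasses import dataclass

# Illustrative limits taken from figures quoted in this thread;
# they are not read from any library.
PAGE_SIZE_LIMIT_BYTES = 1024 * 1024  # 1 MiB
PAGE_ROW_LIMIT = 20_000


@dataclass
class PageBuffer:
    """Hypothetical accumulator for one column's in-progress data page."""
    size_limit: int = PAGE_SIZE_LIMIT_BYTES
    row_limit: int = PAGE_ROW_LIMIT
    buffered_bytes: int = 0
    buffered_rows: int = 0

    def add_row(self, encoded_size: int) -> bool:
        """Buffer one row; return True once either limit is reached."""
        self.buffered_bytes += encoded_size
        self.buffered_rows += 1
        return (self.buffered_bytes >= self.size_limit
                or self.buffered_rows >= self.row_limit)

    def reset(self) -> None:
        """Start a fresh page."""
        self.buffered_bytes = 0
        self.buffered_rows = 0


def count_pages(row_sizes, buf=None):
    """Count the pages produced for a sequence of per-row encoded sizes."""
    buf = buf if buf is not None else PageBuffer()
    pages = 0
    for size in row_sizes:
        if buf.add_row(size):
            pages += 1
            buf.reset()
    if buf.buffered_rows:  # flush the final, partially filled page
        pages += 1
    return pages
```

With 4-byte values, `count_pages([4] * 100_000)` closes a page every 20,000 rows, i.e. at about 80 KB of buffered data, so the row limit rather than the 1 MiB size limit decides the page size. With 1 KiB values the size limit is reached first, after 1,024 rows.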

Re: Typical data page size

2024-05-23 Thread Jan Finis
Addendum, since Fokko mentioned Iceberg. Iceberg does the same, also applying a 20,000 row limit by default ( https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67 )
On Thu, 23 May 2024 at 15:38

Re: Typical data page size

2024-05-23 Thread Jan Finis
The 1 MiB page size limit of parquet-mr is a red herring. Parquet-mr (now parquet-java) actually writes *way smaller* pages by default. parquet-mr actually has *three limits* for deciding when to finish a page:
- The size limit, which is 1 MiB by default, as you mention. (DEFAULT_PAGE_SIZE)

Re: Typical data page size

2024-05-23 Thread Antoine Pitrou
Speaking of which, and responding to my own question, parquet-java also defaults to 1 MiB: https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49
Regards
Antoine.
On Thu, 23 May 202

Re: Typical data page size

2024-05-23 Thread Jacques Nadeau
I've found that a variable page size based on the expected number of columns read back is necessary, since you'll need read-back memory equal to the number of columns times the page size times the number of concurrent files being read. So if one is reading back 1000 columns one may need 1 GB+ of memory per file for re
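The estimate above (columns times page size times concurrent files) is easy to check with a few lines. This is a back-of-the-envelope sketch only: the function names are made up for illustration, and it assumes one buffered page per column per open file, ignoring decompression buffers and any other reader overhead.

```python
MIB = 1024 * 1024


def readback_memory_bytes(num_columns: int, page_size_bytes: int,
                          concurrent_files: int = 1) -> int:
    """Memory needed to hold one page per column for each open file."""
    return num_columns * page_size_bytes * concurrent_files


def max_page_size_bytes(memory_budget_bytes: int, num_columns: int,
                        concurrent_files: int = 1) -> int:
    """Largest page size that keeps the estimate within a memory budget."""
    return memory_budget_bytes // (num_columns * concurrent_files)


# 1000 columns at 1 MiB pages, one file: roughly 1 GB.
one_file = readback_memory_bytes(1000, 1 * MIB)
print(one_file / MIB, "MiB")

# Working backwards from a hypothetical 256 MiB budget for the same 1000 columns.
print(max_page_size_bytes(256 * MIB, 1000), "bytes per page")
```

Run the other way, the same arithmetic gives the page size a writer would have to target for a given reader budget, which is the sense in which the page size becomes a function of the expected column count.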

Re: Typical data page size

2024-05-23 Thread Andrew Lamb
The Rust implementation uses 1 MB pages by default [1].
Andrew
[1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong wrote:
> Hey Antoine,
>
> Thanks for raising this. In Iceber

Re: Typical data page size

2024-05-23 Thread Fokko Driesprong
Hey Antoine,
Thanks for raising this. In Iceberg we also use the 1 MiB page size: https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
Kind regards, Fokko
On Thu, 23 May 2024 at 10:06, Antoine Pitrou wrote

Typical data page size

2024-05-23 Thread Antoine Pitrou
Hello,
The Parquet format itself (or at least the README) recommends an 8 KiB page size, suggesting that data pages are the unit of computation. However, Parquet C++ has long chosen a 1 MiB page size by default (*), suggesting that data pages are considered the unit of IO there.
(*) even bum