Thanks to Weston for the clear explanation. Point 2 is what I've experienced without tuning the parameter, and point 3 is what I'm concerned about. Looking forward to finer-grained reading/indexing of Parquet than the row-group level, which should fix the issue.
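
In case it helps anyone else, here is a rough sketch of what I'm doing in the meantime: passing an explicit row_group_size (a row count, not a byte size) to pyarrow.parquet.write_table, roughly in line with the ~1M rows per row group Weston suggests below. The table and file name are placeholders, not real data; a dataset-level variant is sketched at the bottom of this mail.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder table; in practice this is the (much larger) table being written.
    table = pa.table({"id": list(range(10)), "value": [float(i) for i in range(10)]})

    # row_group_size counts rows, not bytes; the default (64 * 1024 * 1024 rows)
    # effectively produces one huge row group for most tables, so set it explicitly.
    pq.write_table(table, "example.parquet", row_group_size=1024 * 1024)
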
Weston Pace <[email protected]> 于2022年2月23日周三 13:30写道: > These are all great points. A few notes from my own experiments > (mostly confirming what others have said): > > 1) 1M rows is the minimum safe size for row groups on HDD (and > perhaps a hair too low in some situations) if you are doing any kind > of column selection (i.e. projection pushdown). As that number gets > lower the ratio of skips to reads increases to the point where it > starts to look too much like "random read" for the HDD and performance > suffers. > 2) 64M rows is too high for two (preventable) reasons in the C++ > datasets API. The first reason is that the datasets API does not > currently support sub-row-group streaming reads (e.g. we can only read > one row group at a time from the disk). Large row groups leads to too > much initial latency (have to read an entire row group before we start > processing) and too much RAM usage. > 3) 64M rows is also typically too high for the C++ datasets API > because, as Micah pointed out, we don't yet have support in the > datasets API for page-level column indices. This means that > statistics-based filtering is done at the row-group level and very > coarse grained. The bigger the block the less likely it is that a > filter will eclipse the entire block. > > Points 2 & 3 above are (I'm fairly certain) entirely fixable. I've > found reasonable performance with 1M rows per row group and so I > haven't personally been as highly motivated to fix the latter two > issues but they are somewhat high up on my personal priority list. If > anyone has time to devote to working on these issues I would be happy > to help someone get started. Ideally, if we can fix those two issues, > then something like what Micah described (one row group per file) is > fine and we can help shield users from frustrating parameter tuning. > > I have a draft of a partial fix for 2 at [1][2]. I expect I should be > able to get back to it before the 8.0.0 release. I couldn't find an > issue for the more complete fix (scanning at page-resolution instead > of row-group-resolution) so I created [3]. > > A good read for the third point is at [4]. I couldn't find a JIRA > issue for this from a quick search but I feel that we probably have > some issues somewhere. > > [1] https://github.com/apache/arrow/pull/12228 > [2] https://issues.apache.org/jira/browse/ARROW-14510 > [3] https://issues.apache.org/jira/browse/ARROW-15759 > [4] > https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ > > On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <[email protected]> wrote: > > > > Hi, thank you for your reply. The confusion comes from what Micah > mentions: the row_group_size in pyarrow is 64M rows, instead of 64MB. In > that case it does not align will Hadoop block size unless you have only 1B > per row. So in most case the row group will be very large than 64MB. I > think this parameter uses num of rows instead of size already brings > confusion when I read the doc and change the parameter. > > > > I dont understand clearly why the current thinking is to have 1 > row-group per file? Could you explain more? > > > > Micah Kornfield <[email protected]> 于2022年2月23日周三 03:52写道: > >>> > >>> What is the reason for this? Do you plan to change the default? 
> >> > >> > >> I think there is some confusion, I do believe this is the number of > rows but I'd guess it was set to 64M because it wasn't carefully adapted > from parquet-mr which I would guess uses byte size and therefore it aligns > well with the HDFS block size. > >> > >> I don't recall seeing any open issues to change it. It looks like for > datasets [1] the default is 1 million, so maybe we should try to align > these two. I don't have a strong opinion here, but my impression is that > the current thinking is to generally have 1 row-group per file, and > eliminate entire files. For sub-file pruning, I think column indexes are a > better solution but they have not been implemented in pyarrow yet. > >> > >> [1] > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset > >> > >> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek < > [email protected]> wrote: > >>> > >>> hi Shawn, > >>> > >>> I expect this is the default because Parquet comes from the Hadoop > ecosystem, and the Hadoop block size is usually set to 64MB. Why would you > need a different default? You can set it to the size that fits your use > case best, right? > >>> > >>> Marnix > >>> > >>> > >>> > >>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote: > >>>> > >>>> For a clarification, I am referring to pyarrow.parquet.write_table > >>>> > >>>> Shawn Zeng <[email protected]> 于2022年2月22日周二 20:40写道: > >>>>> > >>>>> Hi, > >>>>> > >>>>> The default row_group_size is really large, which means a large > table smaller than 64M rows will not get the benefits of row group level > statistics. What is the reason for this? Do you plan to change the default? > >>>>> > >>>>> Thanks, > >>>>> Shawn >
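
And the dataset-level variant mentioned above, assuming a pyarrow version whose pyarrow.dataset.write_dataset exposes the min_rows_per_group/max_rows_per_group options (the table, path, and values here are illustrative only, not a recommendation):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Placeholder table standing in for the real data.
    table = pa.table({"id": list(range(10)), "value": [float(i) for i in range(10)]})

    # Keep row groups around 1M rows so row-group-level statistics stay useful,
    # rather than relying on the writer's defaults.
    ds.write_dataset(
        table,
        base_dir="example_dataset",
        format="parquet",
        min_rows_per_group=64 * 1024,
        max_rows_per_group=1024 * 1024,
    )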
