Thanks, Weston, for the clear explanation. Point 2 is what I've experienced
without tuning the parameter, and point 3 is what I'm concerned about. Looking
forward to finer granularity of reading/indexing Parquet than the row group
level, which should fix the issue.

Weston Pace <[email protected]> wrote on Wed, Feb 23, 2022 at 13:30:

> These are all great points.  A few notes from my own experiments
> (mostly confirming what others have said):
>
>  1) 1M rows is the minimum safe size for row groups on HDD (and
> perhaps a hair too low in some situations) if you are doing any kind
> of column selection (i.e. projection pushdown).  As that number gets
> lower the ratio of skips to reads increases to the point where it
> starts to look too much like "random read" for the HDD and performance
> suffers.
>  2) 64M rows is too high for two (preventable) reasons in the C++
> datasets API.  The first reason is that the datasets API does not
> currently support sub-row-group streaming reads (e.g. we can only read
> one row group at a time from the disk).  Large row groups lead to too
> much initial latency (have to read an entire row group before we start
> processing) and too much RAM usage.
>  3) 64M rows is also typically too high for the C++ datasets API
> because, as Micah pointed out, we don't yet have support in the
> datasets API for page-level column indices.  This means that
> statistics-based filtering is done at the row-group level and is very
> coarse-grained.  The bigger the block, the less likely it is that a
> filter will eclipse the entire block (see the sketch after this list).
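>
> For illustration, here is a minimal sketch of the kind of scan these
> points are about; the dataset path, column names, and filter value are
> placeholders:
>
> import pyarrow.dataset as ds
>
> # Scan a Parquet dataset with projection and filter pushdown.
> # Only the selected columns are read from disk, and a row group can
> # only be skipped when its statistics show the filter cannot match
> # any of its rows -- so today the skip granularity is the row group,
> # not the page.
> dataset = ds.dataset("data/", format="parquet")
> table = dataset.to_table(
>     columns=["id", "value"],          # projection pushdown (point 1)
>     filter=ds.field("value") > 100,   # row-group-level statistics filtering (point 3)
> )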
>
> Points 2 & 3 above are (I'm fairly certain) entirely fixable.  I've
> found reasonable performance with 1M rows per row group and so I
> haven't personally been as highly motivated to fix the latter two
> issues but they are somewhat high up on my personal priority list.  If
> anyone has time to devote to working on these issues I would be happy
> to help someone get started.  Ideally, if we can fix those two issues,
> then something like what Micah described (one row group per file) is
> fine and we can help shield users from frustrating parameter tuning.
>
> I have a draft of a partial fix for 2 at [1][2].  I expect I should be
> able to get back to it before the 8.0.0 release.  I couldn't find an
> issue for the more complete fix (scanning at page-resolution instead
> of row-group-resolution) so I created [3].
>
> A good read for the third point is at [4].  I couldn't find a JIRA
> issue for this from a quick search but I feel that we probably have
> some issues somewhere.
>
> [1] https://github.com/apache/arrow/pull/12228
> [2] https://issues.apache.org/jira/browse/ARROW-14510
> [3] https://issues.apache.org/jira/browse/ARROW-15759
> [4]
> https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>
> On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <[email protected]> wrote:
> >
> > Hi, thank you for your reply. The confusion comes from what Micah
> mentions: the row_group_size in pyarrow is 64M rows, not 64MB. In that
> case it does not align with the Hadoop block size unless you have only
> 1 byte per row, so in most cases the row group will be much larger than
> 64MB. I think the fact that this parameter counts rows instead of bytes
> already caused confusion when I read the docs and changed the parameter.
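> >
> > For illustration, a quick way to see the rows-vs-bytes distinction is
> > to inspect the file metadata (the filename here is a placeholder):
> >
> > import pyarrow.parquet as pq
> >
> > # Each row group reports both a row count and a byte size;
> > # row_group_size controls the former, not the latter.
> > md = pq.ParquetFile("example.parquet").metadata
> > for i in range(md.num_row_groups):
> >     rg = md.row_group(i)
> >     print(i, rg.num_rows, rg.total_byte_size)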
> >
> > I don't clearly understand why the current thinking is to have 1
> row-group per file. Could you explain more?
> >
> > Micah Kornfield <[email protected]> wrote on Wed, Feb 23, 2022 at 03:52:
> >>>
> >>> What is the reason for this? Do you plan to change the default?
> >>
> >>
> >> I think there is some confusion. I do believe this is a number of
> rows, but I'd guess it was set to 64M because it wasn't carefully
> adapted from parquet-mr, which I would guess uses a byte size and
> therefore aligns well with the HDFS block size.
> >>
> >> I don't recall seeing any open issues to change it.  It looks like for
> datasets [1] the default is 1 million, so maybe we should try to align
> these two.  I don't have a strong opinion here, but my impression is that
> the current thinking is to generally have 1 row-group per file, and
> eliminate entire files.  For sub-file pruning, I think column indexes are a
> better solution but they have not been implemented in pyarrow yet.
> >>
> >> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
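> >>
> >> For reference, a minimal sketch of setting the dataset writer's
> >> row-group size explicitly (assuming a pyarrow version where the
> >> min/max_rows_per_group options are available; `table` and the output
> >> directory are placeholders):
> >>
> >> import pyarrow.dataset as ds
> >>
> >> # write_dataset also sizes row groups in rows; pinning both bounds
> >> # to 1M rows roughly matches the dataset default mentioned above.
> >> ds.write_dataset(
> >>     table,
> >>     "out/",
> >>     format="parquet",
> >>     min_rows_per_group=1_000_000,
> >>     max_rows_per_group=1_000_000,
> >> )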
> >>
> >> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
> [email protected]> wrote:
> >>>
> >>> hi Shawn,
> >>>
> >>> I expect this is the default because Parquet comes from the Hadoop
> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
> need a different default? You can set it to the size that fits your use
> case best, right?
> >>>
> >>> Marnix
> >>>
> >>>
> >>>
> >>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
> >>>>
> >>>> For clarification, I am referring to pyarrow.parquet.write_table.
> >>>>
> >>>> Shawn Zeng <[email protected]> wrote on Tue, Feb 22, 2022 at 20:40:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> The default row_group_size is really large, which means that even a
> large table, if it has fewer than 64M rows, will not get the benefits of
> row-group-level statistics. What is the reason for this? Do you plan to
> change the default?
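> >>>>>
> >>>>> For context, a minimal sketch of the parameter in question (the
> >>>>> data and filename are placeholders):
> >>>>>
> >>>>> import pyarrow as pa
> >>>>> import pyarrow.parquet as pq
> >>>>>
> >>>>> # row_group_size is a row count; with the 64M-row default, a
> >>>>> # smaller table ends up in a single row group.
> >>>>> table = pa.table({"id": list(range(3_000_000))})
> >>>>> pq.write_table(table, "example.parquet", row_group_size=1_000_000)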
> >>>>>
> >>>>> Thanks,
> >>>>> Shawn
>
