What are the tradeoffs between a low and a large row group size?

Is it that a low value allows for quicker random access (since we can seek
to row groups based on the number of rows they contain), while a larger
value allows for better dictionary encoding and compression ratios?
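
To make the question concrete, here is a rough pyarrow sketch of what I
mean (the file names and sizes are just made up for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical example data; column names and sizes are made up.
    table = pa.table({"id": pa.array(range(1_000_000)),
                      "category": pa.array(["a", "b", "c", "d"] * 250_000)})

    # Small row groups: many groups, each independently readable,
    # but dictionary encoding / compression see fewer values per group.
    pq.write_table(table, "small_groups.parquet", row_group_size=10_000)

    # Large row groups: fewer groups, better dictionary encoding and
    # compression ratios, but coarser granularity for random access.
    pq.write_table(table, "large_groups.parquet", row_group_size=1_000_000)

    # Random access works per row group: read only the group you need.
    f = pq.ParquetFile("small_groups.parquet")
    print(f.metadata.num_row_groups)   # 100 groups of 10_000 rows each
    first_group = f.read_row_group(0)  # reads just that row group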

Best,
Jorge




On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane <jke...@gmail.com> wrote:

> This doesn't address the large number of row groups ticket that was
> raised, but for some visibility: there is some work to change the row
> group sizing based on the size of data instead of a static number of
> rows [1] as well as exposing a few more knobs to tune [2]
>
> There is a bit of prior art in the R implementation for attempting to
> get a reasonable row group size based on the shape of the data
> (basically, aims to have row groups that have 250 Million cells in
> them). [3]
>
> [1] https://issues.apache.org/jira/browse/ARROW-4542
> [2] https://issues.apache.org/jira/browse/ARROW-14426 and
> https://issues.apache.org/jira/browse/ARROW-14427
> [3]
> https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
>
> -Jon
>
> On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > In addition, would it be useful to be able to change this
> > max_row_group_length from Python?
> > Currently that writer property can't be changed from Python; you can
> > only specify the row_group_size (chunk_size in C++) when writing a
> > table, but that's currently only useful for setting it to something
> > smaller than the max_row_group_length.
> >
> > Joris
>
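
For what it's worth, here is a rough sketch in Python (my own
approximation, not the actual R code from [3]) of the ~250-million-cell
heuristic Jon mentions above:

    import pyarrow.parquet as pq

    TARGET_CELLS_PER_GROUP = 250_000_000  # heuristic used by the R writer

    def guess_row_group_size(table):
        # Aim for row groups of roughly 250 million cells
        # (rows * columns), capped at the table's total length.
        rows = TARGET_CELLS_PER_GROUP // max(1, table.num_columns)
        return max(1, min(rows, table.num_rows))

    # e.g. a 100-column table gets ~2.5 million rows per row group:
    # pq.write_table(table, "data.parquet",
    #                row_group_size=guess_row_group_size(table))

If I understand it correctly, scaling the rows per group by the number of
columns keeps each row group's total size roughly constant no matter how
wide the table is.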
