What are the tradeoffs between a small and a large row group size? Is it that a small value allows for quicker random access (since we can seek to a row group based on the number of rows it contains), while a larger value allows for better dictionary-encoding and compression ratios?
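For concreteness, here is a minimal sketch of the knob in question, using pyarrow's `row_group_size` argument to `write_table` (the table contents and file names are made up purely for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A toy table; contents and file names are only for illustration.
n = 100_000
table = pa.table({"id": list(range(n)), "value": [i % 10 for i in range(n)]})

# Small row groups: many seek points, so a reader can jump to the row
# group containing a given row and decode only that group.
pq.write_table(table, "small_groups.parquet", row_group_size=1_000)

# One large row group: dictionary pages and compression blocks span
# more values, which usually compresses better but makes reads coarser.
pq.write_table(table, "large_groups.parquet", row_group_size=n)

# The footer metadata exposes the resulting row-group layout.
print(pq.ParquetFile("small_groups.parquet").metadata.num_row_groups)  # 100
print(pq.ParquetFile("large_groups.parquet").metadata.num_row_groups)  # 1
```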
Best,
Jorge

On Wed, Nov 17, 2021 at 9:11 PM Jonathan Keane <jke...@gmail.com> wrote:
> This doesn't address the large number of row groups ticket that was
> raised, but for some visibility: there is some work to change the row
> group sizing based on the size of the data instead of a static number
> of rows [1], as well as exposing a few more knobs to tune [2].
>
> There is a bit of prior art in the R implementation for attempting to
> get a reasonable row group size based on the shape of the data
> (basically, it aims to have row groups that hold 250 million cells). [3]
>
> [1] https://issues.apache.org/jira/browse/ARROW-4542
> [2] https://issues.apache.org/jira/browse/ARROW-14426 and
>     https://issues.apache.org/jira/browse/ARROW-14427
> [3] https://github.com/apache/arrow/blob/641554b0bcce587549bfcfd0cde3cb4bc23054aa/r/R/parquet.R#L204-L222
>
> -Jon
>
> On Wed, Nov 17, 2021 at 4:35 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > In addition, would it be useful to be able to change this
> > max_row_group_length from Python?
> > Currently that writer property can't be changed from Python; you can
> > only specify the row_group_size (chunk_size in C++) when writing a
> > table, but that's currently only useful to set it to something that
> > is smaller than the max_row_group_length.
> >
> > Joris
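For readers who don't want to dig through the R source linked as [3] above, the heuristic described there amounts to roughly the following. This is a hypothetical Python translation of the idea (target 250 million cells per row group), not an exact port of the linked code:

```python
def default_row_group_size(num_rows: int, num_columns: int,
                           target_cells: int = 250_000_000) -> int:
    """Pick a rows-per-row-group count so each row group holds roughly
    `target_cells` cells (rows * columns), mirroring the idea behind
    the R heuristic linked as [3]."""
    if num_columns == 0:
        return num_rows
    return max(1, min(num_rows, target_cells // num_columns))

# e.g. a 10-million-row, 100-column table gets row groups of
# 2.5 million rows (250M cells / 100 columns).
print(default_row_group_size(10_000_000, 100))  # 2500000
```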