>
> I was wondering if anyone could elaborate on why the default maximum row
> group length is set to 67108864<
> https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
> From Apache Parquet's documentation, the recommended row group size is
> between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/>
> For a Float64Array whose length is 67108864, I believe its size would be
> approximately 545 MB, which is on the low end of that interval.


I don't think we currently have any heuristic relating row group size in
bytes to the row count (we should probably try adding one).  Even the default
seems pretty high, since in general Parquet files are going to have more than
one column per row group.
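
To put rough numbers on this, here is a minimal Python sketch of the
arithmetic. It assumes fixed-width columns held in Arrow memory (a data buffer
plus a 1-bit-per-row validity bitmap) and ignores Parquet encoding and
compression, so it is only a ballpark. The function names are made up for
illustration and are not part of Arrow or pyarrow, and rows_for_target is just
one possible shape for the heuristic mentioned above, not something that
exists today.

```python
from typing import Sequence

# Current default maximum row group length (64 * 1024 * 1024 rows).
DEFAULT_MAX_ROW_GROUP_LENGTH = 67108864


def column_bytes(num_rows: int, value_width: int) -> int:
    """Approximate in-memory bytes for one fixed-width column."""
    data = num_rows * value_width      # value buffer
    validity = (num_rows + 7) // 8     # validity bitmap, 1 bit per row
    return data + validity


def row_group_bytes(num_rows: int, value_widths: Sequence[int]) -> int:
    """Approximate in-memory bytes for a row group with the given columns."""
    return sum(column_bytes(num_rows, w) for w in value_widths)


def rows_for_target(target_bytes: int, value_widths: Sequence[int]) -> int:
    """Hypothetical heuristic: row count that roughly fits a byte budget."""
    bits_per_row = sum(8 * w + 1 for w in value_widths)
    return (target_bytes * 8) // bits_per_row


if __name__ == "__main__":
    one_col = column_bytes(DEFAULT_MAX_ROW_GROUP_LENGTH, 8)
    ten_cols = row_group_bytes(DEFAULT_MAX_ROW_GROUP_LENGTH, [8] * 10)
    print(f"1 float64 column:   {one_col / 1e6:.0f} MB")
    print(f"10 float64 columns: {ten_cols / 1e9:.2f} GB")
    print("rows for 512 MiB, 10 columns:",
          rows_for_target(512 * 1024**2, [8] * 10))
```

A single Float64 column at the default length comes out at 545259520 bytes,
which matches the ~545 MB figure quoted above. With ten such columns the same
row count is roughly ten times that, which is why the default looks high for
typical multi-column files.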


> I experimented with setting the default maximum row group length to larger
> values and noticed pyarrow cannot import Parquet files containing row
> groups whose lengths exceed 2147483647 rows (int32 max). However, I was
> able to read these files in using the C++ Arrow bindings.

This is surprising, and without seeing the exact error it sounds like a
bug.  Could you open a JIRA to discuss (or check whether there is already one
tracking this)?


On Mon, Nov 15, 2021 at 12:23 PM Sarah Gilmore <sgilm...@mathworks.com>
wrote:

> Hi all,
>
> I was wondering if anyone could elaborate on why the default maximum row
> group length is set to 67108864<
> https://github.com/apache/arrow/blob/5c936560c1da003baf714d67dc92f25670730c84/cpp/src/parquet/properties.h#L97>.
> From Apache Parquet's documentation, the recommended row group size is
> between 512 MB and 1 GB.<https://parquet.apache.org/documentation/latest/>
> For a Float64Array whose length is 67108864, I believe its size would be
> approximately 545 MB, which is on the low end of that interval.
>
> I was wondering if there was a particular reason why 67108864 was chosen
> as the maximum row group length. I experimented with setting the default
> maximum row group length to larger values and noticed pyarrow cannot import
> Parquet files containing row groups whose lengths exceed 2147483647 rows
> (int32 max). However, I was able to read these files in using the C++ Arrow
> bindings.
>
>
> Best,
> Sarah
>
>
>
