> > What is the reason for this? Do you plan to change the default?
I think there is some confusion. I believe this is the number of rows, but I'd guess it was set to 64M because it wasn't carefully adapted from parquet-mr, which I would guess uses byte size and therefore aligns well with the HDFS block size. I don't recall seeing any open issues to change it.

It looks like for datasets [1] the default is 1 million, so maybe we should try to align these two. I don't have a strong opinion here, but my impression is that the current thinking is to generally have one row group per file and eliminate entire files. For sub-file pruning, I think column indexes are a better solution, but they have not been implemented in pyarrow yet.

[1] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset

On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <[email protected]> wrote:

> Hi Shawn,
>
> I expect this is the default because Parquet comes from the Hadoop
> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
> need a different default? You can set it to the size that fits your use
> case best, right?
>
> Marnix
>
> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
>
>> For clarification, I am referring to pyarrow.parquet.write_table.
>>
>> On Tue, Feb 22, 2022 at 8:40 PM Shawn Zeng <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The default row_group_size is really large, which means a table with
>>> fewer than 64M rows will not get the benefit of row-group-level
>>> statistics. What is the reason for this? Do you plan to change the default?
>>>
>>> Thanks,
>>> Shawn
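
A minimal sketch of setting the row group size explicitly rather than relying on the defaults discussed above. It assumes row_group_size on pyarrow.parquet.write_table is a row count (per the discussion in this thread) and that max_rows_per_group is the write_dataset parameter behind the 1 million default mentioned in the linked docs; the file paths and table contents are made up for illustration.

```python
# Sketch only: paths and data below are invented for illustration.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

n = 3_000_000
table = pa.table({"id": pa.array(range(n), type=pa.int64())})

# pyarrow.parquet.write_table: row_group_size is a row count, so without an
# explicit value a table smaller than the default lands in a single row group.
pq.write_table(table, "example.parquet", row_group_size=1_000_000)

# pyarrow.dataset.write_dataset: the analogous knob (per the docs linked
# above) appears to be max_rows_per_group, which defaults to 1 million rows.
ds.write_dataset(table, "example_dataset", format="parquet",
                 max_rows_per_group=1_000_000)
```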
