> > What is the reason for this? Do you plan to change the default?
I think there is some confusion. I believe this is the number of rows, but I'd guess it was set to 64M because it wasn't carefully adapted from parquet-mr, which I would guess uses byte size and therefore aligns well with the HDFS block size. I don't recall seeing any open issues to change it.

It looks like for datasets [1] the default is 1 million, so maybe we should try to align these two. I don't have a strong opinion here, but my impression is that the current thinking is to generally have one row group per file and eliminate entire files. For sub-file pruning, I think column indexes are a better solution, but they have not been implemented in pyarrow yet.

[1] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset

On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <[email protected]> wrote:

> Hi Shawn,
>
> I expect this is the default because Parquet comes from the Hadoop
> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
> need a different default? You can set it to the size that fits your use
> case best, right?
>
> Marnix
>
> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
>
>> For clarification, I am referring to pyarrow.parquet.write_table.
>>
>> On Tue, Feb 22, 2022 at 8:40 PM Shawn Zeng <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> The default row_group_size is really large, which means a table with
>>> fewer than 64M rows will not get the benefit of row-group-level
>>> statistics. What is the reason for this? Do you plan to change the default?
>>>
>>> Thanks,
>>> Shawn
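
A minimal sketch of setting the row group size explicitly rather than relying on the defaults discussed above. It assumes row_group_size on pyarrow.parquet.write_table is a row count (per the discussion in this thread) and that max_rows_per_group is the write_dataset parameter behind the 1 million default mentioned in the linked docs; the file paths and table contents are made up for illustration.

```python
# Sketch only: paths and data below are invented for illustration.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

n = 3_000_000
table = pa.table({"id": pa.array(range(n), type=pa.int64())})

# pyarrow.parquet.write_table: row_group_size is a row count, so without an
# explicit value a table smaller than the default lands in a single row group.
pq.write_table(table, "example.parquet", row_group_size=1_000_000)

# pyarrow.dataset.write_dataset: the analogous knob (per the docs linked
# above) appears to be max_rows_per_group, which defaults to 1 million rows.
ds.write_dataset(table, "example_dataset", format="parquet",
                 max_rows_per_group=1_000_000)
```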
