> IIRC some writers (perhaps parquet-rs?) always write a single row
> group, however large the data.
FWIW parquet-rs will write multiple row groups depending on the
configuration. The defaults will write row groups of up to 1M rows [1].
Perhaps you might be thinking of parquet-cpp, which for a very long time
had a very high default, leading it to often create a single massive
row group [2]? I believe this was a bug and has since been fixed.
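To make the behaviour concrete, here is a hedged, stdlib-only Python sketch of the buffering logic a writer like this might use: flush a row group when a row-count cap (mirroring parquet-rs's 1M-row DEFAULT_MAX_ROW_GROUP_SIZE) or a logical-byte cap (the 2^31-byte limit Alkis proposes downthread) is reached. The function names and the size-estimation hook are illustrative assumptions, not any library's actual API, and it sidesteps the nested-type edge case Antoine raises:

```python
# Hypothetical sketch, NOT parquet-rs's or parquet-cpp's real implementation:
# bound row group size by both row count and estimated logical bytes.
MAX_ROWS = 1_000_000        # parquet-rs's default max row group size, per [1]
MAX_LOGICAL_BYTES = 2**31   # the logical-byte cap proposed downthread

def write_row_groups(rows, estimate_size, flush):
    """Buffer rows, flushing a row group whenever either limit is hit.

    `estimate_size(row)` returns the logical (pre-encoding) size of one row;
    `flush(rows)` stands in for encoding/compressing and writing one row
    group. Both are caller-supplied placeholders for real writer internals.
    Note: a single nested "row" larger than MAX_LOGICAL_BYTES would still
    produce an oversized group here -- that case needs separate handling.
    """
    buffered, buffered_bytes = [], 0
    for row in rows:
        buffered.append(row)
        buffered_bytes += estimate_size(row)
        if len(buffered) >= MAX_ROWS or buffered_bytes >= MAX_LOGICAL_BYTES:
            flush(buffered)
            buffered, buffered_bytes = [], 0
    if buffered:
        flush(buffered)  # final, possibly short, row group
```

With a fixed per-row estimate of 2^30 bytes, for example, this flushes a row group every two rows; with tiny rows it flushes every million rows, matching the default behaviour described above.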
Kind Regards,
Raphael
[1]: https://docs.rs/parquet/latest/parquet/file/properties/constant.DEFAULT_MAX_ROW_GROUP_SIZE.html
[2]: https://github.com/apache/arrow/pull/36012
On 29/08/2024 15:11, Antoine Pitrou wrote:
> On Thu, 29 Aug 2024 12:33:25 +0200
> Alkis Evlogimenos
> <[email protected]>
> wrote:
>> The simplest fix for a writer is to limit row groups to 2^31
>> logical bytes and then run encoding/compression.
>
> I would be curious to see how complex the required logic ends up,
> especially when taking into account nested types. A pathological case
> would be a nested type with more than 2^31 repeated values in a single
> "row".
>
>> Given that row groups are typically targeting a size of 64/128MB,
>> that should work rather well unless the data in question is of
>> extremely low entropy and compresses too well.
>
> IIRC some writers (perhaps parquet-rs?) always write a single row
> group, however large the data.
>
> Regards
>
> Antoine.