alamb commented on code in PR #48468: URL: https://github.com/apache/arrow/pull/48468#discussion_r2698022644
##########
cpp/src/parquet/properties.h:
##########
```diff
@@ -160,6 +160,7 @@
 static constexpr bool DEFAULT_IS_DICTIONARY_ENABLED = true;
 static constexpr int64_t DEFAULT_DICTIONARY_PAGE_SIZE_LIMIT = kDefaultDataPageSize;
 static constexpr int64_t DEFAULT_WRITE_BATCH_SIZE = 1024;
 static constexpr int64_t DEFAULT_MAX_ROW_GROUP_LENGTH = 1024 * 1024;
+static constexpr int64_t DEFAULT_MAX_ROW_GROUP_BYTES = 128 * 1024 * 1024;
```

Review Comment:
> Is there a particular reason for this value? AFAIK some Parquet implementation (is it Parquet Rust? @alamb) writes a single row group per file by default.

The default row group size in the Rust writer is 1M rows (1024 * 1024):
https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_max_row_group_size

I believe at least at some point in the past, the DuckDB Parquet writer wrote a single large row group. I am not sure whether that is still the current behavior.
