alamb commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2698022644


##########
cpp/src/parquet/properties.h:
##########
@@ -160,6 +160,7 @@ static constexpr bool DEFAULT_IS_DICTIONARY_ENABLED = true;
 static constexpr int64_t DEFAULT_DICTIONARY_PAGE_SIZE_LIMIT = kDefaultDataPageSize;
 static constexpr int64_t DEFAULT_WRITE_BATCH_SIZE = 1024;
 static constexpr int64_t DEFAULT_MAX_ROW_GROUP_LENGTH = 1024 * 1024;
+static constexpr int64_t DEFAULT_MAX_ROW_GROUP_BYTES = 128 * 1024 * 1024;

Review Comment:
   > Is there a particular reason for this value? AFAIK some Parquet implementation (is it Parquet Rust? @alamb) writes a single row group per file by default.
   
   The default row group size in the Rust writer is 1M rows (1024 * 1024) -- NOT bytes:
   
   
https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_max_row_group_size
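   
   To make the units concrete, here is a minimal sketch (mine, not part of the PR) against the `parquet` crate's `WriterProperties` builder linked above. The `1024 * 1024` value is the documented default; it is set explicitly here only to show that the knob counts rows:
   
   ```rust
   use parquet::file::properties::WriterProperties;
   
   fn main() {
       // set_max_row_group_size takes a row count, not a byte size.
       let props = WriterProperties::builder()
           .set_max_row_group_size(1024 * 1024) // rows -- the documented default
           .build();
       assert_eq!(props.max_row_group_size(), 1024 * 1024);
   }
   ```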
   
   I looked through the Rust writer properties and didn't find any setting for a maximum row group size in bytes.
   
   I believe that, at least at some point in the past, the DuckDB Parquet writer wrote a single large row group per file -- I am not sure whether that is still the current behavior.
   