wgtmac commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2744366379
##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() const {
   return contents_->total_compressed_bytes_written();
 }
+int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
+  return contents_->total_compressed_bytes() +
+         contents_->total_compressed_bytes_written() +
+         contents_->EstimatedBufferedValueBytes();
Review Comment:
I agree that in most common cases, buffered values contribute only a small
fraction of the total row group size. There is a caveat in this approach,
though: dictionary entries are buffered and thus not counted here. With many
columns we may underestimate by a lot, because dictionary encoding is enabled
by default.

If it is too hard to decide, how about providing a config that lets users
choose which estimate to use: written pages only, or written pages plus
buffered values?
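Roughly what I have in mind, just as a sketch (the `EstimationMode` enum and
`estimation_mode_` member are placeholders, not existing writer APIs):

```cpp
// Sketch only: EstimationMode / estimation_mode_ are hypothetical names.
enum class EstimationMode {
  kWrittenPagesOnly,          // count only pages that have already been encoded/flushed
  kWrittenPagesPlusBuffered   // additionally count values still buffered in the encoders
};

int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
  int64_t estimate = contents_->total_compressed_bytes() +
                     contents_->total_compressed_bytes_written();
  if (estimation_mode_ == EstimationMode::kWrittenPagesPlusBuffered) {
    estimate += contents_->EstimatedBufferedValueBytes();
  }
  return estimate;
}
```
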
> If we choose 1, we can change to estimate batch size based on avg row size
and written row numbers to avoid ignoring too many buffered bytes.
This approach also has an obvious caveat, since data pages usually do not
share the same row boundaries.
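
For reference, the avg-row-size idea would look roughly like this (a
free-standing sketch with made-up parameter names, only to illustrate where
the caveat bites):

```cpp
// Illustration only: extrapolate buffered bytes from the average size of rows
// already written. Parameter names are invented for this sketch.
int64_t EstimateWithAvgRowSize(int64_t written_bytes, int64_t written_rows,
                               int64_t buffered_rows) {
  if (written_rows == 0) {
    return 0;  // nothing flushed yet, so there is no average to extrapolate from
  }
  const double avg_row_bytes =
      static_cast<double>(written_bytes) / static_cast<double>(written_rows);
  // Caveat: columns flush pages at different row offsets, so the notion of
  // "written rows" is fuzzy and the average can be skewed.
  return written_bytes + static_cast<int64_t>(avg_row_bytes * buffered_rows);
}
```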
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]