wgtmac commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2744366379
##########
cpp/src/parquet/file_writer.cc:
##########
@@ -68,6 +68,12 @@ int64_t RowGroupWriter::total_compressed_bytes_written() const {
   return contents_->total_compressed_bytes_written();
 }
+int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
+  return contents_->total_compressed_bytes() +
+         contents_->total_compressed_bytes_written() +
+         contents_->EstimatedBufferedValueBytes();
Review Comment:
I agree that in most common cases, buffered values contribute only a small
fraction of the total row group size. There is a caveat in this approach,
though: dictionary entries are buffered and thus not counted here. With many
columns we may underestimate by a lot, because dictionary encoding is enabled
by default.

If it is too hard to decide, how about providing a config that lets users
choose which estimate to use: written pages only, or written pages plus
buffered values?
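Roughly what I have in mind, just as a sketch (the `EstimationMode` enum and
`estimation_mode_` member are placeholders, not existing writer APIs):

```cpp
// Sketch only: EstimationMode / estimation_mode_ are hypothetical names.
enum class EstimationMode {
  kWrittenPagesOnly,          // count only pages that have already been encoded/flushed
  kWrittenPagesPlusBuffered   // additionally count values still buffered in the encoders
};

int64_t RowGroupWriter::EstimatedTotalCompressedBytes() const {
  int64_t estimate = contents_->total_compressed_bytes() +
                     contents_->total_compressed_bytes_written();
  if (estimation_mode_ == EstimationMode::kWrittenPagesPlusBuffered) {
    estimate += contents_->EstimatedBufferedValueBytes();
  }
  return estimate;
}
```
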
> If we choose 1, we can change to estimate batch size based on avg row size
and written row numbers to avoid ignoring too many buffered bytes.
This approach also has an obvious caveat, since data pages usually do not
share the same row boundaries.
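
For reference, the avg-row-size idea would look roughly like this (a
free-standing sketch with made-up parameter names, only to illustrate where
the caveat bites):

```cpp
// Illustration only: extrapolate buffered bytes from the average size of rows
// already written. Parameter names are invented for this sketch.
int64_t EstimateWithAvgRowSize(int64_t written_bytes, int64_t written_rows,
                               int64_t buffered_rows) {
  if (written_rows == 0) {
    return 0;  // nothing flushed yet, so there is no average to extrapolate from
  }
  const double avg_row_bytes =
      static_cast<double>(written_bytes) / static_cast<double>(written_rows);
  // Caveat: columns flush pages at different row offsets, so the notion of
  // "written rows" is fuzzy and the average can be skewed.
  return written_bytes + static_cast<int64_t>(avg_row_bytes * buffered_rows);
}
```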
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]