yonipeleg33 commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2798374779
##########
cpp/src/parquet/file_writer.cc:
##########
@@ -640,6 +667,29 @@ void ParquetFileWriter::AddKeyValueMetadata(
}
}
+std::optional<double> ParquetFileWriter::EstimateCompressedBytesPerRow() const {
Review Comment:
FWICT, `parquet.block.size` and its equivalent in Arrow should look at the
_encoded_ data size, not the _compressed_ size.
There is also `parquet.page.size`, which explicitly refers to compressed data.
For reference, see parquet-hadoop's
[README](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md),
and its implementation (which I also used as a reference):
[InternalParquetRecordWriter.checkBlockSizeReached](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L183)
computes the current size by calling `getBufferedSize`, which is
[documented](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ColumnWriteStore.java#L54)
as:
> approximate size of the buffered encoded binary data
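
To make the suggestion concrete, here is a minimal sketch (not the PR's code) of the parquet-java shape expressed in C++ terms. The inputs are hypothetical parameters, not existing Arrow APIs; in parquet-java they come from `ColumnWriteStore.getBufferedSize()` and the writer's record count. The point is that the compression ratio never enters the decision: both the per-row estimate and the flush check are driven by the buffered *encoded* size.

```cpp
#include <cstdint>
#include <optional>

// Sketch only: per-row estimate = buffered encoded bytes / buffered rows.
// Mirrors the quantity checkBlockSizeReached derives from getBufferedSize().
std::optional<double> EstimateEncodedBytesPerRow(int64_t buffered_encoded_bytes,
                                                 int64_t buffered_rows) {
  // No rows buffered yet -> no meaningful estimate.
  if (buffered_rows <= 0) return std::nullopt;
  return static_cast<double>(buffered_encoded_bytes) /
         static_cast<double>(buffered_rows);
}

// Sketch only: flush decision compares the same buffered encoded size
// against the configured block size (parquet.block.size in parquet-hadoop).
bool ShouldFlushRowGroup(int64_t buffered_encoded_bytes,
                         int64_t block_size_bytes) {
  return buffered_encoded_bytes >= block_size_bytes;
}
```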