yonipeleg33 commented on code in PR #48468:
URL: https://github.com/apache/arrow/pull/48468#discussion_r2798374779
##########
cpp/src/parquet/file_writer.cc:
##########
@@ -640,6 +667,29 @@ void ParquetFileWriter::AddKeyValueMetadata(
}
}
+std::optional<double> ParquetFileWriter::EstimateCompressedBytesPerRow() const {
Review Comment:
FWICT, `parquet.block.size` and its equivalent in Arrow should look at the
_encoded_ data size, not the _compressed_ size.
There is also `parquet.page.size`, which explicitly refers to compressed data.
For reference, see parquet-hadoop's
[README](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md),
and its implementation (which I also used as a reference):
[InternalParquetRecordWriter.checkBlockSizeReached](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordWriter.java#L183)
computes the current size by calling `getBufferedSize`, which is
[documented](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ColumnWriteStore.java#L54)
as:
> approximate size of the buffered encoded binary data
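
To make the suggestion concrete, here is a minimal sketch (not the PR's code) of the parquet-java shape expressed in C++ terms. The inputs are hypothetical parameters, not existing Arrow APIs; in parquet-java they come from `ColumnWriteStore.getBufferedSize()` and the writer's record count. The point is that the compression ratio never enters the decision: both the per-row estimate and the flush check are driven by the buffered *encoded* size.

```cpp
#include <cstdint>
#include <optional>

// Sketch only: per-row estimate = buffered encoded bytes / buffered rows.
// Mirrors the quantity checkBlockSizeReached derives from getBufferedSize().
std::optional<double> EstimateEncodedBytesPerRow(int64_t buffered_encoded_bytes,
                                                 int64_t buffered_rows) {
  // No rows buffered yet -> no meaningful estimate.
  if (buffered_rows <= 0) return std::nullopt;
  return static_cast<double>(buffered_encoded_bytes) /
         static_cast<double>(buffered_rows);
}

// Sketch only: flush decision compares the same buffered encoded size
// against the configured block size (parquet.block.size in parquet-hadoop).
bool ShouldFlushRowGroup(int64_t buffered_encoded_bytes,
                         int64_t block_size_bytes) {
  return buffered_encoded_bytes >= block_size_bytes;
}
```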