wgtmac commented on PR #46972: URL: https://github.com/apache/arrow/pull/46972#issuecomment-3057746671
> Ah, I think you're thinking of the [`write_batch_size`](https://github.com/apache/arrow/blob/0b34e6bed40d48ae44a137afd196af94d9117e3b/cpp/src/parquet/properties.h#L160) parameter that's used by the Arrow API. This is a number of rows and defaults to 1024. I used the column-writer-based API rather than the Arrow API, though.

I just realized that a large `properties_->write_batch_size()` makes it difficult to precisely split data pages based on `properties_->data_pagesize()`. To implement https://github.com/apache/arrow/issues/47030, we have to adjust the batch size to satisfy the new `properties_->max_rows_per_data_page()`. Perhaps we should slightly change the meaning of `properties_->write_batch_size()` to be the *maximum* number of values in a batch written to a ColumnWriter, rather than a fixed batch size. Does that make sense? @adamreeve @pitrou
