wgtmac commented on PR #46972:
URL: https://github.com/apache/arrow/pull/46972#issuecomment-3057746671

   > Ah I think you're thinking of the 
[`write_batch_size`](https://github.com/apache/arrow/blob/0b34e6bed40d48ae44a137afd196af94d9117e3b/cpp/src/parquet/properties.h#L160)
 parameter that's used by the Arrow API. This is a number of rows and defaults 
to 1024. I used the column writer based API rather than the Arrow API though.
   
   I just realized that a large `properties_->write_batch_size()` makes it 
difficult to split data pages precisely based on 
`properties_->data_pagesize()`. To implement 
https://github.com/apache/arrow/issues/47030, we have to adjust the batch size 
to satisfy the new `properties_->max_rows_per_data_page()`. Perhaps we should 
slightly change the meaning of `properties_->write_batch_size()` to be the 
maximum number of values in a batch written to a ColumnWriter. Does that make 
sense? @adamreeve @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
