westonpace commented on pull request #11556:
URL: https://github.com/apache/arrow/pull/11556#issuecomment-953269749


   Yes, the parquet writer has a configurable max row group size, but it does 
not have a configurable min row group size.  The latter is helpful in 
particular for dataset writing because each incoming batch is split into N 
smaller partition batches.  If we then turn around and write those batches 
immediately, we can often end up with a bunch of small row groups, which is 
undesirable.  Also, the behavior of the max row group size is not quite what 
I'd want.  For example, if the max row group size is 1 million rows and I send 
a bunch of batches with 1.1 million rows each, then I'll end up with alternating 
row groups of 1 million rows and 100k rows.
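   To make the max-only behavior concrete, here is a toy simulation (not Arrow's actual writer code; `split_batches` is a hypothetical helper) of a writer that only enforces a cap: each incoming batch is cut at the cap, so every oversized batch leaves a small remainder row group behind.

   ```python
   def split_batches(batch_sizes, max_row_group_size):
       """Return the row-group sizes a max-only writer would emit."""
       groups = []
       for size in batch_sizes:
           # Cut off full-sized row groups until the batch fits under the cap.
           while size > max_row_group_size:
               groups.append(max_row_group_size)
               size -= max_row_group_size
           # Whatever remains becomes its own (possibly tiny) row group.
           if size > 0:
               groups.append(size)
       return groups

   # Three 1.1M-row batches with a 1M cap: alternating 1M and 100k groups.
   print(split_batches([1_100_000] * 3, 1_000_000))
   # -> [1000000, 100000, 1000000, 100000, 1000000, 100000]
   ```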
   
   We could push all these features down into the writers themselves, I suppose. 
 It might be better from a separation-of-concerns point of view, although it 
would make it a little harder to enforce `max_rows_staged` unless we also added 
a "force write" operation to the writers.
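   A rough sketch of that buffering idea, purely hypothetical (the class and method names here are illustrative, not Arrow APIs): the writer stages rows until a minimum row-group size is reached, and a force-write operation flushes whatever is staged so an external `max_rows_staged` limit can still be honored.

   ```python
   class StagingWriter:
       """Toy writer that buffers rows until a minimum row-group size."""

       def __init__(self, min_rows_per_group):
           self.min_rows_per_group = min_rows_per_group
           self.staged = 0
           self.groups_written = []  # sizes of emitted row groups

       def write(self, num_rows):
           self.staged += num_rows
           # Only emit a row group once the minimum is reached.
           if self.staged >= self.min_rows_per_group:
               self._emit()

       def force_write(self):
           # "Force write": flush staged rows even if below the minimum,
           # e.g. so a cap on total staged rows can be enforced.
           if self.staged > 0:
               self._emit()

       def _emit(self):
           self.groups_written.append(self.staged)
           self.staged = 0

   w = StagingWriter(min_rows_per_group=500_000)
   for _ in range(4):
       w.write(200_000)  # small partitioned batches get coalesced
   w.force_write()
   print(w.groups_written)
   # -> [600000, 200000]
   ```

   The small 200k-row batches coalesce into one 600k-row group instead of four tiny groups; the final force write flushes the 200k-row remainder.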

