yonipeleg33 opened a new pull request, #9357:
URL: https://github.com/apache/arrow-rs/pull/9357

   # Which issue does this PR close?
   
   This PR implements another suggestion introduced in 
https://github.com/apache/arrow-rs/issues/1213:
   > Add functionality to flush based on a bytes threshold instead of, or in 
addition to, the current row threshold
   
   So not "Closes" anything new.
   
   # Rationale for this change
   
   A best effort to match Spark's (or more specifically, Hadoop's) 
`parquet.block.size` configuration behaviour, as documented in parquet-hadoop's 
[README](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md):
   > **Property:** `parquet.block.size`
   > **Description:** The block size in bytes...
   
   Since arrow's parquet writer writes whole batches, it is inherently different from Hadoop's per-record writer - so `max_row_group_bytes` will not behave exactly like Hadoop's `parquet.block.size`, but this is the closest I could reasonably get (see details below).
   
   # What changes are included in this PR?
   
   **Configuration changes**
   
   - A new optional `max_row_group_bytes` setting in `WriterProperties` (see the usage sketch after this list).
   - The existing `max_row_group_size` private property is renamed to `max_row_group_row_count`.
   - **Backwards compatible:** No public APIs changed:
       - `set_max_row_group_size()` and `max_row_group_size()` keep their existing signatures.
       - New `set_max_row_group_row_count()` and `max_row_group_row_count()` methods expose the limit as an `Option<usize>`.
       - If `set_max_row_group_row_count(None)` is called, `max_row_group_size()` returns `usize::MAX`.
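   
   A rough usage sketch of the new configuration, assuming the builder methods land with the names proposed here (the `Option<usize>` argument to `set_max_row_group_bytes` is an assumption, mirroring the row-count setter):
   
   ```rust
   use parquet::file::properties::WriterProperties;
   
   fn main() {
       // Both limits set: whichever is reached first closes the row group.
       // `set_max_row_group_bytes` is the new setter proposed in this PR.
       let props = WriterProperties::builder()
           .set_max_row_group_bytes(Some(128 * 1024 * 1024)) // ~128 MiB of data per row group
           .set_max_row_group_size(1_048_576)                // existing row-count setter, unchanged
           .build();
       assert_eq!(props.max_row_group_size(), 1_048_576);
   
       // Unsetting the row-count limit makes the legacy getter report `usize::MAX`.
       let props = WriterProperties::builder()
           .set_max_row_group_row_count(None)
           .build();
       assert_eq!(props.max_row_group_size(), usize::MAX);
   }
   ```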
   
   **Writer changes**
   
   `ArrowWriter::write` now supports any combination of the two limits (row count and byte size); a simplified sketch of how they combine follows this list:
   - Both unset -> Write everything in one row group.
   - Only one set -> Respect that limit alone (either the byte size or the row count).
   - Both set -> Respect whichever limit is reached first: open a new row group when either the row count or the byte size limit is hit.
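   
   A simplified sketch of that combination logic (names are illustrative, not the actual code in this PR):
   
   ```rust
   /// Returns how many more rows fit in the current row group, or `None` if
   /// neither limit is set (i.e. everything goes into one row group).
   fn rows_until_flush(
       rows_in_group: usize,
       bytes_in_group: usize,
       avg_row_bytes: usize,
       max_rows: Option<usize>,
       max_bytes: Option<usize>,
   ) -> Option<usize> {
       // Remaining budget according to the row-count limit, if set.
       let by_rows = max_rows.map(|m| m.saturating_sub(rows_in_group));
       // Remaining budget according to the byte limit, converted to rows via
       // the average row size estimated from previous writes.
       let by_bytes = max_bytes.map(|m| m.saturating_sub(bytes_in_group) / avg_row_bytes.max(1));
       match (by_rows, by_bytes) {
           (None, None) => None,                 // both unset: one row group
           (Some(r), None) => Some(r),           // only the row-count limit applies
           (None, Some(b)) => Some(b),           // only the byte limit applies
           (Some(r), Some(b)) => Some(r.min(b)), // both set: whichever is hit first
       }
   }
   ```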
   
   The byte limit is checked once per batch (as opposed to Hadoop's per-record check):
   before writing each batch, the average row size in bytes is computed from previous writes, and the batch is flushed or split according to that average before the limit is hit.
   This means the first batch is always written as a whole (unless the row count limit is also set), since there is no prior data to estimate the average from.
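   
   In sketch form (illustrative only; names and bookkeeping differ from the actual implementation), the per-batch split could look like this:
   
   ```rust
   use arrow_array::RecordBatch;
   
   /// Decide how much of `batch` fits into the current row group under a byte
   /// limit, splitting it if needed. `bytes_written`/`rows_written` describe the
   /// rows already in the current row group.
   fn plan_batch(
       batch: &RecordBatch,
       bytes_written: usize,
       rows_written: usize,
       max_bytes: usize,
   ) -> (RecordBatch, Option<RecordBatch>) {
       if rows_written == 0 {
           // No history yet to estimate an average row size from,
           // so the first batch is written as a whole.
           return (batch.clone(), None);
       }
       let avg_row_bytes = (bytes_written / rows_written).max(1);
       let rows_that_fit = max_bytes.saturating_sub(bytes_written) / avg_row_bytes;
       if rows_that_fit >= batch.num_rows() {
           // The whole batch fits under the estimated limit.
           (batch.clone(), None)
       } else {
           // Split before the byte limit is hit; the tail starts the next row group.
           let head = batch.slice(0, rows_that_fit);
           let tail = batch.slice(rows_that_fit, batch.num_rows() - rows_that_fit);
           (head, Some(tail))
       }
   }
   ```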
   
   # Are these changes tested?
   
   Yes - unit tests were added covering all combinations of the two limits being set or unset; a rough sketch of the shape of such a test follows.
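   
   For illustration, a test of this kind might look roughly like this (the `set_max_row_group_bytes` call is the hypothetical new setter from this PR, and the exact limits and assertions differ from the tests actually added):
   
   ```rust
   use std::sync::Arc;
   use arrow_array::{ArrayRef, Int64Array, RecordBatch};
   use parquet::arrow::ArrowWriter;
   use parquet::file::properties::WriterProperties;
   
   #[test]
   fn byte_limit_splits_row_groups() {
       let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0..10_000i64));
       let batch = RecordBatch::try_from_iter([("v", col)]).unwrap();
   
       // A deliberately tiny byte limit so later writes must be split.
       let props = WriterProperties::builder()
           .set_max_row_group_bytes(Some(16 * 1024)) // hypothetical setter from this PR
           .build();
   
       let mut writer = ArrowWriter::try_new(Vec::<u8>::new(), batch.schema(), Some(props)).unwrap();
       writer.write(&batch).unwrap();
       writer.write(&batch).unwrap(); // the first batch is written whole; later ones get split
       let metadata = writer.close().unwrap();
   
       assert!(metadata.row_groups.len() > 1);
   }
   ```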
   
   # Are there any user-facing changes?
   
   Yes: new APIs are introduced for configuring a byte limit on row groups, and there is a slight change to an existing one (`max_row_group_size()` now returns `usize::MAX` if the row count limit was unset by the user).
   

