yonipeleg33 opened a new pull request, #9357: URL: https://github.com/apache/arrow-rs/pull/9357
# Which issue does this PR close?

This PR implements another suggestion from https://github.com/apache/arrow-rs/issues/1213:

> Add functionality to flush based on a bytes threshold instead of, or in addition to, the current row threshold

So it does not "Close" any issue; it partially addresses the one above.

# Rationale for this change

A best-effort attempt to match Spark's (more specifically, Hadoop's) `parquet.block.size` configuration behaviour, as documented in parquet-hadoop's [README](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md):

> **Property:** `parquet.block.size`
> **Description:** The block size in bytes...

Since arrow's parquet writer writes whole batches, it is inherently different from Hadoop's per-record writer, so `max_row_group_bytes` will not behave exactly like Hadoop's `parquet.block.size`; this is the closest I could reasonably get (see details below).

# What changes are included in this PR?

**Configuration changes**

- New optional `max_row_group_bytes` configuration option in `WriterProperties` (see the first sketch at the end of this description).
- Renamed the existing private `max_row_group_size` property to `max_row_group_row_count`.
- **Backwards compatible:** no public APIs changed:
  - `set_max_row_group_size()` and `max_row_group_size()` remain with their existing signatures.
  - Added `set_max_row_group_row_count()` and `max_row_group_row_count()`, which expose the `Option<usize>` type.
  - If `set_max_row_group_row_count(None)` is called, `max_row_group_size()` returns `usize::MAX`.

**Writer changes**

`ArrowWriter::write` now supports any combination of the two limits (row count and byte size):

- Neither is set -> write everything in one row group.
- One is set -> respect only that one (either the byte or the row-count limit).
- Both are set -> respect whichever is hit first: open a new row group when either the row-count or the byte-size limit is reached.

The byte limit is evaluated once per batch (as opposed to Hadoop's per-record check): before writing each batch, the writer computes the average row size in bytes from previous writes, and flushes or splits the batch based on that average before the limit is exceeded (see the second sketch at the end of this description). As a consequence, the first batch is always written whole (unless a row-count limit is also set).

# Are these changes tested?

Yes - unit tests cover all combinations of the two properties being set.

# Are there any user-facing changes?

Yes: new APIs to configure byte limits on row groups, and a slight change to an existing one (`max_row_group_size()` now returns `usize::MAX` if the limit was unset by the user).
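---

For reference, a hedged usage sketch of the new configuration. The setter names (`set_max_row_group_bytes`, `set_max_row_group_row_count`) are inferred from the property names described above and may differ from the merged API:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Hypothetical usage of the knobs described in this PR; setter names
    // are assumptions inferred from the property names.
    let _props = WriterProperties::builder()
        // Flush a row group once its buffered size reaches ~128 MiB,
        // mirroring Hadoop's default `parquet.block.size`.
        .set_max_row_group_bytes(Some(128 * 1024 * 1024))
        // Also cap row groups at 1M rows; whichever limit is hit first wins.
        .set_max_row_group_row_count(Some(1_000_000))
        .build();
}
```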

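And a minimal sketch of the once-per-batch byte check described under "Writer changes". This is not the actual `ArrowWriter` internals; all names here are illustrative:

```rust
/// Estimate how many more rows fit in the open row group before the byte
/// threshold is crossed, assuming incoming rows match the average size
/// observed so far. Returns `None` when there is no history to average over.
fn rows_until_byte_limit(
    buffered_bytes: usize, // bytes already buffered in the open row group
    buffered_rows: usize,  // rows already buffered in the open row group
    max_row_group_bytes: usize,
) -> Option<usize> {
    if buffered_rows == 0 {
        // No previous writes to average over: the first batch is written whole.
        return None;
    }
    let avg_row_bytes = (buffered_bytes / buffered_rows).max(1);
    let remaining_bytes = max_row_group_bytes.saturating_sub(buffered_bytes);
    Some(remaining_bytes / avg_row_bytes)
}

fn main() {
    // 80 MiB buffered across 1M rows with a 128 MiB limit: ~600k more rows fit.
    let fit = rows_until_byte_limit(80 << 20, 1_000_000, 128 << 20);
    println!("rows that still fit: {:?}", fit);
}
```

With an estimate like this, the writer can split the incoming batch at that row offset (presumably via something like `RecordBatch::slice`), flush the row group, and continue with the remainder.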