devinjdangelo opened a new pull request, #7655: URL: https://github.com/apache/arrow-datafusion/pull/7655
This PR is a draft as it depends on <forthcoming>

## Which issue does this PR close?

Closes https://github.com/apache/arrow-datafusion/issues/7591
Closes https://github.com/apache/arrow-datafusion/issues/7589
Closes #7590
Related to https://github.com/apache/arrow-rs/issues/1718

## Rationale for this change

The original parallel parquet writer serialized row groups independently. This has two major disadvantages:

1. High memory usage, as each parallel row group is buffered entirely in memory before being flushed to the `ObjectStore`.
2. No easy way to respect the `WriterProperties.max_row_group_size` setting, since the number of row groups is determined by the parallelism.

This PR implements a new approach that parallelizes parquet serialization primarily by serializing the different columns of a single row group in parallel. Different row groups can still be serialized in parallel if incoming data outpaces the serialization of a single row group.

## What changes are included in this PR?

1. A new process that breaks an `ArrowRowGroupWriter` apart into its `ArrowColumnWriter` components and sends each to a dedicated parallel task.
2. Respects the `WriterProperties.max_row_group_size` setting, closing a row group as soon as this threshold is reached.
3. If `RecordBatches` are being produced fast enough, the next row group can begin serializing before the prior one has finished.
4. User-configurable limits on the number of `RowGroups` and `RecordBatches` that are allowed to accumulate in memory before backpressure kicks in. These are made user configurable since the optimal values depend on a system's mix of CPU, memory, and I/O resources.
5. Refactors the parallel parquet code to improve readability.

## Are these changes tested?

Yes, by existing tests.

## Benchmarking

tl;dr: up to ~20% faster than #7632 and 90%+ lower memory overhead. Less than 100 MB of additional memory was required vs. the non-parallel version in the single-file case.
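As context for the numbers below, the column-parallel scheme described under "What changes are included" can be sketched in std-only Rust. This is a conceptual illustration, not the actual DataFusion/arrow-rs code: `encode_column` and `parallel_row_group` are hypothetical stand-ins for the `ArrowColumnWriter` machinery, and the bounded channel stands in for the user-configurable backpressure limits.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical stand-in for per-column Parquet encoding: pretend the
// "encoded size" is simply 8 bytes per i64 value.
fn encode_column(name: &str, values: &[i64]) -> (String, usize) {
    (name.to_string(), values.len() * 8)
}

// Serialize each column of one row group on its own thread, then stitch the
// finished column chunks back together in their original order, mirroring
// how a row group is reassembled before being flushed.
fn parallel_row_group(columns: Vec<(String, Vec<i64>)>) -> Vec<(String, usize)> {
    // Bounded channel: at most 2 finished column chunks may sit in the
    // buffer before workers block -- this models the backpressure limits.
    let (tx, rx) = sync_channel(2);
    let mut handles = Vec::new();
    for (idx, (name, values)) in columns.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            tx.send((idx, encode_column(&name, &values))).unwrap();
        }));
    }
    drop(tx); // close our sender so the receiver iterator terminates
    let mut chunks: Vec<(usize, (String, usize))> = rx.into_iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    // Restore the original column order before "writing" the row group.
    chunks.sort_by_key(|&(idx, _)| idx);
    chunks.into_iter().map(|(_, chunk)| chunk).collect()
}
```

The design choice this illustrates: because each column chunk is bounded in size and flushed per row group, memory stays close to the serial writer's footprint, unlike the old approach of buffering whole row groups per parallel task.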
### Single File Output = True, 3.6 GB Parquet File

| | Single File Parallelism Off | Single File Parallelism On |
| -- | -- | -- |
| Execution Time (s) | 27.55 | 8.94 |
| Memory Usage (MB) | 279.8 | 364.9 |

We can now also parallelize the serialization of multiple output parquet files at the same time (e.g. 4 threads working on each of 4 parquet files, keeping 16 threads busy). The more parquet files we write relative to the threads in our system, the smaller the benefit of single-file-level parallelism.

### Single File Output = False, ~3.6 GB of Total Parquet Files Written

#### Execution Time (s)

| Number of Output Files | Single File Parallelism Off | Single File Parallelism On | Speed Up |
| -- | -- | -- | -- |
| 2 | 16.57 | 8.03 | 106% |
| 4 | 10.98 | 8.93 | 23% |
| 8 | 9.33 | 8.92 | 5% |

#### Memory Usage (MB)

| Number of Output Files | Single File Parallelism Off | Single File Parallelism On | Memory Overhead (MB) |
| -- | -- | -- | -- |
| 2 | 1971.4 | 2326.4 | 355 |
| 4 | 2551.0 | 2548.7 | -2.3 |
| 8 | 3408.1 | 3231.1 | -177 |

## Are there any user-facing changes?

Faster parquet serialization.
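For reference, the "Speed Up" column in the execution-time table above is the relative improvement of parallelism-on over parallelism-off wall time. A small helper (my illustration, not part of the PR) makes the arithmetic explicit:

```rust
// Relative speedup as reported in the table: (off / on - 1) * 100.
// E.g. going from 16.57 s to 8.03 s is a ~106% speedup (about 2.06x).
fn speedup_percent(off_secs: f64, on_secs: f64) -> f64 {
    (off_secs / on_secs - 1.0) * 100.0
}
```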
