devinjdangelo opened a new pull request, #7655:
URL: https://github.com/apache/arrow-datafusion/pull/7655

   This PR is a draft as it depends on <forthcoming>
   
   ## Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/7591
   Closes https://github.com/apache/arrow-datafusion/issues/7589
   Closes #7590
   Related to https://github.com/apache/arrow-rs/issues/1718
   
   ## Rationale for this change
   
   The original parallel parquet writer serialized row groups independently. 
This approach has two major disadvantages:
   
   1. High memory usage as each parallel RowGroup is entirely buffered in 
memory before being flushed to ObjectStore.
   2. No easy way to respect the `WriterProperties.max_row_group_size` setting 
as the number of RowGroups is determined by the parallelism.
   
   This PR implements a new approach which parallelizes parquet serialization 
primarily by serializing the different columns of a single row group in 
parallel. Different row groups can still be serialized in parallel if 
`RecordBatches` arrive fast enough that the next row group fills before the 
prior one has finished serializing.
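
   The per-column fan-out can be sketched with plain threads. This is a minimal illustration only, not the actual implementation: the real PR uses the parquet crate's column writers and Tokio tasks, and `encode_column` here is a hypothetical stand-in for column serialization.

```rust
use std::thread;

// Hypothetical stand-in for encoding one column chunk; the actual PR
// delegates this work to the parquet crate's column writers.
fn encode_column(name: &'static str, values: Vec<i64>) -> Vec<u8> {
    let mut out = name.as_bytes().to_vec();
    for v in &values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

// Fan out one worker per column, then collect the encoded chunks back in
// declaration order so the row group is assembled deterministically,
// regardless of which worker finishes first.
fn parallel_encode(columns: Vec<(&'static str, Vec<i64>)>) -> Vec<Vec<u8>> {
    let handles: Vec<_> = columns
        .into_iter()
        .map(|(name, vals)| thread::spawn(move || encode_column(name, vals)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let encoded = parallel_encode(vec![("a", vec![1, 2, 3]), ("b", vec![4, 5, 6])]);
    println!("encoded {} column chunks", encoded.len());
}
```

   Joining the handles in order is what lets the row group be stitched back together correctly even though the columns were serialized concurrently.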
   
   ## What changes are included in this PR?
   
   1. New process to break apart an `ArrowRowGroupWriter` into 
`ArrowColumnWriter` components and send each to a dedicated parallel task.
   2. Respects the `WriterProperties.max_row_group_size` setting, closing a row 
group as soon as this threshold is reached.
   3. If `RecordBatches` are being produced fast enough, the next row group can 
begin serializing before the prior one has finished.
   4. User-configurable limits on the number of `RowGroups` and `RecordBatches` 
that are allowed to accumulate in memory before backpressure kicks in. These 
are made user configurable since the optimal values will depend on a system's 
mix of CPU, memory, and I/O resources.
   5. Refactors the parallel parquet code to improve readability.
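
   The backpressure in point 4 amounts to bounded buffering between the plan that produces `RecordBatches` and the serialization tasks. A minimal sketch using a bounded channel (the channel and the function here are illustrative only; in the PR the limits are exposed as DataFusion configuration options):

```rust
use std::sync::mpsc;
use std::thread;

// Once `capacity` items are buffered, `send` blocks until the consumer
// drains one, so the producer cannot run arbitrarily far ahead of the
// serializer. This is the essence of the backpressure described above.
fn run_with_backpressure(capacity: usize, batches: usize) -> usize {
    let (tx, rx) = mpsc::sync_channel::<usize>(capacity);
    let producer = thread::spawn(move || {
        for i in 0..batches {
            tx.send(i).unwrap(); // blocks while the buffer is full
        }
        // dropping `tx` here closes the channel, ending the consumer loop
    });
    let mut consumed = 0;
    for _batch in rx {
        consumed += 1; // stand-in for serializing one RecordBatch
    }
    producer.join().unwrap();
    consumed
}

fn main() {
    println!("consumed {} batches", run_with_backpressure(2, 10));
}
```

   A larger capacity trades memory for throughput, which is why the PR leaves these limits user-configurable rather than hard-coding them.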
   
   ## Are these changes tested?
   
   Yes, by existing tests.
   
   ## Benchmarking
   
   tl;dr: up to ~20% faster than #7632 with 90%+ lower memory overhead. Less 
than 100MB of additional memory was required vs. the non-parallel version in 
the single-file case.
   
   ### Single File Output = True, 3.6GB Parquet File

   Metric | Single File Parallelism Off | Single File Parallelism On
   -- | -- | --
   Execution Time (s) | 27.55 | 8.94
   Memory Usage (MB) | 279.8 | 364.9
   
   We can now also parallelize the serialization of multiple output parquet 
files at the same time (e.g. 4 threads working on each of 4 parquet files, 
keeping 16 threads busy). The more parquet files we write relative to the 
number of threads in our system, the smaller the benefit of single-file-level 
parallelism.
   
   ### Single File Output = False, ~3.6GB of Total Parquet Files Written
   
   #### Execution Time (s)
   Number of Output Files | Single File Parallelism Off | Single File Parallelism On | Speed Up
   -- | -- | -- | --
   2 | 16.57 | 8.03 | 106%
   4 | 10.98 | 8.93 | 23%
   8 | 9.33 | 8.92 | 5%
   
   #### Memory Usage (MB)
   Number of Output Files | Single File Parallelism Off | Single File Parallelism On | Memory Overhead (MB)
   -- | -- | -- | --
   2 | 1971.4 | 2326.4 | 355
   4 | 2551.0 | 2548.7 | -2.3
   8 | 3408.1 | 3231.1 | -177
   
   ## Are there any user-facing changes?
   
   Faster parquet serialization, along with new configuration options for the 
writer's in-memory buffering limits.

