[PR] Allow Setting Minimum Parallelism with RowCount Based Demuxer [arrow-datafusion]

via GitHub Mon, 16 Oct 2023 16:00:31 -0700


devinjdangelo opened a new pull request, #7841:
URL: https://github.com/apache/arrow-datafusion/pull/7841


   ## Which issue does this PR close?
   
   Addresses performance regression of #7791
   
   ## Rationale for this change
   
   #7791 introduced a row count targeting execution time partitioning strategy 
for DataSinks. The initial implementation only writes a single file at a time, 
which guarantees that only 1 file will ever be written with 
<`soft_max_rows_per_output_file` rows and all others will have >= 
`soft_max_rows_per_output_file`. This PR introduces a new setting 
`minimum_parallel_output_files` which will write N files in parallel, each 
targeting `soft_max_rows_per_output_file`. This allows the user to configure a 
balance between parallelism and achieving the desired file size. 
   
   The behavior of this PR is identical to #7791 if 
minimum_parallel_output_files is set to 1.
   
   ## What changes are included in this PR?
   
   - Adds `minimum_parallel_output_files` config setting
   - Creates new file writers on-demand as batches arrive, so if there is only 
1 batch only 1 file will be written regardless of the 
`minimum_parallel_output_files` setting. 
   - Updates tests to account for this new setting
   
   ## Are these changes tested?
   
   Yes by existing tests
   
   ## Are there any user-facing changes?
   
   Default behavior is now to output at least 4 files in parallel even if 
`soft_max_rows_per_output_file` is not reached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Allow Setting Minimum Parallelism with RowCount Based Demuxer [arrow-datafusion]

Reply via email to