devinjdangelo commented on code in PR #7841:
URL: https://github.com/apache/arrow-datafusion/pull/7841#discussion_r1367716153


##########
datafusion/common/src/config.rs:
##########
@@ -255,6 +255,12 @@ config_namespace! {
         /// Number of files to read in parallel when inferring schema and 
statistics
         pub meta_fetch_concurrency: usize, default = 32
 
+        /// Guarantees a minimum number of output files written in parallel.
+        /// RecordBatches will be distributed in round-robin fashion to each
+        /// parallel writer. Each writer is closed and a new file is opened once
+        /// soft_max_rows_per_output_file is reached.
+        pub minimum_parallel_output_files: usize, default = 4

Review Comment:
   The returns from additional cores seem to decline very quickly beyond 4 tasks in
my testing. I believe this is because at ~4 parallel serialization tasks,
serialization no longer bottlenecks the end-to-end execution plan. Going beyond 4
tasks mostly yields higher memory usage and smaller output files for little benefit.
   
   My testing is mostly on a 32-core system. I have not tested on enough
different configurations to know whether core_count/8 is a reasonable default or
whether a static default of 4 tasks is sufficient.
   
   It will also depend a lot on the actual execution plan. If you are writing a
pre-cached in-memory dataset, then you definitely want 1 task/output file per
core.


