Re: [PR] Allow Setting Minimum Parallelism with RowCount Based Demuxer [arrow-datafusion]

via GitHub Sat, 21 Oct 2023 04:47:43 -0700


devinjdangelo commented on code in PR #7841:
URL: https://github.com/apache/arrow-datafusion/pull/7841#discussion_r1367716400



##########
datafusion/common/src/config.rs:
##########
@@ -255,6 +255,12 @@ config_namespace! {
         /// Number of files to read in parallel when inferring schema and statistics
         pub meta_fetch_concurrency: usize, default = 32
 
+        /// Guarantees a minimum level of output files running in parallel.
+        /// RecordBatches will be distributed in round robin fashion to each
+        /// parallel writer. Each writer is closed and a new file opened once
+        /// soft_max_rows_per_output_file is reached.
+        pub minimum_parallel_output_files: usize, default = 4

Review Comment:
   I plan to work on a statement-level option soon, so you could easily do:
   
   ```sql
   copy my_in_memory_table to my_dir (format parquet, output_files 32);
   ```
   
   to boost the parallelism for specific plans that benefit from it.
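
For readers following the doc comment in the diff above, here is a minimal sketch of the behavior it describes: batches are dealt round robin to a fixed number of parallel writers, and a writer closes its file and opens a new one once the soft row limit is reached. This is a standalone simulation, not the demuxer code from this PR. The function name `simulate_round_robin` and its parameters are hypothetical, and a batch is represented only by its row count. It also assumes the limit is checked after a whole batch is written, which is one plausible reading of "soft".

```rust
/// Hypothetical simulation of round-robin demuxing (not DataFusion code).
///
/// `batch_rows` holds the row count of each incoming batch, in arrival order.
/// `min_parallel_files` is the number of writers kept busy in parallel.
/// `soft_max_rows` is the row count at which a writer closes its current file.
///
/// Returns the total rows written to each file, in the order files were opened.
fn simulate_round_robin(
    batch_rows: &[usize],
    min_parallel_files: usize,
    soft_max_rows: usize,
) -> Vec<usize> {
    assert!(min_parallel_files > 0, "need at least one writer");

    // Rows written so far to every file that has been opened.
    let mut files: Vec<usize> = Vec::new();
    // For each writer slot, the index in `files` of its open file, if any.
    let mut open: Vec<Option<usize>> = vec![None; min_parallel_files];

    for (i, &rows) in batch_rows.iter().enumerate() {
        // Round robin: batch i goes to writer slot i mod N.
        let slot = i % min_parallel_files;

        // Lazily open a new file for this slot if it has none.
        let file_idx = match open[slot] {
            Some(idx) => idx,
            None => {
                files.push(0);
                let idx = files.len() - 1;
                open[slot] = Some(idx);
                idx
            }
        };

        // The whole batch is written before the limit is checked,
        // so a file may exceed `soft_max_rows` (assumed "soft" semantics).
        files[file_idx] += rows;
        if files[file_idx] >= soft_max_rows {
            // Close the file; the next batch for this slot opens a new one.
            open[slot] = None;
        }
    }

    files
}
```

With six batches of 10 rows, two writers, and a soft limit of 20, this yields four files holding 20, 20, 10, and 10 rows. With a single writer and three batches of 15 rows, the first file ends up with 30 rows because the limit is only checked after each batch.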



