kosiew commented on issue #18513:
URL: https://github.com/apache/datafusion/issues/18513#issuecomment-3695592071

   Hi @AdamGS,
   
   I investigated this, and the extra RepartitionExec in the filtered plan isn’t 
an arbitrary inefficiency: the distribution optimizer inserts it to raise the 
number of partitions when it estimates that parallel round-robin repartitioning 
will be beneficial.
   That decision is governed by the target_partitions, 
enable_round_robin_repartition, and repartition_file_scans settings. 
   
   
https://github.com/kosiew/datafusion/blob/4960284541a8394034fd7f82833571fd601633bf/datafusion/physical-optimizer/src/enforce_distribution.rs#L1181-L1341
   
   Since file-scan repartitioning is enabled by default, even small inputs may 
be repartitioned for parallelism; you can turn it off or lower 
target_partitions if the overhead outweighs the benefit for tiny datasets.
   
   
https://github.com/kosiew/datafusion/blob/4960284541a8394034fd7f82833571fd601633bf/datafusion/common/src/config.rs#L952-L966
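
   As a rough illustration, the settings above can be changed per session with 
SQL SET statements. This is only a sketch: the fully qualified keys assume the 
usual datafusion.execution / datafusion.optimizer namespaces, the values are 
arbitrary, and the table t and column x in the EXPLAIN are hypothetical.

   ```sql
   -- Lower the partition count the optimizer aims for (value is illustrative).
   SET datafusion.execution.target_partitions = 1;

   -- Or keep target_partitions and switch off the individual behaviours instead.
   SET datafusion.optimizer.enable_round_robin_repartition = false;
   SET datafusion.optimizer.repartition_file_scans = false;

   -- Re-check the plan; t and x are placeholder names, not from the issue.
   EXPLAIN SELECT * FROM t WHERE x > 1;
   ```

   Comparing the EXPLAIN output before and after should show whether the 
RepartitionExec is still inserted for the filtered query.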


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

