alamb commented on PR #17040:
URL: https://github.com/apache/datafusion/pull/17040#issuecomment-3154577025

   > Great explanation! Thanks.
   > 
   > This configuration is important to tune, but at the same time it's really 
challenging to find the optimal value. In some workloads on small datasets, the 
optimal number of partitions may also lie between 1 and the number of physical 
cores.
   > 
   > I've been thinking about something similar for memory-limited scenarios: 
if the memory budget is very tight, using a smaller `target_partitions` and 
`batch_size` can actually lead to faster query completion.
   > 
   > I'm looking forward to seeing some ML magic help us find the optimal 
tuning.
   
   That is a good point
   
   I think the classic industrial approach is to use cost models to predict the 
size of intermediates, and thus memory consumption. However, that comes with 
all the problems of cardinality estimation. 
   
   I think the "state of the art" approach these days is to do something 
dynamic -- e.g., when the plan starts hitting memory pressure, reconfigure the 
plan / partitioning at that point and redistribute the memory. 
   
   However, I don't know of any real-world system that does this, at least not 
to great effect, and I think it would be very complicated to implement.
   
   Maybe we can start by adding a note to the tuning guide that when the memory 
budget is very tight, using fewer target partitions can be helpful.
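   For reference, both knobs mentioned above are ordinary DataFusion session 
settings. A minimal sketch of such a note might look like this -- the values 
here are arbitrary illustrations, not recommendations, since the optimal 
numbers are workload-dependent:
   
   ```sql
   -- Example low-memory settings; tune for your workload
   SET datafusion.execution.target_partitions = 2;
   SET datafusion.execution.batch_size = 1024;
   ```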


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

