alamb commented on PR #17040: URL: https://github.com/apache/datafusion/pull/17040#issuecomment-3154577025
> Great explanation! Thanks.
>
> This configuration is important to tune, but at the same time it's really challenging to find the optimal value. In some workloads on small datasets, the optimal number of partitions may also lie between 1 and the number of physical cores.
>
> I've been thinking about something similar for memory-limited scenarios: if the memory budget is very tight, using a smaller `target_partition` and `batch_size` can actually lead to faster query completion.
>
> I'm looking forward to seeing some ML magic help us find the optimal tuning.

That is a good point. I think the classic industrial approach is to use cost models to predict the size of intermediates, and thus memory consumption. However, that comes with all the problems of cardinality estimation.

I think the "state of the art" approach these days is to do something dynamic -- for example, when the plan starts hitting memory pressure, reconfigure the plan / partitioning at that point and redistribute memory. However, I don't know of any real-world system that does this, at least not to great effect, and I think it would be very complicated to implement.

Maybe we can start by adding a note to the tuning guide that when the memory budget is very tight, using fewer target partitions can be helpful.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
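For reference, such a tuning-guide note could point readers at the relevant settings. A minimal sketch for `datafusion-cli` under a tight memory budget; the specific values here are illustrative assumptions, not recommendations, and the optimal numbers are workload-dependent:

```sql
-- Fewer partitions means fewer concurrent operators holding
-- in-flight batches, lowering peak memory use
SET datafusion.execution.target_partitions = 2;

-- Smaller batches shrink the per-operator memory footprint
-- (default is 8192 rows)
SET datafusion.execution.batch_size = 1024;
```

The same knobs are available programmatically via `SessionConfig::with_target_partitions` and `SessionConfig::with_batch_size`.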