tustvold opened a new issue #1731: URL: https://github.com/apache/arrow-datafusion/issues/1731
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** The `Repartition` optimizer recurses into plans and looks for operators with an output partition count less than the target partition count, and on finding such a partition adds a `RepartitionExec` node. Unfortunately this does not take into account how this might impact the concurrency of the plan as a whole. For example, in IOx it is fairly common to have a `UnionExec` with a list of many `IOxReadFilterNode` as children. Each of these `IOxReadFilterNode` is a single partition, and therefore the `Repartition` optimizer repartitions the output of each of these `IOxReadFilterNode`. This leads to a partition explosion, with the `UnionExec` going from potentially having `x` partitions, to `n * x` where `n` is the target parallelism. In some of our plans this is leading to thousands of partitions, which is not ideal :grin: A similar quirk can occur when there is an operator further up the chain that imposes some partitioning constraint, e.g. a sort, but isn't the direct parent. I'm fairly certain that I've seen plans generated that `RepartitionExec` into a `LimitExec` into a `CoalescePartitionsExec`. **Describe the solution you'd like** I don't really know how to solve this, but I wanted to flag up that the current greedy approach can be pretty sub-optimal, and that perhaps a more holistic optimizer that can take into account the plan as a whole might be beneficial -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
