tustvold opened a new issue #1731:
URL: https://github.com/apache/arrow-datafusion/issues/1731


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   The `Repartition` optimizer recurses into plans and looks for operators with 
an output partition count less than the target partition count, and on finding 
such a partition adds a `RepartitionExec` node.
   
   Unfortunately this does not take into account how this might impact the 
concurrency of the plan as a whole.
   
   For example, in IOx it is fairly common to have a `UnionExec` with a list of 
many `IOxReadFilterNode` as children. Each of these `IOxReadFilterNode` is a 
single partition, and therefore the `Repartition` optimizer repartitions the 
output of each of these `IOxReadFilterNode`.
   
   This leads to a partition explosion, with the `UnionExec` going from 
potentially having `x` partitions, to `n * x` where `n` is the target 
parallelism. In some of our plans this is leading to thousands of partitions, 
which is not ideal :grin: 
   
   A similar quirk can occur when there is an operator further up the chain 
that imposes some partitioning constraint, e.g. a sort, but isn't the direct 
parent. I'm fairly certain that I've seen plans generated that 
`RepartitionExec` into a `LimitExec` into a `CoalescePartitionsExec`. 
   
   **Describe the solution you'd like**
   
   I don't really know how to solve this, but I wanted to flag up that the 
current greedy approach can be pretty sub-optimal, and that perhaps a more 
holistic optimizer that can take into account the plan as a whole might be 
beneficial
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to