edmondop commented on issue #9370: URL: https://github.com/apache/datafusion/issues/9370#issuecomment-2156109681
I spent a bit of extra time on this and I have some thoughts worth sharing. `pull_from_input` is the task that pulls the record and send them to the ouput channels. However, it uses a mutable batch partitioner that computes the hash and publish results, see https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/repartition/mod.rs#L766. Therefore, we cannot at the moment call `partition_iter` in parallel on the same input. We have two strategies: 1. Refactor the BatchPartitioner so to we can invoke it in parallel (i.e. maybe using locking? haven't explored this much) 2. Repartition the input that we feed in `pull_from_input`. Our current setup does (2) but it does by creating multiple nodes in the plan, which is confusing. It is possible to create an internal RepartitionExec when the partitioning is Hash partitioning and the input partition is one, if we want to simplify the plan. I am not sure how much refactor is possible though, would appreciate help / thoughts -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
