edmondop commented on issue #9370:
URL: https://github.com/apache/datafusion/issues/9370#issuecomment-2156109681

   I spent a bit of extra time on this and I have some thoughts worth sharing.
   
   `pull_from_input` is the task that pulls the record and send them to the 
ouput channels. However, it uses a mutable batch partitioner that computes the 
hash and publish results, see
   
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/repartition/mod.rs#L766.
 Therefore, we cannot at the moment call `partition_iter` in parallel on the 
same input. We have two strategies:
   1. Refactor the BatchPartitioner so to we can invoke it in parallel (i.e. 
maybe using locking? haven't explored this much)
   2. Repartition the input that we feed in `pull_from_input`.
   
   Our current setup does (2) but it does by creating multiple nodes in the 
plan, which is confusing. It is possible to create an internal RepartitionExec 
when the partitioning is Hash partitioning and the input partition is one, if 
we want to simplify the plan. I am not sure how much refactor is possible 
though, would appreciate help / thoughts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to