Re: [I] Why hashing benefits from partitioning? [arrow-datafusion]

via GitHub Mon, 16 Oct 2023 23:17:38 -0700


YjyJeff commented on issue #7834:
URL: 
https://github.com/apache/arrow-datafusion/issues/7834#issuecomment-1765738419


   > This is to increase parallelism; the documentation explains it, see the 
[`add_roundrobin_on_top` 
doc](https://github.com/apache/arrow-datafusion/blob/cb2d03c826c2104ee6960cd6c3652146f800ae47/datafusion/core/src/physical_optimizer/enforce_distribution.rs#L947C13-L947C13)
   
   I know round robin can increase the number of partitions. My point is that 
**increasing the parallelism here will decrease performance**. 
`RepartitionExec: partitioning=Hash` can also increase the parallelism. 
   
   In this example, if we add the round-robin repartition, the partition 
counts go `1 -> 2 -> 2`. If we remove it, they go `1 -> 2`. Either way, we 
have 2 partitions after the `AggregateExec: mode=FinalPartitioned`. What's 
more, `RepartitionExec` is expensive when both the input and output partition 
counts are large. 
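
To make the `1 -> 2 -> 2` versus `1 -> 2` comparison concrete, here is a toy sketch in plain Python. It is not DataFusion code: the functions `round_robin` and `hash_repartition` and the sample rows are hypothetical, and the model only counts how many times rows are routed. It ignores any work (such as a partial aggregate) that might run in parallel between the two repartitions.

```python
def round_robin(partitions, n_out):
    """Deal rows across n_out outputs in turn; return outputs and rows routed."""
    out = [[] for _ in range(n_out)]
    moved = 0
    for part in partitions:
        for row in part:
            out[moved % n_out].append(row)
            moved += 1
    return out, moved


def hash_repartition(partitions, n_out):
    """Route each (key, value) row by hash(key); return outputs and rows routed."""
    out = [[] for _ in range(n_out)]
    moved = 0
    for part in partitions:
        for key, value in part:
            out[hash(key) % n_out].append((key, value))
            moved += 1
    return out, moved


# One input partition of 20 rows with 5 distinct integer keys (made-up data).
source = [[(k % 5, k) for k in range(20)]]

# Plan with round robin: 1 -> 2 -> 2
rr, moved_rr = round_robin(source, 2)
with_rr, moved_hash_a = hash_repartition(rr, 2)

# Plan without round robin: 1 -> 2
without_rr, moved_hash_b = hash_repartition(source, 2)

total_with_rr = moved_rr + moved_hash_a
total_without_rr = moved_hash_b

# Both plans end with the same rows in the same hash partitions.
same_layout = [sorted(p) for p in with_rr] == [sorted(p) for p in without_rr]

print(total_with_rr, total_without_rr, same_layout)
```

In this toy model both plans end with the same two hash partitions, but the plan with round robin routes every row twice.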


