[GitHub] [arrow-datafusion] Dandandan commented on issue #1404: Hash partitioning not working properly

GitBox Mon, 06 Dec 2021 07:13:06 -0800


Dandandan commented on issue #1404:
URL: 
https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-986868097



   Hey @andrei-ionescu 
   
   I am not sure if it's really not working properly.
   
   You specified a hash expression `partitioning_columns` and `72` partitions.
   
   If you use `repartition`, the expression will be hashed and the rows divided 
over those 72 partitions.
   
   The *only* guarantee of hash repartitioning is that equal values (based on the 
expression) will end up in the same partition.
   
   This is based on a simple formula `hash(expr) % n_partitions`.
   
   However, two things can happen:
   
   * Two different values can end up in the same partition.
   * A partition `n` can have no values: no row has `hash(expr) % n_partitions` 
equal to `n`.
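
   A minimal sketch of the behaviour described above, using only the Rust standard library. The helpers `partition_of` and `empty_partitions` are hypothetical names made up for illustration, and `DefaultHasher` is a stand-in: DataFusion uses its own hash function, so the actual partition numbers will differ. The point is only the `hash(expr) % n_partitions` shape.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: assign a key to a partition via `hash(key) % n_partitions`.
/// `DefaultHasher` is an assumption for illustration, not DataFusion's hash function.
/// `n_partitions` must be non-zero, otherwise the modulo panics.
fn partition_of<T: Hash>(key: &T, n_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % n_partitions
}

/// Hypothetical helper: count the partitions that receive no key at all.
fn empty_partitions<T: Hash>(keys: &[T], n_partitions: u64) -> usize {
    // Collect the distinct partition indices that are actually hit.
    let used: HashSet<u64> = keys
        .iter()
        .map(|k| partition_of(k, n_partitions))
        .collect();
    n_partitions as usize - used.len()
}
```

   With 72 partitions and only five distinct keys, at least 67 partitions must stay empty. With 100 distinct keys, at least two different keys must share a partition. In both cases, calling `partition_of` twice with the same key returns the same partition, which is the only guarantee.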
   
   Does this address your issue?
   
   I also opened https://github.com/apache/arrow-datafusion/issues/1405 to have 
a look at empty batches coming out of repartitioning, but that is more related to 
performance than to correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
