[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506535#comment-13506535
 ] 

Radim Kolar commented on MAPREDUCE-4827:
----------------------------------------

We can get another fast hash function from Knutt book if you are so paranoid.

Thus the hash function is:

h(k) = floor(m * (kA - floor(kA)))

In this case, the value of m is not critical and we typically choose a power of 
2 so that we can get the following efficient procedure on most digital 
computers:

    Choose m = 2p.
    Multiply the w bits of k by floor(A * 2w) to obtain a 2w bit product.
    Extract the p most significant bits of the lower half of this product.

    It seems that:

    A = (sqrt(5)-1)/2 = 0.6180339887

    is a good choice 


Knuth, "Sorting and Searching", v. 3 of "The Art of Computer Programming"
                
> Increase hash quality of HashPartitioner
> ----------------------------------------
>
>                 Key: MAPREDUCE-4827
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4827
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Radim Kolar
>         Attachments: betterhash1.txt
>
>
> hash partitioner is using object.hashCode() for splitting keys into 
> partitions. This results in bad distributions because hashCode() quality is 
> poor. 
> These hashCode() functions are sometimes written by hand (very poor quality) 
> and sometimes generated from by commons lang code (poor quality). Applying 
> some transformation on top of hashCode() provides better distribution.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to