[ https://issues.apache.org/jira/browse/MAPREDUCE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506535#comment-13506535 ]
Radim Kolar commented on MAPREDUCE-4827: ---------------------------------------- We can get another fast hash function from Knutt book if you are so paranoid. Thus the hash function is: h(k) = floor(m * (kA - floor(kA))) In this case, the value of m is not critical and we typically choose a power of 2 so that we can get the following efficient procedure on most digital computers: Choose m = 2p. Multiply the w bits of k by floor(A * 2w) to obtain a 2w bit product. Extract the p most significant bits of the lower half of this product. It seems that: A = (sqrt(5)-1)/2 = 0.6180339887 is a good choice Knuth, "Sorting and Searching", v. 3 of "The Art of Computer Programming" > Increase hash quality of HashPartitioner > ---------------------------------------- > > Key: MAPREDUCE-4827 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4827 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Reporter: Radim Kolar > Attachments: betterhash1.txt > > > hash partitioner is using object.hashCode() for splitting keys into > partitions. This results in bad distributions because hashCode() quality is > poor. > These hashCode() functions are sometimes written by hand (very poor quality) > and sometimes generated from by commons lang code (poor quality). Applying > some transformation on top of hashCode() provides better distribution. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira