[ https://issues.apache.org/jira/browse/MAHOUT-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837560#comment-13837560 ]
Tharindu Rusira edited comment on MAHOUT-1242 at 12/3/13 10:46 AM:
-------------------------------------------------------------------

Hi [~dweiss],
I'm currently working on this issue. For now I have attached a simple implementation of the final step of murmurhash3, as you suggested. Your feedback would be highly appreciated.
Thanks.
P.S. The patch has not been tested yet.

was (Author: tharindu_rusira):
Hi [~dawidweiss],
I'm currently working on this issue. For now I have attached a simple implementation of the final step of murmurhash3, as you suggested. Your feedback would be highly appreciated.
Thanks.
P.S. The patch has not been tested yet.

> No key redistribution function for associative maps
> ---------------------------------------------------
>
>                 Key: MAHOUT-1242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1242
>             Project: Mahout
>          Issue Type: Improvement
>          Components: collections, Math
>            Reporter: Dawid Weiss
>         Attachments: MAHOUT-1242.patch
>
>
> All integer-based maps currently use HashFunctions.hash(int), which just
> returns the key value:
> {code}
>   /**
>    * Returns a hashcode for the specified value.
>    *
>    * @return a hash code value for the specified value.
>    */
>   public static int hash(int value) {
>     return value;
>     //return value * 0x278DDE6D; // see org.apache.mahout.math.jet.random.engine.DRand
>     /*
>     value &= 0x7FFFFFFF; // make it >=0
>     int hashCode = 0;
>     do hashCode = 31*hashCode + value%10;
>     while ((value /= 10) > 0);
>     return 28629151*hashCode; // spread even further; h*31^5
>     */
>   }
> {code}
> This easily leads to very degenerate behavior (long collision chains) on
> keys that have constant lower bits. A simple (and strong) hash function like
> the final step of murmurhash3 goes a long way toward ensuring that the key
> distribution is more uniform regardless of the input distribution.
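
For reference, here is a minimal sketch of what the "final step of murmurhash3" (the 32-bit finalizer, usually called fmix32) could look like as a key redistribution function. This is not the attached MAHOUT-1242.patch; the class name MurmurFinalizerSketch and the methods mix and distinctBuckets are hypothetical names chosen for illustration. The bucket count check assumes a power-of-two table indexed by masking the low bits, which is an assumption made only for the demonstration and may not match how the Mahout maps compute their slot index.

```java
// Illustrative sketch only; names are hypothetical and not part of Mahout.
final class MurmurFinalizerSketch {

  private MurmurFinalizerSketch() {
  }

  /**
   * 32-bit finalization mix of MurmurHash3 (fmix32).
   * Each step is invertible, so distinct keys map to distinct hash values.
   */
  static int mix(int value) {
    int h = value;
    h ^= h >>> 16;      // fold the high half into the low half
    h *= 0x85ebca6b;    // multiply by an odd constant (wraps modulo 2^32)
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  /**
   * Counts how many distinct buckets are hit by the keys (i << shift) for
   * i in [0, keyCount), assuming a table of size tableSize (a power of two)
   * indexed by masking the low bits of the hash. The masking scheme is an
   * assumption of this demo.
   */
  static int distinctBuckets(boolean useMix, int keyCount, int shift, int tableSize) {
    boolean[] used = new boolean[tableSize];
    int distinct = 0;
    for (int i = 0; i < keyCount; i++) {
      int key = i << shift;               // keys with constant (zero) lower bits
      int h = useMix ? mix(key) : key;    // identity hash vs. mixed hash
      int bucket = h & (tableSize - 1);
      if (!used[bucket]) {
        used[bucket] = true;
        distinct++;
      }
    }
    return distinct;
  }

  public static void main(String[] args) {
    // Keys are multiples of 1024, so their 10 lowest bits are always zero.
    System.out.println("identity: " + distinctBuckets(false, 1024, 10, 1024) + " buckets used");
    System.out.println("fmix32:   " + distinctBuckets(true, 1024, 10, 1024) + " buckets used");
  }
}
```

With the identity hash, every key in this demo lands in the same bucket, which is the degenerate case described in the issue. The mixed hash spreads the same keys over many buckets, at the cost of two multiplications and three shift/xor steps per lookup.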