[ https://issues.apache.org/jira/browse/MAHOUT-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-1242:
----------------------------------
    Affects Version/s: 0.7
                       0.8
        Fix Version/s: 0.9
             Assignee: Suneel Marthi

> No key redistribution function for associative maps
> ---------------------------------------------------
>
>                 Key: MAHOUT-1242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1242
>             Project: Mahout
>          Issue Type: Improvement
>          Components: collections, Math
>    Affects Versions: 0.7, 0.8
>            Reporter: Dawid Weiss
>            Assignee: Suneel Marthi
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1242.patch
>
>
> All integer-based maps currently use HashFunctions.hash(int), which just
> returns the key value:
> {code}
>   /**
>    * Returns a hashcode for the specified value.
>    *
>    * @return a hash code value for the specified value.
>    */
>   public static int hash(int value) {
>     return value;
>     //return value * 0x278DDE6D; // see org.apache.mahout.math.jet.random.engine.DRand
>     /*
>     value &= 0x7FFFFFFF; // make it >=0
>     int hashCode = 0;
>     do hashCode = 31*hashCode + value%10;
>     while ((value /= 10) > 0);
>     return 28629151*hashCode; // spread even further; h*31^5
>     */
>   }
> {code}
> This easily leads to very degenerate behavior (long collision chains) on
> keys that have constant lower bits. A simple (and strong) hash function like
> the final step of murmurhash3 goes a long way toward ensuring that the key
> distribution is more uniform regardless of the input distribution.
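
For readers who have not seen the "final step of murmurhash3" mentioned in the issue, here is a minimal sketch of that mixing step (usually called fmix32) next to the identity hash quoted above. The class name IntKeyMixDemo, the helper distinctSlots, and the power-of-two table indexed with a bit mask are assumptions made for illustration only. They are not taken from MAHOUT-1242.patch and say nothing about how Mahout's maps actually size or index their tables.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical demo class for illustration; not part of Mahout. */
final class IntKeyMixDemo {

  /** The 32-bit finalization (avalanche) step of MurmurHash3, "fmix32". */
  static int mix(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  /** Identity hash, matching the behavior quoted in the issue. */
  static int identity(int value) {
    return value;
  }

  /**
   * Counts how many distinct slots the keys 0, stride, 2*stride, ... occupy.
   * Assumes tableSize is a power of two so that (hash & mask) picks a slot.
   */
  static int distinctSlots(boolean useMix, int keyCount, int stride, int tableSize) {
    Set<Integer> slots = new HashSet<>();
    int mask = tableSize - 1;
    for (int i = 0; i < keyCount; i++) {
      int key = i * stride;
      int hash = useMix ? mix(key) : identity(key);
      slots.add(hash & mask);
    }
    return slots.size();
  }

  public static void main(String[] args) {
    // Keys are multiples of 1024, so their lowest 10 bits are always zero.
    System.out.println("identity: " + distinctSlots(false, 1000, 1024, 1024) + " slot(s)");
    System.out.println("fmix32:   " + distinctSlots(true, 1000, 1024, 1024) + " slot(s)");
  }
}
```

Under these assumptions the identity hash places all 1000 keys in a single slot, because the mask keeps only the ten low bits and those bits are always zero. The mixing step folds the high bits into the low ones, so the same keys land in many different slots.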