[ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-937:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Collocations Job Partitioner not being configured properly
> ----------------------------------------------------------
>
>                 Key: MAHOUT-937
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-937
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: GramKeyPartitioner.java, MAHOUT-937.patch
>
>
> The first pass of the collocations discovery job (as described by
> CollocDriver.generateCollocations) uses the
> org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner.
> This partitioner has an instance variable offset that is supposed to be set by
> a call to setOffsets(), but this call is never made (not sure why? is this
> method expected to be called by the Hadoop framework itself?)
> Because the offset is never set, getPartition always returns 0, and so
> all intermediate data is sent to a single reducer.
> I couldn't quite understand what this partitioning was meant to be doing, but
> simply hashing the Gram's primary string representation (i.e. without the
> leading 'type' byte) does what is required...
> {code}
> import org.apache.hadoop.io.WritableComparator;
> import org.apache.hadoop.mapreduce.Partitioner;
>
> public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {
>   @Override
>   public int getPartition(GramKey key, Gram value, int numPartitions) {
>     // exclude first byte which is the key type
>     byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength() - 1];
>     System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0,
>         keyBytesWithoutTypeByte.length);
>     int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte,
>         keyBytesWithoutTypeByte.length);
>     // clear the sign bit so the modulo result is never negative
>     return (hash & Integer.MAX_VALUE) % numPartitions;
>   }
> }
> {code}

--
This message is automatically generated by JIRA.
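For readers without a Hadoop setup at hand, the hashing idea in the quoted snippet can be tried in isolation. The following is only a minimal sketch under stated assumptions: the class name PartitionSketch, the helper key(), the 'h' and 't' type bytes, and the 31-based hashBytes stand-in are all hypothetical and are not taken from Mahout or Hadoop. It shows that once the leading type byte is skipped, two keys that differ only in that byte map to the same partition, and that the result always lies in [0, numPartitions).

```java
import java.nio.charset.StandardCharsets;

public class PartitionSketch {

    // Hypothetical stand-in for a byte-array hash; not the Hadoop method itself.
    static int hashBytes(byte[] bytes, int offset, int length) {
        int hash = 1;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + bytes[i];
        }
        return hash;
    }

    // Hash everything after the leading type byte, then map to a partition.
    static int partition(byte[] keyBytes, int primaryLength, int numPartitions) {
        int hash = hashBytes(keyBytes, 1, primaryLength - 1);
        // Clear the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // Assumed key layout for this sketch: one type byte followed by UTF-8 text.
    static byte[] key(char type, String text) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[textBytes.length + 1];
        out[0] = (byte) type;
        System.arraycopy(textBytes, 0, out, 1, textBytes.length);
        return out;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        String[] grams = {"new york", "san francisco", "hello world", "big data"};
        for (String gram : grams) {
            byte[] a = key('h', gram); // 'h' and 't' are made-up type markers
            byte[] b = key('t', gram);
            System.out.println(gram + " -> "
                + partition(a, a.length, numPartitions) + " / "
                + partition(b, b.length, numPartitions));
        }
    }
}
```

By contrast, a getPartition that returns a constant 0, as described in the report, would print the same partition for every gram regardless of numPartitions.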