[ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177031#comment-13177031 ]
Hudson commented on MAHOUT-937:
-------------------------------

Integrated in Mahout-Quality #1278 (See [https://builds.apache.org/job/Mahout-Quality/1278/])
    MAHOUT-937 make partitioner send to different reducers (as intended it seems) by just using the hash of primary bytes

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225420
Files :
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/GramKeyPartitioner.java

> Collocations Job Partitioner not being configured properly
> ----------------------------------------------------------
>
>                 Key: MAHOUT-937
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-937
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.5
>            Reporter: Mat Kelcey
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: GramKeyPartitioner.java, MAHOUT-937.patch
>
>
> The first pass of the collocations discovery job (as described by
> CollocDriver.generateCollocations) uses the
> org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner.
> This partitioner has an instance variable, offset, that is supposed to be set by
> a call to setOffsets(), but this call is never made (not sure why; is this
> method expected to be called by the Hadoop framework itself?).
> Because the offset is never set, getPartition always returns 0, and so
> all intermediate data is sent to a single reducer.
> I couldn't quite understand what this partitioning was meant to be doing, but
> simply hashing the Gram's primary string representation (i.e. without the
> leading 'type' byte) does what is required...
> {code}
> public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {
>   @Override
>   public int getPartition(GramKey key, Gram value, int numPartitions) {
>     // exclude first byte which is the key type
>     byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength()-1];
>     System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);
>     int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
>     return (hash & Integer.MAX_VALUE) % numPartitions;
>   }
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
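
For readers who want to try the partitioning idea from the quoted description without a Hadoop cluster, here is a minimal, dependency-free sketch. It is not the committed GramKeyPartitioner. The class `PartitionSketch`, the helpers `hashBytes`, `partitionFor` and `key`, and the `'h'`/`'t'` type bytes are all hypothetical names made up for illustration, and the hash is a plain 31-multiplier byte hash standing in for whatever the real job uses. It only shows the three steps the description relies on: skip the leading type byte, hash the remaining primary bytes, then clear the sign bit before taking the modulus.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a stand-alone model of "hash the primary bytes minus the type byte".
final class PartitionSketch {

  // Hypothetical stand-in hash over bytes[offset, offset + length); assumed, not the Hadoop call.
  static int hashBytes(byte[] bytes, int offset, int length) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }

  // Skips byte 0 (the key type), hashes the rest of the primary part,
  // clears the sign bit so the result is non-negative, then reduces modulo numPartitions.
  static int partitionFor(byte[] keyBytes, int primaryLength, int numPartitions) {
    int hash = hashBytes(keyBytes, 1, primaryLength - 1);
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  // Builds a toy key: one type byte followed by the UTF-8 bytes of the gram text.
  static byte[] key(char type, String gram) {
    byte[] text = gram.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[text.length + 1];
    out[0] = (byte) type;
    System.arraycopy(text, 0, out, 1, text.length);
    return out;
  }

  public static void main(String[] args) {
    int numPartitions = 4;
    for (String gram : new String[] {"new york", "machine learning", "hadoop", "mahout"}) {
      byte[] a = key('h', gram); // 'h' and 't' are made-up type bytes
      byte[] b = key('t', gram);
      System.out.println(gram + " -> "
          + partitionFor(a, a.length, numPartitions) + " / "
          + partitionFor(b, b.length, numPartitions));
    }
  }
}
```

In this sketch, two keys that differ only in their type byte always land in the same partition, and the `& Integer.MAX_VALUE` mask keeps the index in range even when the hash is negative. Hashing in place from offset 1 avoids the temporary array copy used in the quoted snippet; whether an offset-taking hash is available in a given Hadoop version is something to check before adopting that variation.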