[ https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840560#action_12840560 ]
Sean Owen commented on MAHOUT-320:
----------------------------------

Oh, what are we referring to by 'binary'? Bigram has some bits like this that seem to be able to compare based on a byte representation -- assuming that's a nice optimization within Hadoop:

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int ret;
    try {
      int firstb1 = WritableComparator.readVInt(b1, s1);
      int firstb2 = WritableComparator.readVInt(b2, s2);
      ret = firstb1 - firstb2;
    } catch (IOException ioe) {
      throw new IllegalArgumentException(ioe);
    }
    return ret;
  }

(Though we gotta fix this returning firstb1 - firstb2 thing -- the subtraction can overflow, which gives the result the wrong sign for about 1 in 4 of all possible int pairs!)

> Modify IntPairWritable in LDA implementation to be binary comparable to improve performance.
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-320
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-320
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Robin Anil
>            Priority: Minor
>         Attachments: MAHOUT-320.patch
>
>
> Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to
> be binary comparable will improve the performance of the comparison
> operations during a sort because no marshaling will need to occur to compare
> IntPairWritable instances.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
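
To make the overflow concern in the comment concrete, here is a minimal, self-contained sketch. It is not the MAHOUT-320 patch and it does not use Hadoop: the class name IntCompareSketch and its helpers are hypothetical, and the byte layout (two fixed-width, big-endian 4-byte ints per pair) is an assumption chosen for illustration, whereas the quoted Bigram code reads a variable-length VInt. The sketch contrasts the subtraction idiom with Integer.compare (available since Java 7) and shows one way a pair could be compared directly on its serialized bytes, first element then second.

```java
import java.nio.ByteBuffer;

public class IntCompareSketch {

    // The idiom from the quoted code: use the difference as the comparison result.
    // The difference wraps around whenever it does not fit in an int.
    public static int compareBySubtraction(int a, int b) {
        return a - b;
    }

    // Overflow-safe alternative from the JDK (Java 7 and later).
    public static int compareSafely(int a, int b) {
        return Integer.compare(a, b);
    }

    // Hypothetical helper: serialize a pair as two big-endian 4-byte ints.
    // (ByteBuffer is big-endian by default.)
    public static byte[] encodePair(int first, int second) {
        return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
    }

    // Hypothetical helper: read one big-endian int starting at offset 'start'.
    public static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             | (bytes[start + 3] & 0xff);
    }

    // Compare two serialized pairs without building objects:
    // order by the first int, then by the second.
    public static int comparePairs(byte[] b1, int s1, byte[] b2, int s2) {
        int c = Integer.compare(readInt(b1, s1), readInt(b2, s2));
        if (c != 0) {
            return c;
        }
        return Integer.compare(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
    }

    public static void main(String[] args) {
        int a = Integer.MIN_VALUE;
        int b = 1;
        // a is smaller than b, so a correct comparison must be negative.
        System.out.println("subtraction: " + compareBySubtraction(a, b));
        System.out.println("safe:        " + compareSafely(a, b));
        System.out.println("pairs:       "
            + comparePairs(encodePair(3, -7), 0, encodePair(3, 5), 0));
    }
}
```

With a = Integer.MIN_VALUE and b = 1, the subtraction wraps around to Integer.MAX_VALUE and so reports a as the larger value, while Integer.compare returns a negative number as expected. On a JDK without Integer.compare, an explicit `a < b ? -1 : (a == b ? 0 : 1)` has the same effect.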