[ 
https://issues.apache.org/jira/browse/MAHOUT-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840560#action_12840560
 ] 

Sean Owen commented on MAHOUT-320:
----------------------------------

Oh what are we referring to by 'binary'?

Bigram has some code like this that compares directly on the serialized byte 
representation -- I'm assuming that's a nice optimization within Hadoop:

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int ret;
      try {
        int firstb1 = WritableComparator.readVInt(b1, s1);
        int firstb2 = WritableComparator.readVInt(b2, s2);
        ret = firstb1 - firstb2;
      } catch (IOException ioe) {
        throw new IllegalArgumentException(ioe);
      }
      return ret;
    }

(Though we gotta fix this returning firstb1 - firstb2 thing -- overflow makes 
this result incorrect for about 1 in 4 possible pair values!)
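
To make the overflow concrete, here is a minimal sketch that puts the subtraction idiom next to an explicit comparison. The class and method names (SafeIntCompare, bySubtraction, byComparison) are hypothetical and exist only for illustration; they are not part of Mahout or Hadoop.

```java
// Illustrative sketch only: SafeIntCompare is a hypothetical helper,
// not a class from Mahout or Hadoop.
final class SafeIntCompare {
  private SafeIntCompare() {}

  // The idiom from the snippet above: the sign of (a - b) is used as the
  // ordering. When the true difference does not fit in 32 bits, the
  // subtraction wraps around and the sign flips.
  static int bySubtraction(int a, int b) {
    return a - b;
  }

  // Overflow-safe alternative: compare the values directly and return
  // -1, 0, or 1 without doing any arithmetic on them.
  static int byComparison(int a, int b) {
    return a < b ? -1 : (a == b ? 0 : 1);
  }
}
```

For example, with a = Integer.MIN_VALUE and b = 1, bySubtraction wraps around to a positive number and claims a > b, while byComparison returns -1.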

> Modify IntPairWritable in LDA implementation to be binary comparable to 
> improve performance.
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-320
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-320
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Robin Anil
>            Priority: Minor
>         Attachments: MAHOUT-320.patch
>
>
> Per discussion with Robin, modifying o.a.m.clustering.lda.IntPairWritable to 
> be binary comparable will improve the performance of the comparison 
> operations during a sort because no marshaling will need to occur to compare 
> IntPairWritable instances.
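
As a rough sketch of what "binary comparable" could mean here, the code below compares two pairs straight from their serialized bytes without building any objects. It assumes each pair is written as two fixed-width 4-byte big-endian ints (the layout DataOutput.writeInt produces); the actual IntPairWritable layout may differ. RawIntPairCompare and its methods are hypothetical names, and the sketch deliberately avoids Hadoop classes so it stands alone.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: RawIntPairCompare is a hypothetical class.
// Assumed layout: two 4-byte big-endian ints, first then second.
final class RawIntPairCompare {
  private RawIntPairCompare() {}

  // Decode one big-endian int starting at the given offset.
  static int readInt(byte[] b, int off) {
    return ((b[off] & 0xff) << 24)
        | ((b[off + 1] & 0xff) << 16)
        | ((b[off + 2] & 0xff) << 8)
        | (b[off + 3] & 0xff);
  }

  // Order by the first int, then by the second, reading both directly
  // from the byte arrays. Uses explicit comparisons, not subtraction,
  // so large magnitudes cannot overflow.
  static int compare(byte[] b1, int s1, byte[] b2, int s2) {
    int first1 = readInt(b1, s1);
    int first2 = readInt(b2, s2);
    if (first1 != first2) {
      return first1 < first2 ? -1 : 1;
    }
    int second1 = readInt(b1, s1 + 4);
    int second2 = readInt(b2, s2 + 4);
    if (second1 != second2) {
      return second1 < second2 ? -1 : 1;
    }
    return 0;
  }

  // Helper for trying the sketch out: serialize a pair in the assumed
  // layout (ByteBuffer is big-endian by default).
  static byte[] encode(int first, int second) {
    return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
  }
}
```

In a real Hadoop job this logic would presumably live in a WritableComparator subclass overriding compare(byte[], int, int, byte[], int, int), as in the snippet quoted in the comment above.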

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
