[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756526#action_12756526
 ] 

Shashikant Kore commented on MAHOUT-165:
----------------------------------------

Since I couldn't apply Ted's patch to trunk, I tested only IntDoubleHash 
in isolation.  Performance-wise, it is as good as Colt.

However, the basic implementation has other issues.

1. The class exposes the internal index array instead of the keys. The internal 
array may have empty slots (marked with the value -1), which is not consistent 
with a typical hash implementation. As a side effect, the caller has to do 
extra work to skip the empty slots and check only the valid keys. 

2. The clone() method has a bug. Instead of copying the entire index and value 
arrays, it copies only as many elements as there are valid values in the map. 

3. There is no remove() method, although we don't need one right now. 

Of course, all of these issues are fixable.  But if we need something 
similar again, Colt will prove to be of great help. 
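
To illustrate points 1 and 2, here is a minimal sketch of how they could be 
addressed. It is not the IntDoubleHash from Ted's patch: the class name 
SketchIntDoubleHash, its methods, and the linear-probing layout are all 
assumptions made for illustration. Only the -1 empty-slot marker comes from 
the description above. remove() is left out, since deleting from a 
linear-probing table needs tombstones or re-insertion.

```java
import java.util.Arrays;

/** Hypothetical sketch; NOT the IntDoubleHash from the patch under review. */
final class SketchIntDoubleHash implements Cloneable {
  private static final int FREE = -1; // empty-slot marker, so keys must be >= 0

  private int[] index;     // keys; FREE where the slot is empty
  private double[] values; // values[i] belongs to index[i]
  private int size;        // number of occupied slots

  SketchIntDoubleHash(int capacity) {
    int n = Math.max(2, capacity);
    index = new int[n];
    values = new double[n];
    Arrays.fill(index, FREE);
  }

  /** Linear probing: returns the slot holding key, or the first free slot. */
  private int slotFor(int key) {
    int slot = key % index.length;
    while (index[slot] != FREE && index[slot] != key) {
      slot = (slot + 1) % index.length;
    }
    return slot;
  }

  void put(int key, double value) {
    if (key < 0) {
      throw new IllegalArgumentException("key must be non-negative");
    }
    if (2 * (size + 1) > index.length) {
      grow(); // stay at most half full so probing always finds a free slot
    }
    int slot = slotFor(key);
    if (index[slot] == FREE) {
      index[slot] = key;
      size++;
    }
    values[slot] = value;
  }

  /** Returns the stored value, or 0.0 for a missing key (sparse-vector default). */
  double get(int key) {
    if (key < 0) {
      return 0.0;
    }
    int slot = slotFor(key);
    return index[slot] == key ? values[slot] : 0.0;
  }

  /** Point 1: hand out only the real keys, never the raw slot array. */
  int[] keys() {
    int[] result = new int[size];
    int n = 0;
    for (int k : index) {
      if (k != FREE) {
        result[n++] = k;
      }
    }
    return result;
  }

  /** Point 2: copy both arrays in full, not just the first 'size' elements. */
  @Override
  public SketchIntDoubleHash clone() {
    try {
      SketchIntDoubleHash copy = (SketchIntDoubleHash) super.clone();
      copy.index = index.clone();
      copy.values = values.clone();
      return copy;
    } catch (CloneNotSupportedException e) {
      throw new AssertionError(e);
    }
  }

  /** Doubles the table and re-inserts every occupied slot. */
  private void grow() {
    int[] oldIndex = index;
    double[] oldValues = values;
    index = new int[oldIndex.length * 2];
    values = new double[index.length];
    Arrays.fill(index, FREE);
    for (int i = 0; i < oldIndex.length; i++) {
      if (oldIndex[i] != FREE) {
        int slot = slotFor(oldIndex[i]);
        index[slot] = oldIndex[i];
        values[slot] = oldValues[i];
      }
    }
  }
}
```

With this layout, a caller iterates over keys() without checking for -1, and 
a clone stays independent of the original because the occupied slots are 
scattered across the whole array rather than packed at the front.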

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, 
> mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The 
> present implementation of this hash map is not as efficient as some of the 
> other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of 
> Colt performs an order of magnitude better than OrderedIntDoubleMapping. 
> For iteration it is 2x slower, though. 
> Using Colt in SparseVector improved the performance of canopy generation. For 
> an experimental dataset, the current implementation takes 50 minutes. Using 
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in 
> running time. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
