[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Jake Mannix (JIRA) Tue, 17 Nov 2009 10:45:03 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779036#action_12779036
 ]


Jake Mannix commented on MAHOUT-165:
------------------------------------

And continuing on the topic of code adoption: Colt has no unit tests.   *eep*

So we might need to draft up a nice effort towards writing a bunch, once we get 
this incorporated.

I guess we can do this piece-by-piece - include the code first, but don't 
change any implementation to use Colt.  Then as we hook it into things, add 
more unit tests and verify that they work, and gradually get more test coverage 
over the adopted code.  We'd need to label it in the documentation as "no 
lifeguard present: for expert use only" until it was better tested in this 
case, because once it was available, people using Mahout could conceivably just 
start coding using Colt primitives even if we aren't directly linking to them 
at first in our own classes.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, 
> MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need primitives hash map for index and values. The 
> present implementation of this hash map is not as efficient as some of the 
> other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of 
>  Colt performance an order of magnitude better than OrderedIntDoubleMapping. 
> For iteration it is 2x slower, though. 
> Using Colt in Sparsevector improved performance of canopy generation. For an 
> experimental dataset, the current implementation takes 50 minutes. Using 
> Colt, reduces this duration to 19-20 minutes. That's 60% reduction in the 
> delay. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Reply via email to