[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Jake Mannix (JIRA) Tue, 17 Nov 2009 07:34:04 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778937#action_12778937
 ]


Jake Mannix commented on MAHOUT-165:
------------------------------------

If we're going to try out a patch which includes Colt, we really need the Colt 
source, properly cleaned of offending material, not just a jar, right?  

Similarly, as I mentioned above, cern.colt.matrix.* classes are pretty 
important, and can be included if a little care is taken in pulling out the 
hep.aida.* dependencies.  Shashi, your colt.jar doesn't have 
cern.colt.matrix included, does it?

If I post a patch with the entire (cleaned) source tree of colt, can we apply 
it?  What is the procedure for doing this kind of thing?  Are we keeping the 
package hierarchy intact, or should we do a swap of cern.colt to 
org.mahout.colt?  If this kind of thing is done, we'll want it to live in its 
own maven sub-module in here, I would imagine.

On the topic of dependencies, colt internally has a dependency on Doug Lea's 
original edu.oswego concurrent library, which is public domain, so that's ok, 
but should be upgraded to java.util.concurrent.  Unfortunately, not all classes 
in edu.oswego.concurrent have counterparts in java.util.concurrent yet: the 
fork/join framework doesn't make it into core java until 1.7, and is used 
inside of colt... so there's a dependency on concurrent.jar... does the apache 
maven repo have concurrent 1.3.4 in it?  ibiblio does appear to...
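
For comparison, here is a minimal sketch of what a fork/join computation looks 
like against java.util.concurrent once it is available (Java 7 or later).  The 
class name ParallelSum and the threshold are made up for illustration; this is 
not code from colt or from edu.oswego concurrent, just the general 
divide-and-conquer shape that such code would be ported to.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical example: sums a double[] by recursive splitting. */
final class ParallelSum extends RecursiveTask<Double> {
    // Below this many elements, sum sequentially (arbitrary illustrative cutoff).
    private static final int THRESHOLD = 1024;

    private final double[] data;
    private final int lo;
    private final int hi; // exclusive

    ParallelSum(double[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= THRESHOLD) {
            double sum = 0.0;
            for (int i = lo; i < hi; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        ParallelSum left = new ParallelSum(data, lo, mid);
        ParallelSum right = new ParallelSum(data, mid, hi);
        left.fork();                       // run the left half asynchronously
        double rightSum = right.compute(); // compute the right half in this thread
        return left.join() + rightSum;
    }

    /** Convenience entry point: creates a pool, runs the task, shuts the pool down. */
    static double sum(double[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new ParallelSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling ParallelSum.sum(array) returns the total.  On a pre-1.7 JVM none of 
these classes exist in core java, which is why the concurrent.jar dependency 
would have to stay for now.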

Sorry to make things complicated - colt has a lot more than just a SparseVector 
implementation, so if we're going to include it, we should make sure to get the 
benefit of all it has to offer.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, 
> MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for index and values. The 
> present implementation of this hash map is not as efficient as some of the 
> other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash of 
> Colt performs an order of magnitude better than OrderedIntDoubleMapping. 
> For iteration it is 2x slower, though. 
> Using Colt in SparseVector improved performance of canopy generation. For an 
> experimental dataset, the current implementation takes 50 minutes. Using 
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the 
> running time. 
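
To make the idea of a primitive hash map for a sparse vector concrete, here is 
a minimal sketch of an int-to-double map using open addressing with linear 
probing.  It is neither Colt's implementation nor Mahout's; the class name 
IntDoubleOpenHashMap, the sentinel key, and the 0.5 load factor are all 
assumptions chosen for illustration.  The point is that keys and values sit in 
plain int[] and double[] arrays, so get/set never box an Integer or a Double.

```java
import java.util.Arrays;

/** Hypothetical sketch: int -> double map, open addressing, linear probing, no removal. */
final class IntDoubleOpenHashMap {
    // Sentinel marking an unused slot; this sketch assumes it is never used as a key.
    private static final int FREE = Integer.MIN_VALUE;

    private int[] keys;
    private double[] values;
    private int size;

    IntDoubleOpenHashMap(int expectedSize) {
        int capacity = 16;
        while (capacity < Math.max(expectedSize, 1) * 2) {
            capacity <<= 1; // keep capacity a power of two, load factor <= 0.5
        }
        keys = new int[capacity];
        values = new double[capacity];
        Arrays.fill(keys, FREE);
    }

    /** Returns the slot holding key, or the free slot where it would be inserted. */
    private int slotFor(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9; // multiplicative scrambling of the index
        h ^= h >>> 16;
        int i = h & mask;
        while (keys[i] != FREE && keys[i] != key) {
            i = (i + 1) & mask;   // linear probe; terminates because the table is never full
        }
        return i;
    }

    /** Returns the stored value, or 0.0 for an absent index (sparse vector semantics). */
    double get(int key) {
        int i = slotFor(key);
        return keys[i] == key ? values[i] : 0.0;
    }

    void set(int key, double value) {
        if (key == FREE) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slotFor(key);
        if (keys[i] == FREE) {
            if ((size + 1) * 2 > keys.length) {
                grow();
                i = slotFor(key); // slot positions change after resizing
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int size() {
        return size;
    }

    /** Doubles the table and reinserts every entry. */
    private void grow() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[oldKeys.length * 2];
        Arrays.fill(keys, FREE);
        for (int k = 0; k < oldKeys.length; k++) {
            if (oldKeys[k] != FREE) {
                int j = slotFor(oldKeys[k]);
                keys[j] = oldKeys[k];
                values[j] = oldValues[k];
            }
        }
    }
}
```

Iterating such a table means scanning every slot, including the empty ones, 
which is one plausible reason iteration could be slower than over a compact 
ordered array even when get/set are much faster.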

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
