[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777082#action_12777082 ]
Jake Mannix commented on MAHOUT-165:
------------------------------------

Ok then, let's try out Colt, unless we have a more permissive policy here about MTJ than the commons-math folks have: they didn't want MTJ because using it required including a jar file of the output of f2j translations of some Fortran code... which is ok for us as long as it's Apache-compatible, since we don't have the hard "no external dependencies" requirement that they have.

What Shashi wrote before, when he attached the modified Colt jar, was this:

bq. Jar for Colt after removing the LGPL code of hep.aida and the dependent classes. The classes in colt.matrix.* are removed as they require hep.aida.

I actually stripped the hep.aida.* dependencies out of even the colt.matrix.* classes in Colt on my local git repo, which keeps pretty much all of the functionality intact. I can make an updated patch with the full source code for that, so that we can include it instead of just having a jar.

Do we want to try comparing both MTJ and Colt?

Also: do we think our linear API is "complete" enough to solidify on as a wrapper for whatever is plugged in underneath? Some of the changes which have been discussed in other tickets and on the list are:

* pulling Writable off of the interface, so that not every impl is coupled so tightly to Hadoop, then wrapping it with a Writable wrapper / subclass to add that functionality
* the {{double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map)}} and {{double aggregate(Vector other, BinaryDoubleFunction aggregator, BinaryDoubleFunction map)}} methods for abstracting away inner products and norms. Not necessary, but very easily implemented in AbstractVector so that nobody needs to worry about these methods if they don't like programming that way.
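To make the first bullet concrete, here is a minimal sketch of the decoupling idea: a plain Vector interface with no Hadoop dependency, plus a wrapper that adds Writable. The names (`VectorWritable`, `DenseVector`) and the serialization format are assumptions for illustration, not Mahout's actual code; the real Writable interface lives in org.apache.hadoop.io and is re-declared here only so the sketch compiles standalone.

```java
import java.io.*;

// Stand-in for org.apache.hadoop.io.Writable so the sketch has no Hadoop dependency.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A math-only Vector interface: no serialization concerns here.
interface Vector {
    int size();
    double get(int index);
    void set(int index, double value);
}

class DenseVector implements Vector {
    private final double[] values;
    DenseVector(int size) { values = new double[size]; }
    public int size() { return values.length; }
    public double get(int index) { return values[index]; }
    public void set(int index, double value) { values[index] = value; }
}

// Adds Writable to any Vector impl without the Vector interface knowing about Hadoop.
class VectorWritable implements Writable {
    private Vector vector;
    VectorWritable(Vector vector) { this.vector = vector; }
    Vector get() { return vector; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(vector.size());
        for (int i = 0; i < vector.size(); i++) {
            out.writeDouble(vector.get(i));
        }
    }

    public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        vector = new DenseVector(size);
        for (int i = 0; i < size; i++) {
            vector.set(i, in.readDouble());
        }
    }
}
```

The point is that only code running inside Hadoop jobs ever touches VectorWritable; everything else depends on the plain Vector interface.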
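And for the second bullet, a sketch of how the two {{aggregate}} signatures would subsume inner products and norms. The loop bodies and the function interfaces here are my reading of the proposal, not the actual AbstractVector implementation; arrays stand in for Vector to keep it self-contained.

```java
// Assumed shapes of the function interfaces from the proposal.
interface UnaryFunction { double apply(double x); }
interface BinaryDoubleFunction { double apply(double a, double b); }

class AggregateSketch {
    // aggregate(aggregator, map): fold map(x_i) over all elements.
    static double aggregate(double[] v, BinaryDoubleFunction agg, UnaryFunction map) {
        double result = map.apply(v[0]);
        for (int i = 1; i < v.length; i++) {
            result = agg.apply(result, map.apply(v[i]));
        }
        return result;
    }

    // aggregate(other, aggregator, map): fold map(x_i, y_i) pairwise.
    static double aggregate(double[] v, double[] w,
                            BinaryDoubleFunction agg, BinaryDoubleFunction map) {
        double result = map.apply(v[0], w[0]);
        for (int i = 1; i < v.length; i++) {
            result = agg.apply(result, map.apply(v[i], w[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        double[] y = {4, 5, 6};
        BinaryDoubleFunction plus = (a, b) -> a + b;
        // dot product = aggregate(other, plus, times)
        double dot = AggregateSketch.aggregate(x, y, plus, (a, b) -> a * b);   // 32.0
        // 2-norm = sqrt(aggregate(plus, square))
        double norm = Math.sqrt(AggregateSketch.aggregate(x, plus, a -> a * a));
        System.out.println(dot + " " + norm);
    }
}
```

As the comment says, a default implementation in AbstractVector would make these purely opt-in for callers.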
> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitives hash map for indices and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of Colt performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
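For readers unfamiliar with why a primitives hash beats OrderedIntDoubleMapping on get/set, here is a toy open-addressing int-to-double map in the style of Colt's primitive maps. This is an illustrative sketch, not Colt's code: get and put are O(1) expected, whereas a sorted parallel-array scheme like OrderedIntDoubleMapping pays a binary search on get and an O(n) element shift on insert.

```java
// Toy open-addressing int -> double map with linear probing (a sketch of the
// technique behind Colt's primitive hash maps, not their implementation).
class IntDoubleHashSketch {
    private int[] keys;
    private double[] values;
    private boolean[] used;
    private int size;

    IntDoubleHashSketch(int capacity) {
        // Round up to a power of two so we can mask instead of mod.
        int cap = Integer.highestOneBit(Math.max(capacity, 16) * 2);
        keys = new int[cap];
        values = new double[cap];
        used = new boolean[cap];
    }

    // Find the slot holding 'key', or the empty slot where it would go.
    private int slot(int key) {
        int h = (key * 0x9E3779B9) & (keys.length - 1);
        while (used[h] && keys[h] != key) {
            h = (h + 1) & (keys.length - 1);   // linear probing
        }
        return h;
    }

    void put(int key, double value) {
        int h = slot(key);
        if (!used[h]) { used[h] = true; keys[h] = key; size++; }
        values[h] = value;
        if (size * 2 > keys.length) grow();    // keep load factor <= 0.5
    }

    double get(int key) {
        int h = slot(key);
        return used[h] ? values[h] : 0.0;      // 0.0 is the sparse default
    }

    private void grow() {
        int[] ok = keys; double[] ov = values; boolean[] ou = used;
        keys = new int[ok.length * 2];
        values = new double[ok.length * 2];
        used = new boolean[ok.length * 2];
        size = 0;
        for (int i = 0; i < ok.length; i++) {
            if (ou[i]) put(ok[i], ov[i]);
        }
    }
}
```

The flip side, visible above, is iteration: walking a half-empty table with entries in hash order is what makes the hash map roughly 2x slower to iterate than the densely packed ordered arrays.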