[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Jake Mannix (JIRA) Tue, 17 Nov 2009 11:11:03 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779050#action_12779050 ]


Jake Mannix commented on MAHOUT-165:
------------------------------------

bq. The colt tree could also be put into a separate module that lives alongside 
core, util, examples, built independently as a part of the maven build - 
optionally at first, activated via a build profile.

+1 - I like this.

bq. As far as package names, would it be better to map cern.colt.* to 
org.apache.mahout.colt.* ? - that way there's no potential for the old being 
confused for the new in builds, etc.

I personally think this is the way to go, but does it reduce confusion, or 
increase it?  People who are used to using colt will see familiar classes, but 
in strange places.  If we're really going to overhaul the whole library over 
time, this makes sense, I guess.

bq. Would the cern.jet.* libraries be included as well?

In the vivisection I've performed (locally) on the last updated version of 
Colt, the cern.jet.* packages could be preserved without running into 
any licensing or dependency problems, so I've kept them, but they do duplicate 
some work we already have: there's a ton of random distributions, stats for 
computing quantiles, a MersenneTwister impl, etc.

We could include them at first, and then do some perf testing / api twiddling 
over time to see which impls we want to keep where there are duplicates?

The only parts of colt which are removed are hep.aida.* and corejava.* (the 
latter is LGPL, but is not needed).  At the top level, what's left are 
cern.colt, cern.jet, and cern.clhep, but the latter can be removed also, 
because I'm pretty sure Mahout doesn't need to know the double value of 
Planck's constant (besides, as a former theorist, on principle I should note 
that the value of hbar is definitively (double)1, with no units, in natural 
units).

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, 
> MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The 
> present implementation of this map is not as efficient as some of the 
> implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash 
> map of Colt performs an order of magnitude better than OrderedIntDoubleMapping. 
> For iteration it is 2x slower, though. 
> Using Colt in SparseVector improved the performance of canopy generation. For 
> an experimental dataset, the current implementation takes 50 minutes. Using 
> Colt reduces this to 19-20 minutes. That's a 60% reduction in running 
> time. 
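
For readers unfamiliar with the term, a "primitive hash map" here means an int -> double map stored in plain arrays, with no boxing of keys or values. The following is only a minimal sketch of that idea, using open addressing with linear probing. It is not Colt's code and not what any of the attached patches do; the class name IntDoubleHashSketch, the FREE sentinel, and the 0.5 load factor are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: open-addressing int -> double map with linear probing. */
final class IntDoubleHashSketch {
    // Sentinel marking an empty slot; this one key value cannot be stored.
    private static final int FREE = Integer.MIN_VALUE;

    private int[] keys;
    private double[] values;
    private int size;

    IntDoubleHashSketch(int capacity) {
        int cap = 2;
        while (cap < capacity) cap <<= 1; // power of two, so probing can use a bit mask
        keys = new int[cap];
        values = new double[cap];
        Arrays.fill(keys, FREE);
    }

    private int slot(int key) {
        int h = key * 0x9E3779B9; // cheap multiplicative scramble of the index
        return (h ^ (h >>> 16)) & (keys.length - 1);
    }

    /** Returns the stored value, or 0.0 for an absent index (sparse-vector semantics). */
    double get(int key) {
        int i = slot(key);
        while (keys[i] != FREE) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & (keys.length - 1);
        }
        return 0.0;
    }

    void put(int key, double value) {
        if (key == FREE) throw new IllegalArgumentException("reserved key");
        if ((size + 1) * 2 > keys.length) grow(); // keep the table at most half full
        int i = slot(key);
        while (keys[i] != FREE && keys[i] != key) {
            i = (i + 1) & (keys.length - 1);
        }
        if (keys[i] == FREE) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    private void grow() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[keys.length];
        Arrays.fill(keys, FREE);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != FREE) put(oldKeys[i], oldValues[i]);
        }
    }

    int size() {
        return size;
    }
}
```

A lookup or update touches a handful of array slots on average, which fits the get/set speedup reported above. Iteration, on the other hand, has to walk the whole table including the empty slots, which is one plausible reason for the slower iteration the reporter measured; that explanation is a guess, not something measured in this issue.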

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
