[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

Ted Dunning (JIRA) Thu, 12 Nov 2009 09:57:16 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777089#action_12777089
 ]


Ted Dunning commented on MAHOUT-165:
------------------------------------


bq. pulling Writable off of the interface, so that not every impl is hooked 
into such a coupling to Hadoop, then wrapping it with a Writable wrapper / 
subclass to add that functionality

+1

Same thing should be done with row and column labels.
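
To make the wrapper idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: WritableLike is a local stand-in with the same two methods as Hadoop's Writable (so the sketch compiles without Hadoop), and PlainVector, DenseArrayVector and VectorWritableSketch are hypothetical names, not existing Mahout classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in mirroring the two methods of Hadoop's Writable,
// declared locally so this sketch compiles without Hadoop on the classpath.
interface WritableLike {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical core interface: no serialization methods at all.
interface PlainVector {
  int size();
  double get(int index);
  void set(int index, double value);
}

// Trivial dense implementation, used only to make the sketch runnable.
final class DenseArrayVector implements PlainVector {
  private final double[] values;

  DenseArrayVector(int size) {
    values = new double[size];
  }

  @Override public int size() { return values.length; }
  @Override public double get(int index) { return values[index]; }
  @Override public void set(int index, double value) { values[index] = value; }
}

// The wrapper: only this class knows about the serialization contract.
final class VectorWritableSketch implements WritableLike {
  private PlainVector vector;

  VectorWritableSketch() {}

  VectorWritableSketch(PlainVector vector) {
    this.vector = vector;
  }

  PlainVector get() {
    return vector;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(vector.size());
    for (int i = 0; i < vector.size(); i++) {
      out.writeDouble(vector.get(i));
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    PlainVector read = new DenseArrayVector(size);
    for (int i = 0; i < size; i++) {
      read.set(i, in.readDouble());
    }
    vector = read;
  }

  // Serializes and deserializes in memory; handy for checking the wrapper.
  static PlainVector roundTrip(PlainVector v) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new VectorWritableSketch(v).write(new DataOutputStream(bytes));
      VectorWritableSketch copy = new VectorWritableSketch();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy.get();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The same shape would presumably work for row and column labels: keep them out of the core interface and serialize them in the wrapper.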

Not sure how to handle matrices of indefinite dimension, which are probably 
important for some of what we do.  Perhaps just declare them as very, very 
large in a wrapper.

bq. the double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) 
and double aggregate(Vector other, BinaryDoubleFunction aggregator, 
BinaryDoubleFunction map) methods for abstracting away inner products and 
norms.  Not necessary, but very easily implemented in AbstractVector so that 
nobody needs to worry about these methods if they don't like programming that 
way.

These are very handy functions.  Row and/or column aggregator functions are 
also important.
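
For illustration, a rough sketch of how the two aggregate methods could sit in an abstract base class so that dot and norm come for free. The interface names follow the quote, but their apply signatures, the class names AbstractVectorSketch and ArrayBackedVector, and the choice to return zero for an empty vector are all assumptions.

```java
// Hypothetical shapes for the functional interfaces named in the quote;
// the apply signatures are assumptions.
interface UnaryFunction {
  double apply(double x);
}

interface BinaryDoubleFunction {
  double apply(double a, double b);
}

abstract class AbstractVectorSketch {
  abstract int size();
  abstract double get(int index);

  // Maps every element, then folds the mapped values from left to right.
  double aggregate(BinaryDoubleFunction aggregator, UnaryFunction map) {
    if (size() == 0) {
      return 0.0; // assumption: an empty vector aggregates to zero
    }
    double result = map.apply(get(0));
    for (int i = 1; i < size(); i++) {
      result = aggregator.apply(result, map.apply(get(i)));
    }
    return result;
  }

  // Maps each pair of corresponding elements, then folds the mapped values.
  double aggregate(AbstractVectorSketch other,
                   BinaryDoubleFunction aggregator,
                   BinaryDoubleFunction map) {
    if (other.size() != size()) {
      throw new IllegalArgumentException("size mismatch");
    }
    if (size() == 0) {
      return 0.0; // same assumption as above
    }
    double result = map.apply(get(0), other.get(0));
    for (int i = 1; i < size(); i++) {
      result = aggregator.apply(result, map.apply(get(i), other.get(i)));
    }
    return result;
  }

  // Inner product expressed through the two-vector aggregate.
  double dot(AbstractVectorSketch other) {
    return aggregate(other, (a, b) -> a + b, (a, b) -> a * b);
  }

  // Euclidean norm expressed through the one-vector aggregate.
  double norm2() {
    return Math.sqrt(aggregate((a, b) -> a + b, x -> x * x));
  }
}

// Minimal concrete vector, present only so the sketch can be exercised.
final class ArrayBackedVector extends AbstractVectorSketch {
  private final double[] values;

  ArrayBackedVector(double... values) {
    this.values = values.clone();
  }

  @Override int size() { return values.length; }
  @Override double get(int index) { return values[index]; }
}
```

Other reductions fall out the same way; for example, aggregate(Math::max, Math::abs) gives the largest absolute value.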

Colt gets a big boost in speed by testing in the implementation for special 
combinations of these functional constructs.  That lets it implement dot and 
sum with bespoke code and avoid the function call overhead (with the 
associated risk of the JVM not inlining enough).
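
A sketch of the general idea, not Colt's actual code: if the common functions are shared singleton objects, the implementation can recognize a combination by identity and switch to a hand-written loop. DoubleBinaryOp, FastPathVector, PLUS, MULT and the fastPathCalls counter are all hypothetical names made up for this example.

```java
// Hypothetical functional interface for this sketch.
interface DoubleBinaryOp {
  double apply(double a, double b);
}

final class FastPathVector {
  // Shared singleton function objects, so the implementation can
  // recognize them by reference identity.
  static final DoubleBinaryOp PLUS = (a, b) -> a + b;
  static final DoubleBinaryOp MULT = (a, b) -> a * b;

  private final double[] values;
  int fastPathCalls; // bookkeeping for the demonstration only

  FastPathVector(double... values) {
    this.values = values.clone();
  }

  double aggregate(FastPathVector other, DoubleBinaryOp aggregator, DoubleBinaryOp map) {
    if (other.values.length != values.length) {
      throw new IllegalArgumentException("size mismatch");
    }
    // Special combination: plus over mult is a dot product, so use the
    // bespoke loop and skip the per-element function calls.
    if (aggregator == PLUS && map == MULT) {
      fastPathCalls++;
      return dot(other);
    }
    // General case: two function calls per element.
    if (values.length == 0) {
      return 0.0; // assumption: an empty vector aggregates to zero
    }
    double result = map.apply(values[0], other.values[0]);
    for (int i = 1; i < values.length; i++) {
      result = aggregator.apply(result, map.apply(values[i], other.values[i]));
    }
    return result;
  }

  // Hand-written inner product with no indirection.
  private double dot(FastPathVector other) {
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
      sum += values[i] * other.values[i];
    }
    return sum;
  }
}
```

The check is by identity, so a caller who passes an equivalent but freshly created function still gets the right answer, just through the slower general loop.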

Another big change is that Colt makes extensive use of view semantics.  I think 
that this is a really good idea, but it does differ a bit from what we have 
done so far.
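
To show what view semantics means in practice, here is a small sketch in which a sub-range of a vector shares storage with its parent, so writes through the view are visible in the original. ViewVector and its methods are hypothetical names for this example, not Colt's or Mahout's API.

```java
import java.util.Arrays;

final class ViewVector {
  private final double[] storage; // shared between a vector and its views
  private final int offset;
  private final int length;

  ViewVector(int size) {
    this(new double[size], 0, size);
  }

  private ViewVector(double[] storage, int offset, int length) {
    this.storage = storage;
    this.offset = offset;
    this.length = length;
  }

  int size() {
    return length;
  }

  double get(int index) {
    checkIndex(index);
    return storage[offset + index];
  }

  void set(int index, double value) {
    checkIndex(index);
    storage[offset + index] = value;
  }

  // Returns a window onto the same storage; no elements are copied.
  ViewVector viewPart(int start, int width) {
    if (start < 0 || width < 0 || start + width > length) {
      throw new IndexOutOfBoundsException("start " + start + ", width " + width);
    }
    return new ViewVector(storage, offset + start, width);
  }

  // Returns an independent copy, for contrast with viewPart.
  ViewVector copy() {
    double[] fresh = Arrays.copyOfRange(storage, offset, offset + length);
    return new ViewVector(fresh, 0, length);
  }

  private void checkIndex(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index);
    }
  }
}
```

With this, v.viewPart(1, 3).set(0, 7.0) changes v.get(1), while a vector obtained from v.copy() is unaffected. That aliasing is the behavioural difference callers would need to get used to.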




> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, 
> MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The 
> present implementation of this hash map is not as efficient as some of the 
> other implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash 
> of Colt performs an order of magnitude better than OrderedIntDoubleMapping. 
> For iteration it is 2x slower, though. 
> Using Colt in SparseVector improved the performance of canopy generation. 
> For an experimental dataset, the current implementation takes 50 minutes. 
> Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction 
> in running time. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
