[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761018#action_12761018
 ] 

Jake Mannix commented on MAHOUT-165:
------------------------------------

Ted, some notes on your patch: 

  * with the two different specialized subclasses of SparseVector (HashVector, 
optimized for random access, and OrderedIntDoubleVector, optimized for 
iteration speed) being created, it seems like utilities such as the TFDFMapper 
should be able to choose which impl to use, instead of being hardcoded to use 
one or the other.

  * also, your current implementation of IntDoubleHash appears to sometimes 
throw "java.lang.RuntimeException: Impossible confusion in IntDoubleHash", 
which sounds troubling. :)
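
To make the first point concrete, here is a rough sketch of letting the caller 
pick the implementation rather than hardcoding it. Everything named here 
(SimpleVector, AccessPattern, HashBackedVector, OrderedVector, create) is 
hypothetical and is not Mahout's API or the classes in the patch. Boxed 
java.util maps stand in for the primitive collections only to keep the 
example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class VectorChoiceSketch {

    /** Hypothetical minimal vector interface; not Mahout's Vector API. */
    public interface SimpleVector {
        void set(int index, double value);
        double get(int index);
        int nonZeroCount();
    }

    /** Hypothetical hint that a caller such as a mapper would pass in. */
    public enum AccessPattern { RANDOM_ACCESS, SEQUENTIAL }

    /** Shared logic; the backing map decides the performance trade-off. */
    abstract static class MapBackedVector implements SimpleVector {
        private final Map<Integer, Double> values;

        MapBackedVector(Map<Integer, Double> values) {
            this.values = values;
        }

        @Override
        public void set(int index, double value) {
            if (value == 0.0) {
                values.remove(index);      // keep the vector sparse
            } else {
                values.put(index, value);
            }
        }

        @Override
        public double get(int index) {
            return values.getOrDefault(index, 0.0);
        }

        @Override
        public int nonZeroCount() {
            return values.size();
        }
    }

    /** Stand-in for a hash-based vector: cheap random get/set. */
    public static final class HashBackedVector extends MapBackedVector {
        public HashBackedVector() {
            super(new HashMap<Integer, Double>());
        }
    }

    /** Stand-in for an ordered vector: entries kept sorted by index. */
    public static final class OrderedVector extends MapBackedVector {
        public OrderedVector() {
            super(new TreeMap<Integer, Double>());
        }
    }

    /** The caller states its access pattern instead of naming a class. */
    public static SimpleVector create(AccessPattern pattern) {
        switch (pattern) {
            case RANDOM_ACCESS:
                return new HashBackedVector();
            case SEQUENTIAL:
                return new OrderedVector();
            default:
                throw new IllegalArgumentException("Unknown pattern: " + pattern);
        }
    }

    public static void main(String[] args) {
        SimpleVector v = create(AccessPattern.RANDOM_ACCESS);
        v.set(42, 1.5);
        System.out.println(v.getClass().getSimpleName() + " holds " + v.get(42));
    }
}
```

A utility would then take the AccessPattern (or a factory) as a configuration 
parameter, so switching implementations does not mean touching the utility.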

Attached is my current attempt at reviving this patch.  It currently has 
failing tests.  If it's easier to just svn up your own version, feel free to 
ignore it, but I thought applying one which already compiles might help a 
little.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, 
> MAHOUT-165-updated.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The 
> present implementation of this hash map is not as efficient as some of the 
> implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, the primitive hash 
> map of Colt performs an order of magnitude better than 
> OrderedIntDoubleMapping. For iteration it is 2x slower, though. 
> Using Colt in SparseVector improved the performance of canopy generation. 
> For an experimental dataset, the current implementation takes 50 minutes. 
> Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction in 
> running time. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
