[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756790#action_12756790
 ] 

Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. 3. We don't need it right now, but there is no remove() method.

remove will be a PITA to get right.  The problem is with collisions and the 
double hashing.  When you remove something, you don't know which other keys 
may have collided with the key you are removing.  That means that you need to 
leave a marker behind so that later searches still treat that slot as occupied 
and keep probing past it.  Because those markers still count toward how full 
the table is, repeated insert/remove/insert will ultimately cause the array 
to resize itself.

I would propose adding a formerly occupied mark (-2) alongside the current 
empty index mark (-1).  Then the scanning would have to be clever enough to 
treat the two differently: a lookup can stop at an empty slot, but it has to 
keep probing past a formerly occupied one.
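
To make the two-marker idea concrete, here is a minimal sketch in Java.  It is 
not the code from any of the attached patches, nor Colt's or Trove's 
implementation.  The class name TombstoneIntDoubleMap, the hash mixing 
constant, the 3/4 load threshold and the power-of-two capacity are all 
assumptions made for illustration.  Keys are assumed to be non-negative 
vector indices, which is what frees -1 and -2 for use as marks.

```java
import java.util.Arrays;

/** Hypothetical sketch: int->double open-addressed map, double hashing, two slot marks. */
class TombstoneIntDoubleMap {
    private static final int EMPTY = -1;    // never used: a lookup may stop here
    private static final int REMOVED = -2;  // formerly occupied: a lookup must keep probing

    private int[] keys;
    private double[] values;
    private int live;  // slots holding a real key
    private int used;  // live slots plus REMOVED marks

    TombstoneIntDoubleMap(int capacity) {
        // Round up to a power of two (assumption) so any odd step visits every slot.
        int cap = Integer.highestOneBit(Math.max(4, capacity) - 1) << 1;
        keys = new int[cap];
        values = new double[cap];
        Arrays.fill(keys, EMPTY);
    }

    private static int mix(int key) { return key * 0x9E3779B9; }
    private int start(int key) { return mix(key) & (keys.length - 1); }
    private int step(int key) { return ((mix(key) >>> 16) | 1) & (keys.length - 1); }

    /** Returns the slot holding key, or -1 if the key is absent. */
    private int find(int key) {
        if (key < 0) return -1;
        int mask = keys.length - 1;
        int i = start(key);
        int s = step(key);
        for (int n = 0; n < keys.length; n++) {
            if (keys[i] == EMPTY) return -1;  // a never-used slot ends the search
            if (keys[i] == key) return i;     // REMOVED slots are simply skipped
            i = (i + s) & mask;
        }
        return -1;
    }

    double get(int key) {
        int i = find(key);
        return i < 0 ? 0.0 : values[i];  // absent entries read as 0, as in a sparse vector
    }

    void put(int key, double value) {
        if (key < 0) throw new IllegalArgumentException("keys must be non-negative");
        int i = find(key);
        if (i >= 0) { values[i] = value; return; }
        // REMOVED marks count as used, so insert/remove churn eventually forces a rebuild.
        if ((used + 1) * 4 > keys.length * 3) rebuild();
        insertAbsent(key, value);
    }

    boolean remove(int key) {
        int i = find(key);
        if (i < 0) return false;
        keys[i] = REMOVED;  // keep colliding keys further along the probe path reachable
        live--;
        return true;
    }

    /** Stores a key known to be absent in the first EMPTY or REMOVED slot on its path. */
    private void insertAbsent(int key, double value) {
        int mask = keys.length - 1;
        int i = start(key);
        int s = step(key);
        while (keys[i] >= 0) i = (i + s) & mask;
        if (keys[i] == EMPTY) used++;  // reusing a REMOVED slot leaves 'used' unchanged
        keys[i] = key;
        values[i] = value;
        live++;
    }

    /** Re-inserts the live keys, dropping all REMOVED marks; grows only if really full. */
    private void rebuild() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        int cap = (live * 2 >= oldKeys.length) ? oldKeys.length * 2 : oldKeys.length;
        keys = new int[cap];
        values = new double[cap];
        Arrays.fill(keys, EMPTY);
        live = 0;
        used = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] >= 0) insertAbsent(oldKeys[j], oldValues[j]);
        }
    }

    int size() { return live; }
    int capacity() { return keys.length; }
}
```

One design choice in this sketch: when the table fills up mostly with 
formerly occupied marks, it is rebuilt at the same size rather than doubled, 
so heavy insert/remove churn costs a periodic rehash but does not make the 
array grow without bound.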



> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165.patch, 
> mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The 
> present implementation of this hash map is not as efficient as some of the 
> implementations in non-Apache projects. 
> In an experiment, I found that, for get/set operations, Colt's primitive 
> hash map performs an order of magnitude better than OrderedIntDoubleMapping. 
> For iteration it is 2x slower, though. 
> Using Colt in SparseVector improved the performance of canopy generation. 
> For an experimental dataset, the current implementation takes 50 minutes. 
> Using Colt reduces this to 19-20 minutes. That's a 60% reduction in running 
> time. 

