[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779139#action_12779139 ]
Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. Well, I'm not sure how much of the "making a whole ASF project" overhead is necessary just yet (given how much work goes into a new project), but at least having it live someplace like google-code would give it that option.

Grant,

In light of this, can you clarify your comment about Google Code? Were you really thinking of forking Colt onto Google Code, then making it a TLP in Apache? Or did you imagine that forking it onto Google Code would be done in order to establish provenance of the code before bringing it under Mahout?

Would importing it into Mahout and then budding it out at the right time be a viable alternative to that? Can a sub-project have a sub-sub-project? (:-) in case you didn't notice)

Making Colt a TLP or a commons project is an attractive long-run idea, but getting it into Mahout now is nicer.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch,
> MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of
> Colt performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For an
> experimental dataset, the current implementation takes 50 minutes.
> Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction in
> running time.