[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777050#action_12777050 ]
Ted Dunning commented on MAHOUT-165:
------------------------------------

My issues with Colt (which I used for quite some time) were probably either remediable or irrelevant.

The remediable problem was that the API was opaque for newcomers and very difficult to extend with new matrix implementations. If we take Colt as a starting point and fix some of the extension and opacity issues, then this problem goes away.

My second issue is that more modern libraries like MTJ can achieve about 4x the raw performance of Colt. As Grant rightly points out, that probably doesn't matter to us right away since the goal here is scaling rather than raw hot-iron performance on a single box. Moreover, as Grant also points out, we will have a pluggable interface which should allow us to switch if the commons math guys ever come around.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of
> Colt performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved performance of canopy generation. For an
> experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
> running time.
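For readers unfamiliar with the data structure under discussion, here is a minimal sketch of a primitive int-to-double hash map of the kind a sparse vector could sit on. It is an illustration only: the class name IntDoubleHashSketch and all of its methods are hypothetical, and it is neither Colt's implementation nor the code in any of the attached patches. It assumes non-negative vector indices and uses open addressing with linear probing. Get and set touch only a few slots, while iteration has to scan the whole backing array including the empty slots, which is one plausible reason a hash can win on get/set and still lose on iteration.

```java
import java.util.Arrays;

/** Hypothetical sketch: an open-addressing int -> double map for sparse vector entries. */
final class IntDoubleHashSketch {
    private static final int FREE = -1; // empty-slot marker; indices are assumed non-negative
    private int[] keys;
    private double[] values;
    private int size;

    IntDoubleHashSketch(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // power-of-two capacity, load factor kept at or below 0.5
        }
        keys = new int[capacity];
        values = new double[capacity];
        Arrays.fill(keys, FREE);
    }

    /** Returns the slot holding key, or the free slot where key would be inserted. */
    private int slotFor(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;          // multiplicative scrambling of the index
        int slot = (h ^ (h >>> 16)) & mask;
        while (keys[slot] != FREE && keys[slot] != key) {
            slot = (slot + 1) & mask;      // linear probing
        }
        return slot;
    }

    /** Missing entries read as 0.0, as in a sparse vector. */
    double get(int key) {
        int slot = slotFor(key);
        return keys[slot] == key ? values[slot] : 0.0;
    }

    void set(int key, double value) {
        if (key < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        int slot = slotFor(key);
        if (keys[slot] == FREE) {
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
        if (size * 2 > keys.length) {
            grow();
        }
    }

    private void grow() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[keys.length];
        Arrays.fill(keys, FREE);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != FREE) {
                set(oldKeys[i], oldValues[i]); // re-insert into the larger table
            }
        }
    }

    /** Iteration visits every slot, occupied or not. */
    double sum() {
        double total = 0.0;
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] != FREE) {
                total += values[i];
            }
        }
        return total;
    }

    int size() {
        return size;
    }
}
```

Because both keys and values live in primitive arrays, no Integer or Double objects are allocated per entry, which is the main difference from a java.util.HashMap<Integer, Double>.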