[ https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184068#comment-15184068 ]
ASF GitHub Bot commented on MAHOUT-1640: ---------------------------------------- Github user dlyubimov commented on the pull request: https://github.com/apache/mahout/pull/81#issuecomment-193518798 @vigna i apologize it took so long. it completely went off the radar. > Better collections would significantly improve vector-operation speed > --------------------------------------------------------------------- > > Key: MAHOUT-1640 > URL: https://issues.apache.org/jira/browse/MAHOUT-1640 > Project: Mahout > Issue Type: Improvement > Components: collections > Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version > 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 > x86_64 i386 MacBookPro10,1 Darwin > java version "1.8.0_31" > Java(TM) SE Runtime Environment (build 1.8.0_31-b13) > Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode) > Reporter: Sebastiano Vigna > Assignee: Suneel Marthi > Labels: legacy, math, scala > Attachments: fastutil.patch, speed-fastutil, speed-std > > > The collections currently used by Mahout to implement sparse vectors are > extremely slow. The proposed patch (localized to RandomAccessSparseVector) > uses fastutil's maps and the speed improvements in vector benchmarks are very > significant. It would be interesting to see whether these improvements > percolate to high-level classes using sparse vectors. > I had to patch two unit tests (an off-by-one bug and an overfitting bug; both > were exposed by the different order in which key/values were returned by > iterators). > The included files speed-std and speed-fastutil show the speed improvement. > Some more speed might be gained by using everywhere the standard > java.util.Map.Entry interface instead of Element. > DISCLAIMER: The "Times" set of tests has been run multiplying two identical > vectors. The standard tests multiply two random vectors, so in fact they just > test the speed of the underlying map remove() method, as almost all products > are zero. This is not very realistic and was heavily penalizing fastutil's > "true deletions". Better tests, with a typical overlap of nonzero entries, > would be even more realistic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)