Sebastiano Vigna created MAHOUT-1640:
----------------------------------------
Summary: Better collections would significantly improve
vector-operation speed
Key: MAHOUT-1640
URL: https://issues.apache.org/jira/browse/MAHOUT-1640
Project: Mahout
Issue Type: Improvement
Components: collections
Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version 14.1.0:
Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64 i386
MacBookPro10,1 Darwin
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
Reporter: Sebastiano Vigna
The collections currently used by Mahout to implement sparse vectors are
extremely slow. The proposed patch (localized to RandomAccessSparseVector) uses
fastutil's maps and the speed improvements in vector benchmarks are very
significant. It would be interesting to see whether these improvements
percolate to high-level classes using sparse vectors.
I had to patch two unit tests (an off-by-one bug and an overfitting bug; both
were exposed by the different order in which key/values were returned by
iterators).
The included files speed-std and speed-fastutil show the speed improvement.
Some more speed might be gained by using everywhere the standard
java.util.Map.Entry interface instead of Element.
DISCLAIMER: The "Times" set of tests has been run multiplying two identical
vectors. The standard tests multiply two random vectors, so in fact they just
test the speed of the underlying map remove() method, as almost all products
are zero. This is not very realistic and was heavily penalizing fastutil's
"true deletions". Better tests, with a typical overlap of nonzero entries,
would be even more realistic.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)