Dear developers,
I'm writing to suggest to improve significantly Mahout's speed by replacing the 
current, Colt-based collections with faster collections. These are results from 
benchmarks at java-performance.info comparing fastutil and Mahout in get 
operations (Mahout collections were not included in the java-performance.info 
tests):

tests.maptests.primitive.MahoutMapTest (10000) = 2176.1182139999996
tests.maptests.primitive.FastUtilMapTest (10000) = 782.8528527999999
tests.maptests.primitive.MahoutMapTest (100000) = 2630.1235654
tests.maptests.primitive.FastUtilMapTest (100000) = 1074.9035660000002
tests.maptests.primitive.MahoutMapTest (1000000) = 3969.1322968
tests.maptests.primitive.FastUtilMapTest (1000000) = 1940.7466792

This is with fastutil 6.6.1, which is comparable in speed to Koloboke or the GS 
collections (the java-performance.info tests use an older, slower version), 
and, I believe, faster for the purposes of Mahout. Get operations in Mahout 
collections are 2-3x slower.

I modified locally RandomAccessSparseVector to use fastutil, and run some of 
the VectorBenchmarks.

0    [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Create (copy) 
RandSparseVector        mean   = 12.57us;       mean   = 64.88us;
32935 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Create 
(incrementally) RandSparseVector
mean   = 31.77us;       mean   = 79.33us;
244212 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Plus 
RandSparseVector 
mean   = 47.36us;       mean   = 101.63us;

On the left you can find the fastutil timings, on the right the Mahout timings. 
The only case in which I saw a slowdown is for some dense/sparse products:

429433 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Times 
Rand.fn(Dense)        mean   = 78us;  mean   = 52.47us;

but I think this is due to the different way removals are handled: Mahout uses 
tombstones (and thus slows down all subsequent operations), whereas fastutil 
does true deletions, which are slightly slower at remove time, but make 
subsequent operations faster. Also, iteration over a fastutil-based 
RandomAccessSparseVector is slowed down by having to return non-standard 
Element instances instead of Map.Entry instances (as fastutil or the JDK would 
do naturally).

If you'd like to benchmark the speed at a high level, the one-file drop-in is 
included (you'll need to add fastutil 6.6.1 as a dependency to mahout-math). As 
I said, things can be improved by using a standard Map.Entry 
(Int2DoubleMap.Entry) instead of Element. But this is a more pervasive change.

Ciao,

                                        seba


PS: One caveat: presently fastutil does not shrink backing arrays, which might 
not be what you want. It will, however, from the next release.

Reply via email to