FWIW I prefer Koloboke at this stage. Fastutil is Apache licensed, both
are. The only drawback is the massive size of the dependency. The resulting
assembly artifact could be slimmed to only include required classes. Yes at
this stage I would have used a library though and not made custom code.
On Jan 17, 2015 10:35 PM, "Sebastiano Vigna" <vi...@di.unimi.it> wrote:

> Dear developers,
> I'm writing to suggest to improve significantly Mahout's speed by
> replacing the current, Colt-based collections with faster collections.
> These are results from benchmarks at java-performance.info comparing
> fastutil and Mahout in get operations (Mahout collections were not included
> in the java-performance.info tests):
>
> tests.maptests.primitive.MahoutMapTest (10000) = 2176.1182139999996
> tests.maptests.primitive.FastUtilMapTest (10000) = 782.8528527999999
> tests.maptests.primitive.MahoutMapTest (100000) = 2630.1235654
> tests.maptests.primitive.FastUtilMapTest (100000) = 1074.9035660000002
> tests.maptests.primitive.MahoutMapTest (1000000) = 3969.1322968
> tests.maptests.primitive.FastUtilMapTest (1000000) = 1940.7466792
>
> This is with fastutil 6.6.1, which is comparable in speed to Koloboke or
> the GS collections (the java-performance.info tests use an older, slower
> version), and, I believe, faster for the purposes of Mahout. Get operations
> in Mahout collections are 2-3x slower.
>
> I modified locally RandomAccessSparseVector to use fastutil, and run some
> of the VectorBenchmarks.
>
> 0    [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Create
> (copy) RandSparseVector        mean   = 12.57us;       mean   = 64.88us;
> 32935 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Create
> (incrementally) RandSparseVector
> mean   = 31.77us;       mean   = 79.33us;
> 244212 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Plus
> RandSparseVector
> mean   = 47.36us;       mean   = 101.63us;
>
> On the left you can find the fastutil timings, on the right the Mahout
> timings. The only case in which I saw a slowdown is for some dense/sparse
> products:
>
> 429433 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Times
> Rand.fn(Dense)        mean   = 78us;  mean   = 52.47us;
>
> but I think this is due to the different way removals are handled: Mahout
> uses tombstones (and thus slows down all subsequent operations), whereas
> fastutil does true deletions, which are slightly slower at remove time, but
> make subsequent operations faster. Also, iteration over a fastutil-based
> RandomAccessSparseVector is slowed down by having to return non-standard
> Element instances instead of Map.Entry instances (as fastutil or the JDK
> would do naturally).
>
> If you'd like to benchmark the speed at a high level, the one-file drop-in
> is included (you'll need to add fastutil 6.6.1 as a dependency to
> mahout-math). As I said, things can be improved by using a standard
> Map.Entry (Int2DoubleMap.Entry) instead of Element. But this is a more
> pervasive change.
>
> Ciao,
>
>                                         seba
>
>
>
>
> PS: One caveat: presently fastutil does not shrink backing arrays, which
> might not be what you want. It will, however, from the next release.
>
>
>

Reply via email to