Sean, is Koloboke smaller? And why does, pardon me, size matter?

On Sun, Jan 18, 2015 at 3:18 AM, Sean Owen <sro...@gmail.com> wrote:

> FWIW I prefer Koloboke at this stage. Fastutil is Apache licensed, both
> are. The only drawback is the massive size of the dependency. The resulting
> assembly artifact could be slimmed to only include required classes. Yes at
> this stage I would have used a library though and not made custom code.
> On Jan 17, 2015 10:35 PM, "Sebastiano Vigna" <vi...@di.unimi.it> wrote:
>
> > Dear developers,
> > I'm writing to suggest to improve significantly Mahout's speed by
> > replacing the current, Colt-based collections with faster collections.
> > These are results from benchmarks at java-performance.info comparing
> > fastutil and Mahout in get operations (Mahout collections were not
> included
> > in the java-performance.info tests):
> >
> > tests.maptests.primitive.MahoutMapTest (10000) = 2176.1182139999996
> > tests.maptests.primitive.FastUtilMapTest (10000) = 782.8528527999999
> > tests.maptests.primitive.MahoutMapTest (100000) = 2630.1235654
> > tests.maptests.primitive.FastUtilMapTest (100000) = 1074.9035660000002
> > tests.maptests.primitive.MahoutMapTest (1000000) = 3969.1322968
> > tests.maptests.primitive.FastUtilMapTest (1000000) = 1940.7466792
> >
> > This is with fastutil 6.6.1, which is comparable in speed to Koloboke or
> > the GS collections (the java-performance.info tests use an older, slower
> > version), and, I believe, faster for the purposes of Mahout. Get
> operations
> > in Mahout collections are 2-3x slower.
> >
> > I modified locally RandomAccessSparseVector to use fastutil, and run some
> > of the VectorBenchmarks.
> >
> > 0    [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Create
> > (copy) RandSparseVector        mean   = 12.57us;       mean   = 64.88us;
> > 32935 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Create
> > (incrementally) RandSparseVector
> > mean   = 31.77us;       mean   = 79.33us;
> > 244212 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Plus
> > RandSparseVector
> > mean   = 47.36us;       mean   = 101.63us;
> >
> > On the left you can find the fastutil timings, on the right the Mahout
> > timings. The only case in which I saw a slowdown is for some dense/sparse
> > products:
> >
> > 429433 [main] INFO  org.apache.mahout.benchmark.VectorBenchmarks  - Times
> > Rand.fn(Dense)        mean   = 78us;  mean   = 52.47us;
> >
> > but I think this is due to the different way removals are handled: Mahout
> > uses tombstones (and thus slows down all subsequent operations), whereas
> > fastutil does true deletions, which are slightly slower at remove time,
> but
> > make subsequent operations faster. Also, iteration over a fastutil-based
> > RandomAccessSparseVector is slowed down by having to return non-standard
> > Element instances instead of Map.Entry instances (as fastutil or the JDK
> > would do naturally).
> >
> > If you'd like to benchmark the speed at a high level, the one-file
> drop-in
> > is included (you'll need to add fastutil 6.6.1 as a dependency to
> > mahout-math). As I said, things can be improved by using a standard
> > Map.Entry (Int2DoubleMap.Entry) instead of Element. But this is a more
> > pervasive change.
> >
> > Ciao,
> >
> >                                         seba
> >
> >
> >
> >
> > PS: One caveat: presently fastutil does not shrink backing arrays, which
> > might not be what you want. It will, however, from the next release.
> >
> >
> >
>

Reply via email to