Sean, is Koloboke smaller? And why does, pardon me, size matter?

On Sun, Jan 18, 2015 at 3:18 AM, Sean Owen <sro...@gmail.com> wrote:

> FWIW I prefer Koloboke at this stage. Fastutil is Apache licensed, both
> are. The only drawback is the massive size of the dependency. The
> resulting assembly artifact could be slimmed to only include required
> classes. Yes at this stage I would have used a library though and not
> made custom code.
>
> On Jan 17, 2015 10:35 PM, "Sebastiano Vigna" <vi...@di.unimi.it> wrote:
>
> > Dear developers,
> >
> > I'm writing to suggest to improve significantly Mahout's speed by
> > replacing the current, Colt-based collections with faster collections.
> > These are results from benchmarks at java-performance.info comparing
> > fastutil and Mahout in get operations (Mahout collections were not
> > included in the java-performance.info tests):
> >
> > tests.maptests.primitive.MahoutMapTest (10000) = 2176.1182139999996
> > tests.maptests.primitive.FastUtilMapTest (10000) = 782.8528527999999
> > tests.maptests.primitive.MahoutMapTest (100000) = 2630.1235654
> > tests.maptests.primitive.FastUtilMapTest (100000) = 1074.9035660000002
> > tests.maptests.primitive.MahoutMapTest (1000000) = 3969.1322968
> > tests.maptests.primitive.FastUtilMapTest (1000000) = 1940.7466792
> >
> > This is with fastutil 6.6.1, which is comparable in speed to Koloboke
> > or the GS collections (the java-performance.info tests use an older,
> > slower version), and, I believe, faster for the purposes of Mahout.
> > Get operations in Mahout collections are 2-3x slower.
> >
> > I modified locally RandomAccessSparseVector to use fastutil, and run
> > some of the VectorBenchmarks.
> >
> > 0 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Create
> > (copy) RandSparseVector
> >     mean = 12.57us; mean = 64.88us;
> > 32935 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Create
> > (incrementally) RandSparseVector
> >     mean = 31.77us; mean = 79.33us;
> > 244212 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Plus
> > RandSparseVector
> >     mean = 47.36us; mean = 101.63us;
> >
> > On the left you can find the fastutil timings, on the right the Mahout
> > timings. The only case in which I saw a slowdown is for some
> > dense/sparse products:
> >
> > 429433 [main] INFO org.apache.mahout.benchmark.VectorBenchmarks - Times
> > Rand.fn(Dense)
> >     mean = 78us; mean = 52.47us;
> >
> > but I think this is due to the different way removals are handled:
> > Mahout uses tombstones (and thus slows down all subsequent
> > operations), whereas fastutil does true deletions, which are slightly
> > slower at remove time, but make subsequent operations faster. Also,
> > iteration over a fastutil-based RandomAccessSparseVector is slowed
> > down by having to return non-standard Element instances instead of
> > Map.Entry instances (as fastutil or the JDK would do naturally).
> >
> > If you'd like to benchmark the speed at a high level, the one-file
> > drop-in is included (you'll need to add fastutil 6.6.1 as a dependency
> > to mahout-math). As I said, things can be improved by using a standard
> > Map.Entry (Int2DoubleMap.Entry) instead of Element. But this is a more
> > pervasive change.
> >
> > Ciao,
> >
> > seba
> >
> > PS: One caveat: presently fastutil does not shrink backing arrays,
> > which might not be what you want. It will, however, from the next
> > release.
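
For readers following the tombstone versus true-deletion point in the quoted message, here is a minimal sketch of the two removal strategies on a toy linear-probing set of int keys. It is not the code of Mahout or fastutil: the class `ProbingSet`, the identity hash `key & mask`, and the helper `probesToFind48` are all hypothetical names made up for illustration, and Mahout's Colt-derived maps use a different probing scheme. The sketch only shows why a lookup that has to walk past tombstones inspects more slots than one on a table where removal shifted the later entries back.

```java
/** Toy linear-probing set of int keys with two removal strategies (illustration only). */
final class ProbingSet {
    private static final byte FREE = 0, FULL = 1, REMOVED = 2;

    private final int[] keys;
    private final byte[] state;
    private final int mask;
    private final boolean useTombstones;
    long probes; // slots inspected by contains(), used to compare the strategies

    /** capacity must be a power of two; the caller must always leave at least one FREE slot. */
    ProbingSet(int capacity, boolean useTombstones) {
        this.keys = new int[capacity];
        this.state = new byte[capacity];
        this.mask = capacity - 1;
        this.useTombstones = useTombstones;
    }

    boolean add(int key) {
        int i = key & mask; // hypothetical identity hash, chosen so collisions are easy to force
        int firstTombstone = -1;
        while (state[i] != FREE) {
            if (state[i] == FULL && keys[i] == key) return false;
            if (state[i] == REMOVED && firstTombstone < 0) firstTombstone = i;
            i = (i + 1) & mask;
        }
        if (firstTombstone >= 0) i = firstTombstone; // reuse a tombstone slot if one was seen
        keys[i] = key;
        state[i] = FULL;
        return true;
    }

    boolean contains(int key) {
        int i = key & mask;
        while (state[i] != FREE) { // tombstones do not stop the search
            probes++;
            if (state[i] == FULL && keys[i] == key) return true;
            i = (i + 1) & mask;
        }
        probes++; // the FREE slot that ended the search
        return false;
    }

    boolean remove(int key) {
        int i = key & mask;
        while (state[i] != FREE) {
            if (state[i] == FULL && keys[i] == key) {
                if (useTombstones) state[i] = REMOVED; // cheap now, costly for later lookups
                else shiftBack(i);                     // more work now, shorter chains later
                return true;
            }
            i = (i + 1) & mask;
        }
        return false;
    }

    /** True deletion: move later entries of the same probe chain back into the hole. */
    private void shiftBack(int hole) {
        int i = hole;
        while (true) {
            i = (i + 1) & mask;
            if (state[i] == FREE) break;
            int home = keys[i] & mask;
            // The entry at i may move only if its home slot is not cyclically inside (hole, i].
            boolean homeInRange = hole <= i ? (hole < home && home <= i)
                                            : (hole < home || home <= i);
            if (!homeInRange) {
                keys[hole] = keys[i];
                hole = i;
            }
        }
        state[hole] = FREE;
    }

    /** Inserts four colliding keys, removes three, and counts the probes needed to find the last. */
    static long probesToFind48(boolean useTombstones) {
        ProbingSet s = new ProbingSet(16, useTombstones);
        for (int k : new int[] {0, 16, 32, 48}) s.add(k);
        for (int k : new int[] {0, 16, 32}) s.remove(k);
        s.probes = 0;
        if (!s.contains(48)) throw new IllegalStateException("48 should still be present");
        return s.probes;
    }

    public static void main(String[] args) {
        System.out.println("tombstones: " + probesToFind48(true) + " probes");
        System.out.println("shifting:   " + probesToFind48(false) + " probes");
    }
}
```

In this toy setup the tombstone variant inspects 4 slots to find key 48, while the shifting variant inspects 1, at the price of the extra moves done inside `remove`. Real workloads mix inserts and removals, so the actual gap depends on the removal rate and on when the table is rehashed.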