Nice. There is still huge potential for further optimization in the Spark bindings.
-s

On 05.07.2014 at 15:21, "Andrew Musselman" <andrew.mussel...@gmail.com> wrote:
> Crazy awesome.
>
> On Jul 5, 2014, at 4:19 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >
> > I compared spark-itemsimilarity to the Hadoop version on sample data
> > that is 8.7 M, 49290 x 139738, using my little 2-machine cluster and
> > got the following speedup.
> >
> > Platform        Elapsed Time
> > Mahout Hadoop   0:20:37
> > Mahout Spark    0:02:19
> >
> > This isn’t quite apples to apples because the Spark version does all
> > the dictionary management, which is usually two extra jobs tacked on
> > before and after the Hadoop job. I’ve done the complete pipeline using
> > both Hadoop and Spark now and can say that not only is the Spark
> > version faster, but the old Hadoop way required keeping track of 10x
> > more intermediate data and connecting up many more jobs to get the
> > pipeline working. Now it’s just one job. You don’t need to worry about
> > ID translation anymore and you get over 10x faster completion: this is
> > one of those times when speed meets ease-of-use.
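For reference, the raw single-job speedup implied by the two elapsed times above can be checked with a quick calculation; the "over 10x" figure Pat quotes refers to the complete pipeline, which on Hadoop adds the two extra dictionary-management jobs. A quick sketch of the arithmetic:

```python
def to_seconds(hms: str) -> int:
    """Parse an H:MM:SS elapsed-time string into seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

hadoop = to_seconds("0:20:37")   # 1237 seconds
spark = to_seconds("0:02:19")    # 139 seconds

# Ratio of the two reported job times (excludes the extra Hadoop
# dictionary jobs, so the full-pipeline speedup would be larger).
speedup = hadoop / spark
print(f"Raw job speedup: {speedup:.1f}x")  # about 8.9x
```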
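The "dictionary management" mentioned above is the translation between external user/item IDs and the contiguous integer indices the matrix computation works on, plus the reverse mapping to label the output. As a minimal illustration of what those two bracketing jobs do (the names and structure here are hypothetical, not Mahout's actual implementation):

```python
# Hypothetical sketch of the ID-translation ("dictionary") step that the
# Hadoop pipeline ran as separate jobs before and after item-similarity.
def build_dictionary(ids):
    """Assign each distinct external ID a contiguous integer index."""
    dictionary = {}
    for external_id in ids:
        if external_id not in dictionary:
            dictionary[external_id] = len(dictionary)
    return dictionary

interactions = [("u1", "ipad"), ("u2", "iphone"), ("u1", "iphone")]
users = build_dictionary(u for u, _ in interactions)
items = build_dictionary(i for _, i in interactions)

# Forward translation: external IDs -> matrix coordinates.
coords = [(users[u], items[i]) for u, i in interactions]

# Reverse dictionary: matrix indices -> external item IDs for the output.
item_names = {index: name for name, index in items.items()}
print(coords)         # [(0, 0), (1, 1), (0, 1)]
print(item_names[0])  # ipad
```

Folding both directions of this mapping into the one Spark job is what removes the bookkeeping of intermediate ID files that the Hadoop pipeline required.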