Nice. There is even still a huge potential for optimization in the spark
bindings.

-s
On Jul 5, 2014 at 15:21, "Andrew Musselman" <andrew.mussel...@gmail.com> wrote:

> Crazy awesome.
>
> > On Jul 5, 2014, at 4:19 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >
> > I compared spark-itemsimilarity to the Hadoop version on sample data
> that is 8.7 M, 49290 x 139738, using my little two-machine cluster, and got
> the following speedup.
> >
> > Platform         Elapsed Time
> > Mahout Hadoop    0:20:37
> > Mahout Spark     0:02:19
> >
> > This isn’t quite apples to apples, because the Spark version also does
> all the dictionary management, which usually means two extra jobs tacked on
> before and after the Hadoop job. I’ve now run the complete pipeline on both
> Hadoop and Spark, and the old Hadoop way was not only slower but required
> keeping track of 10x more intermediate data and wiring up many more jobs to
> get the pipeline working. Now it’s just one job. You don’t need to worry
> about ID translation anymore and you get over 10x faster completion; this
> is one of those times when speed meets ease-of-use.
>
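For readers unfamiliar with the "dictionary management" Pat mentions: the older Hadoop pipeline needed separate jobs to map external string IDs to contiguous integer indices before the matrix math, and to map results back afterwards. The sketch below illustrates that translation step in plain Python; the function names and data are hypothetical, not Mahout's actual API.

```python
# Illustrative sketch of ID translation ("dictionary management"), the step
# the Spark version folds into one job. All names here are made up for the
# example; this is not Mahout code.

def build_dictionary(ids):
    """Assign each distinct external ID a contiguous integer index."""
    dictionary = {}
    for ext_id in ids:
        if ext_id not in dictionary:
            dictionary[ext_id] = len(dictionary)
    return dictionary

def translate(interactions, user_dict, item_dict):
    """Turn (user, item) string pairs into (row, col) integer pairs."""
    return [(user_dict[u], item_dict[i]) for u, i in interactions]

def reverse(dictionary):
    """Invert index -> external ID, for translating results back."""
    return {idx: ext_id for ext_id, idx in dictionary.items()}

# Hypothetical interaction log: (user ID, item ID) pairs.
interactions = [("u1", "ipad"), ("u2", "iphone"), ("u1", "iphone")]
users = build_dictionary(u for u, _ in interactions)
items = build_dictionary(i for _, i in interactions)
matrix_cells = translate(interactions, users, items)
item_names = reverse(items)
```

Here `matrix_cells` is what the similarity math operates on, and `item_names` is what a post-processing job would use to emit human-readable results; in the Hadoop pipeline these were the two extra jobs before and after the main one.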
