Hey all,

Thanks for the offers of help. I've been able to narrow down some of the
problems to version incompatibilities and just wanted to give an update.
To backtrack a bit, my initial goal was to run Mahout on a distributed
cluster, whether that cluster was running Hadoop MapReduce or Spark.

I started out trying to get it to run on Spark, which I have some
familiarity with, but that didn't seem to work. While the error message
seems to indicate there weren't enough resources on the workers ("WARN
scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory"), I'm pretty sure that wasn't the case: not only is it
a 4-node cluster of m4.xlarges, but I was also able to run another,
simpler Spark batch job on that same distributed cluster.

After a bit of wrangling, I was able to narrow down some of the issues. It
turns out I was somewhat blindly using this repo
https://github.com/pferrel/3-input-cooc as a guide without fully realizing
that it is several years old and based on Mahout 0.10.0, Scala 2.10 and
Spark 1.1.1. That is significantly different from my environment, which
has Mahout 0.13.0 and Spark 2.1.1 installed, which in turn means I have to
use Scala 2.11. After modifying the build.sbt file to account for those
versions, I now get type mismatch errors at compile time that I'm not
savvy enough to fix (see attached screenshot if you're interested).
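
For reference, here is roughly what my modified build.sbt looks like now.
Treat it as a sketch rather than something known-good: it assumes the
Scala 2.11 / Spark 2.1 Mahout artifacts (e.g. mahout-spark_2.11) are
sitting in my local Maven repository from a source build like the one
Trevor describes below, and the exact artifact names and versions are my
best guesses.

    // build.sbt (sketch): Mahout 0.13.0 on Spark 2.1.1 / Scala 2.11.
    // Assumes Mahout was built locally with -Pspark-2.1,scala-2.11 and
    // installed into the local Maven repo, so the _2.11 artifacts resolve.
    name := "mahout-spark-example"
    version := "0.1"
    scalaVersion := "2.11.11"

    // pick up the locally built Mahout jars
    resolvers += Resolver.mavenLocal

    libraryDependencies ++= Seq(
      // Spark itself is provided by the cluster at runtime
      "org.apache.spark" %% "spark-core"        % "2.1.1" % "provided",
      // Mahout math, Scala DSL and the Spark bindings
      "org.apache.mahout"  % "mahout-math"       % "0.13.0",
      "org.apache.mahout" %% "mahout-math-scala" % "0.13.0",
      "org.apache.mahout" %% "mahout-spark"      % "0.13.0",
      "org.apache.mahout"  % "mahout-hdfs"       % "0.13.0"
    )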

Anyway, the good news is that I was finally able to get Mahout code running
on Hadoop MapReduce, though also after a bit of wrangling. It turned out my
instances were running Ubuntu 14 and apparently that doesn't play well with
Hadoop 2.7.4, which prevented me from running any of the sample Mahout code
(from here: https://github.com/apache/mahout/tree/master/examples/bin) that
relied on MapReduce. Those problems went away after I installed Hadoop
2.8.1 instead. Now I'm able to get the shell scripts running on a
distributed Hadoop cluster (yay!).

Anyway, if anyone can point me to more recent, working Spark Scala code
that uses Mahout, I'd appreciate it.
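
In the meantime, below is the bare-bones driver I plan to try next,
following Trevor's two suggestions (building against Spark 2.1 / Scala
2.11 and enabling Kryo). It's untested, and it assumes the Spark bindings
expose mahoutSparkContext and a Kryo registrator at
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator the way I've read
in the docs; if either of those assumptions is wrong, corrections welcome.

    import org.apache.spark.SparkConf
    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    object MahoutClusterSmokeTest extends App {

      // Mahout's Spark bindings need Kryo serialization plus Mahout's
      // own registrator.
      val conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrator",
             "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")

      // Mahout-aware Spark context pointed at the cluster master
      // ("spark://<master-host>:7077" is a placeholder).
      implicit val ctx = mahoutSparkContext(
        masterUrl = "spark://<master-host>:7077",
        appName   = "mahout-smoke-test",
        sparkConf = conf)

      // Tiny in-core matrix, parallelized into a distributed row matrix
      // (DRM), then a distributed A'A to force work onto the executors.
      val a    = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
      val drmA = drmParallelize(a, numPartitions = 2)
      val ata  = (drmA.t %*% drmA).collect

      println(ata)
    }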

Many thanks!
Hoa

On Fri, Sep 22, 2017 at 1:09 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> Hi Hoa,
>
> A few things could be happening here; I haven't run across that specific
> error.
>
> 1) Spark 2.x - Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x, however
> you need to build from source (not the binaries).  You can do this by
> downloading mahout source or cloning the repo and building with:
> mvn clean install -Pspark-2.1,scala-2.11 -DskipTests
>
> 2) Have you set up Spark with Kryo serialization? How you do this depends
> on whether you're in the shell/Zeppelin or using spark-submit.
>
> However, for both of these cases it shouldn't even have run locally, AFAIK,
> so the fact that it did tells me you've probably already gotten this far?
>
> Assuming you've done 1 and 2, can you share some code? I'll see if I can
> recreate on my end.
>
> Thanks!
>
> tg
>
> On Thu, Sep 21, 2017 at 9:37 PM, Hoa Nguyen <h...@insightdatascience.com>
> wrote:
>
> > I apologize in advance if this is too much of a newbie question but I'm
> > having a hard time running any Mahout example code in a distributed Spark
> > cluster. The code runs as advertised when Spark is running locally on one
> > machine but the minute I point Spark to a cluster and master url, I can't
> > get it to work, drawing the error: "WARN scheduler.TaskSchedulerImpl:
> > Initial job has not accepted any resources; check your cluster UI to
> > ensure that workers are registered and have sufficient memory"
> >
> > I know my Spark cluster is configured and working correctly because I ran
> > non-Mahout code and it runs on a distributed cluster fine. What am I
> > doing wrong? The only thing I can think of is that my Spark version is too
> > recent -- 2.1.1 -- for the Mahout version I'm using -- 0.13.0. Is that it
> > or am I doing something else wrong?
> >
> > Thanks for any advice,
> > Hoa
> >
>
