Re: Running Mahout on a Spark cluster

Pat Ferrel Tue, 03 Oct 2017 12:59:04 -0700

Actually, if you require Scala 2.11 and Spark 2.1 you have to use the current 
master (0.13.0 does not support these) and also can’t use sbt, unless you have 
some trick I haven’t discovered.



On Oct 3, 2017, at 12:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

I’m the aforementioned pferrel.

@Hoa, thanks for that reference, I forgot I had that example. First, don’t use 
the Hadoop part of Mahout; it is not supported and will be deprecated. The 
Spark version of cooccurrence will be supported. You’ll find it in the 
SimilarityAnalysis object.

If you go back to the last release you should be able to make 
https://github.com/pferrel/3-input-cooc work with version updates to 
Mahout 0.13.0 and its dependencies. To use the latest master of Mahout, there are 
the problems listed below.


I’m having a hard time building with sbt using the mahout-spark module after I 
build the latest Mahout master with `mvn clean install`. This puts the 
mahout-spark module in the local ~/.m2 Maven cache, but the resulting path and 
file names don’t match what sbt expects.

The build.sbt `libraryDependencies` line *should*, IMO, be:
`"org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT"`

sbt resolves this to the following path (`%%` appends the Scala binary version to the artifact name, and so to its directory):
org/apache/mahout/mahout-spark-2.1_2.11/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

Unfortunately, the outcome of `mvn clean install` currently is (I think):
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

I can’t find a way to make SBT parse that structure and name.
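
[Editor's sketch, not part of the original message.] Since the installed path above is given only as "I think", one way to confirm it is to list what `mvn clean install` actually placed in the local Maven repository and compare that with the path sbt requests. The snippet below assumes the default repository location `~/.m2/repository`; the `M2_REPO` override variable and the function name are made up for this illustration.

```shell
#!/bin/sh
# Assumed default location of the local Maven repository; override M2_REPO
# (a hypothetical variable used only in this sketch) if yours differs.
M2_REPO="${M2_REPO:-$HOME/.m2/repository}"

# Print every Mahout Spark jar found under the given repository root,
# relative to that root, so the layout can be compared with what sbt asks for.
list_mahout_spark_jars() {
  repo="$1"
  # Nothing to list if Mahout was never installed into this repository.
  [ -d "$repo/org/apache/mahout" ] || return 0
  (cd "$repo" && find org/apache/mahout -type f -name 'mahout-spark*.jar' | sort)
}

list_mahout_spark_jars "$M2_REPO"
```

Each printed line has the form group-path/artifact/version/file, which can be read directly against the two paths quoted above.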


On Oct 2, 2017, at 11:02 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Code pointer:
https://github.com/rawkintrevo/cylons/tree/master/eigenfaces

However, I build Mahout (0.13.1-SNAPSHOT) locally with

mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests

That's how maven was able to pick those up.


On Fri, Sep 22, 2017 at 10:06 PM, Hoa Nguyen <h...@insightdatascience.com>
wrote:

> Hey all,
> 
> Thanks for the offers of help. I've been able to narrow down some of the
> problems to version incompatibility and I just wanted to give an update.
> Just to backtrack a bit, my initial goal was to run Mahout on a
> distributed cluster, whether that was running Hadoop Map Reduce or Spark.
> 
> I started out trying to get it to run on Spark, with which I have some
> familiarity, but that didn't seem to work. While the error messages seem to
> indicate there weren't enough resources on the workers ("WARN
> scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
> check your cluster UI to ensure that workers are registered and have
> sufficient memory"), I'm pretty sure that wasn't the case, not only because
> it's a 4-node cluster of m4.xlarges, but also because I was able to run
> another, simpler Spark batch job on that same distributed cluster.
> 
> After a bit of wrangling, I was able to narrow down some of the issues. It
> turns out I was kind of blindly using this repo
> https://github.com/pferrel/3-input-cooc as a guide without fully realizing
> that it was from several years ago and based on Mahout 0.10.0, Scala 2.10
> and Spark 1.1.1. That is significantly different from my environment, which
> has Mahout 0.13.0 and Spark 2.1.1 installed, which also means I have to use
> Scala 2.11. After modifying the build.sbt file to account for those
> versions, I now have compile-time type mismatch errors that I'm just not
> savvy enough to fix (see attached screenshot if you're interested).
> 
> Anyway, the good news is that I was finally able to get Mahout code running
> on Hadoop map-reduce, though only after a bit of wrangling. It turned out my
> instances were running Ubuntu 14 and apparently that doesn't play well with
> Hadoop 2.7.4, which prevented me from running any sample Mahout code (from
> here: https://github.com/apache/mahout/tree/master/examples/bin) that
> relied on map-reduce. Those problems went away after I installed Hadoop
> 2.8.1 instead. Now I'm able to get the shell scripts running on a
> distributed Hadoop cluster (yay!).
> 
> Anyway, if anyone has more recent and working Spark Scala code that uses
> Mahout that they can point me to, I'd appreciate it.
> 
> Many thanks!
> Hoa
> 
> On Fri, Sep 22, 2017 at 1:09 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
> 
>> Hi Hoa,
>> 
>> A few things could be happening here; I haven't run across that specific
>> error.
>> 
>> 1) Spark 2.x - Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x, however
>> you need to build from source (not the binaries).  You can do this by
>> downloading the Mahout source or cloning the repo and building with:
>> mvn clean install -Pspark-2.1,scala-2.11 -DskipTests
>> 
>> 2) Have you set up Spark with Kryo serialization? How you do this depends on
>> whether you're in the shell/Zeppelin or using spark-submit.
>> 
>> However, in both of these cases it shouldn't have even run locally, afaik, so
>> the fact that it did tells me you have probably gotten this far?
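
[Editor's sketch, not part of the original message.] For the spark-submit case in point 2, Kryo is usually switched on with `--conf` flags. The snippet below is a minimal illustration: `spark.serializer` and `spark.kryo.registrator` are standard Spark settings, and the registrator class is the one Mahout's Spark bindings provide. The master URL, main class, and jar path are placeholders made up for this example, and the command is only echoed so it can be inspected before running.

```shell
#!/bin/sh
# Print the spark-submit flags that switch serialization to Kryo and
# register Mahout's classes with it, one flag pair per line.
mahout_kryo_flags() {
  printf '%s\n' \
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    "--conf spark.kryo.registrator=org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator"
}

# Hypothetical values: replace with your own master URL, main class and jar.
MASTER_URL="spark://master-host:7077"
APP_CLASS="com.example.CooccurrenceDriver"
APP_JAR="target/scala-2.11/my-app.jar"

# Echo the command instead of running it. $(mahout_kryo_flags) is left
# unquoted on purpose so each flag becomes a separate argument.
echo spark-submit --master "$MASTER_URL" $(mahout_kryo_flags) \
  --class "$APP_CLASS" "$APP_JAR"
```

In the Spark shell or Zeppelin the same two properties would be set in the interpreter or shell configuration instead of on the command line.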
>> 
>> Assuming you've done 1 and 2, can you share some code? I'll see if I can
>> recreate on my end.
>> 
>> Thanks!
>> 
>> tg
>> 
>> On Thu, Sep 21, 2017 at 9:37 PM, Hoa Nguyen <h...@insightdatascience.com>
>> wrote:
>> 
>>> I apologize in advance if this is too much of a newbie question but I'm
>>> having a hard time running any Mahout example code in a distributed Spark
>>> cluster. The code runs as advertised when Spark is running locally on one
>>> machine but the minute I point Spark to a cluster and master url, I can't
>>> get it to work, drawing the error: "WARN scheduler.TaskSchedulerImpl:
>>> Initial job has not accepted any resources; check your cluster UI to ensure
>>> that workers are registered and have sufficient memory"
>>> 
>>> I know my Spark cluster is configured and working correctly because I ran
>>> non-Mahout code and it runs on a distributed cluster fine. What am I doing
>>> wrong? The only thing I can think of is that my Spark version is too recent
>>> -- 2.1.1 -- for the Mahout version I'm using -- 0.13.0. Is that it or am I
>>> doing something else wrong?
>>> 
>>> Thanks for any advice,
>>> Hoa
>>> 
>> 
> 
>
