Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
Thanks Trevor,

This encoding leaves the Scala version hard-coded, but it's an appreciated
clue and will get me going. There may be a way to use %% with this, or I can
just explicitly add the Scala version string.
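
Something like this might work, though I haven't verified it (sbt's %% appends
the project's scalaVersion suffix to the artifact name, so it should resolve
the same mahout-spark_2.11 artifact):

libraryDependencies += "org.apache.mahout" %% "mahout-spark" %
  "0.13.1-SNAPSHOT" classifier "spark_2.1"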

@Hoa, I plan to update that repo.


Re: Running Mahout on a Spark cluster

2017-10-03 Thread Trevor Grant
The Spark version is included via a Maven classifier.

The sbt line should be:

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"


Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
Actually, if you require Scala 2.11 and Spark 2.1 you have to use the current
master (0.13.0 does not support these), and you also can’t use sbt, unless you
have some trick I haven’t discovered.


Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
I’m the aforementioned pferrel.

@Hoa, thanks for that reference, I forgot I had that example. First, don’t use
the Hadoop part of Mahout; it is not supported and will be deprecated. The
Spark version of cooccurrence will be supported; you’ll find it in the
SimilarityAnalysis object.
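
Roughly, usage looks like this (a sketch only; the imports are from Mahout's
math-scala module and the dataset names are placeholders for IndexedDatasets
you load yourself):

import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset

// user-item interaction data; the first dataset is the primary action
val purchasesIDS: IndexedDataset = ???  // load from your interaction logs
val viewsIDS: IndexedDataset = ???

// returns a List[IndexedDataset]: the item-item cooccurrence and
// cross-cooccurrence indicator matrices
val indicators = SimilarityAnalysis.cooccurrencesIDSs(
  Array(purchasesIDS, viewsIDS))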

If you go back to the last release you should be able to make that
https://github.com/pferrel/3-input-cooc work with version updates to Mahout
0.13.0 and dependencies. To use the latest master of Mahout, there are the
problems listed below.


I’m having a hard time building with sbt against the mahout-spark module when
I build the latest Mahout master with `mvn clean install`. That puts the
mahout-spark module in the local ~/.m2 Maven cache, but the structure doesn’t
match the path and filenames SBT expects.

The build.sbt `libraryDependencies` line *should* IMO be:
`"org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT"`

This is parsed by sbt to yield the path:
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

Unfortunately the outcome of `mvn clean install` currently is (I think):
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

I can’t find a way to make SBT parse that structure and name.


Re: Running Mahout on a Spark cluster

2017-10-03 Thread Trevor Grant
Code pointer:
https://github.com/rawkintrevo/cylons/tree/master/eigenfaces

However, I build Mahout (0.13.1-SNAPSHOT) locally with

mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests

That's how Maven was able to pick those up.


Re: Running Mahout on a Spark cluster

2017-10-03 Thread Trevor Grant
Hey, sorry for the long delay. I've been traveling.

Pat Ferrel was telling me he was having some similar issues with
Spark+Mahout+SBT recently, and that we need to re-examine our naming
conventions on JARs.

Fwiw, I have several projects that use Spark+Mahout on Spark 2.1/Scala 2.11,
and we even test this in our Travis CI builds, but the trick is we use Maven
for the build. Any chance you could use Maven? If not, maybe Pat can chime in
here; I'm just not an SBT user, so I'm not 100% sure what to tell you.
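
For what it's worth, a trimmed-down sketch of the kind of thing those projects
do (illustrative only; the master URL, app name, and matrix are made up, and
mahoutSparkContext comes from the Mahout Spark bindings):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// sets up the Spark context with Mahout's Kryo settings and jars
implicit val mahoutCtx = mahoutSparkContext(
  masterUrl = "spark://master:7077", appName = "MahoutSketch")

// distribute a small dense matrix and compute A'A on the cluster
val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5)), numPartitions = 2)
val ata = (drmA.t %*% drmA).collect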




Re: Running Mahout on a Spark cluster

2017-09-22 Thread Hoa Nguyen
Hey all,

Thanks for the offers of help. I've been able to narrow down some of the
problems to version incompatibility and I just wanted to give an update. Just
to backtrack a bit, my initial goal was to run Mahout on a distributed
cluster, whether that was running Hadoop MapReduce or Spark.

I started out trying to get it to run on Spark, with which I have some
familiarity, but that didn't seem to work. While the error messages seemed to
indicate there weren't enough resources on the workers ("WARN
scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory"), I'm pretty sure that wasn't the case: not only is it a
4-node cluster of m4.xlarges, I was also able to run another, simpler Spark
batch job on that same distributed cluster.

After a bit of wrangling, I was able to narrow down some of the issues. It
turns out I was kind of blindly using this repo
https://github.com/pferrel/3-input-cooc as a guide without fully realizing
that it was from several years ago and based on Mahout 0.10.0, Scala 2.10 and
Spark 1.1.1. That is significantly different from my environment, which has
Mahout 0.13.0 and Spark 2.1.1 installed, which also means I have to use Scala
2.11. After modifying the build.sbt file to account for those versions, I now
have compile-time type mismatch errors that I'm just not savvy enough to fix
(see attached screenshot if you're interested).

Anyway, the good news is that I was finally able to get Mahout code running on
Hadoop MapReduce, though also after a bit of wrangling. It turned out my
instances were running Ubuntu 14, and apparently that doesn't play well with
Hadoop 2.7.4, which prevented me from running any sample Mahout code (from
here: https://github.com/apache/mahout/tree/master/examples/bin) that relied
on MapReduce. Those problems went away after I installed Hadoop 2.8.1 instead.
Now I'm able to get the shell scripts running on a distributed Hadoop cluster
(yay!).

Anyway, if anyone has more recent and working Spark Scala code that uses
Mahout that they can point me to, I'd appreciate it.

Many thanks!
Hoa



Re: Running Mahout on a Spark cluster

2017-09-21 Thread Trevor Grant
Hi Hoa,

A few things could be happening here, I haven't run across that specific
error.

1) Spark 2.x vs. Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x; however,
you need to build from source (not the binaries). You can do this by
downloading the Mahout source or cloning the repo and building with:
mvn clean install -Pspark-2.1,scala-2.11 -DskipTests

2) Have you set up Spark with Kryo serialization? How you do this depends on
whether you're in the shell/Zeppelin or using spark-submit.
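
For instance, when building the context yourself it's along these lines (a
sketch; if you use Mahout's mahoutSparkContext helper it applies these
settings for you):

import org.apache.spark.SparkConf

// point Spark at Kryo and Mahout's Kryo registrator
val conf = new SparkConf()
  .setAppName("mahout-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")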

However, for both of these cases it shouldn't even have run locally afaik, so
the fact that it did tells me you've probably already gotten this far?

Assuming you've done 1 and 2, can you share some code? I'll see if I can
recreate on my end.

Thanks!

tg

On Thu, Sep 21, 2017 at 9:37 PM, Hoa Nguyen wrote:

> I apologize in advance if this is too much of a newbie question but I'm
> having a hard time running any Mahout example code in a distributed Spark
> cluster. The code runs as advertised when Spark is running locally on one
> machine but the minute I point Spark to a cluster and master URL, I can't
> get it to work, drawing the error: "WARN scheduler.TaskSchedulerImpl:
> Initial job has not accepted any resources; check your cluster UI to ensure
> that workers are registered and have sufficient memory"
>
> I know my Spark cluster is configured and working correctly because I ran
> non-Mahout code and it runs on a distributed cluster fine. What am I doing
> wrong? The only thing I can think of is that my Spark version is too recent
> -- 2.1.1 -- for the Mahout version I'm using -- 0.13.0. Is that it or am I
> doing something else wrong?
>
> Thanks for any advice,
> Hoa
>