PS. I see three ways to run (embed) this work.

(1) The ideal way is to write DSL scripts and run them through the Mahout spark shell. Once that is finished, we just need to download Mahout, compile it, and point MAHOUT_HOME to it. That's it. Then one can launch things either interactively through the shell, or just by passing a script to it, just like it happens with R. Unfortunately this is work in progress; move along, nothing to see here...

(2) Write a quick test within the mahout project, recompile mahout, and launch your code. In this case mahout will take care of shipping the mahout jars to the backend automatically, and since your code is included in them, nothing else is required.

(3) Create a standalone project that depends on the mahout-spark artifact. This works pretty much like (2), except that if one writes closures to be used in any code (e.g. mapBlock or custom Spark pipeline continuations), then the closure code must also be shipped to the backend. This becomes a bit more hairy: you need to compile your application and add its jars to the call that creates the Mahout context, otherwise an attempt to run one's code on the backend may generate ClassNotFounds.

(4) What about CLI? -- So what about it? Option (1) should supersede the need for any CLI. As it stands, there is no CLI support, nor are there any plans to support one at this point.

-d

Basically, if you are writing a 3rd-party application to test, then you just need the mahout source compiled, with MAHOUT_HOME pointing to it. One's application should take care of its own classpath, which is done automatically if one uses maven. If you import the maven project into Idea, then you can use Idea's launcher to take care of the client classpath for you. The backend classpath is taken care of by mahout; but you still need to ship your application jars to the spark session, for which there's an "extra jars" parameter of the mahoutSparkContext call.
On Thu, Apr 10, 2014 at 1:00 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> What is the recommended Spark setup?
>
> Check out their docs. We don't have any special instructions for mahout.
>
> The main point behind the 0.9.0 release is that it now supports master HA
> thru zookeeper, so for that reason alone you probably don't want to use
> mesos.
>
> You may want to use mesos to have pre-allocated workers per spark session
> (so-called "coarse grained" mode). If you shoot a lot of short-running
> queries (1 sec or less), this is a significant win in QPS and response
> time. (Fine grained mode will add about 3 seconds to pipeline time to
> start all the workers lazily.)
>
> In our case we are dealing with stuff that runs over 3 seconds for the
> most part, so assuming 0.9.0 HA is stable enough (which I haven't tried
> yet), there's no reason for us to go mesos; multi-master standalone with
> zookeeper is good enough.
>
>> I imagine most of us will have HDFS configured (with either local files
>> or an actual cluster).
>
> The Hadoop DFS API is pretty much the only persistence API supported by
> Mahout Spark Bindings at this point. So yes, you would want to have an
> hdfs-only cluster running; 1.x or 2 doesn't matter. I use cdh 4 distros.
>
>> Since most of Mahout is recommended to be run on Hadoop 1.x, should we
>> use Mesos? https://github.com/mesos/hadoop
>>
>> This would mean we'd need to have at least Hadoop 1.2.1 (in mesos and the
>> current mahout pom). We'd use Mesos to manage hadoop and spark jobs but
>> HDFS would be controlled separately by hadoop itself.
>
> I think I addressed this. No, we are not bound by the MR part of mahout,
> since Spark runs on whatever. Like I said, with the 0.9.0 + Mahout combo
> I would forgo mesos -- unless it turns out meaningfully faster or more
> stable.
>
>> Is this about right? Is there a setup doc I missed?
>
> I don't think one is needed.