Just FYI, on (1): I have a fairly good start, but I am in the midst of figuring out and resolving some classpath issues around getting the unit tests to work in the context of the shell.
> Date: Thu, 10 Apr 2014 13:28:11 -0700
> Subject: Re: Spark setup
> From: [email protected]
> To: [email protected]
>
> PS.
>
> I see 3 ways to run (embed) this work.
>
> (1) The ideal way is to write dsl scripts and run them thru the Mahout spark
> shell. Once it is finished, we will just need to download Mahout, compile
> it, and point MAHOUT_HOME to it. That's it. Then one can launch things
> either interactively thru the shell, or just by passing a script to it,
> just like it happens with R.
>
> Unfortunately this is work-in-progress; move along, nothing to see here...
>
> (2) Write a quick test within the mahout project, recompile mahout, and
> launch your code. In this case mahout will take care of shipping the mahout
> jars to the backend automatically, and since your code is included in them,
> nothing else is required.
>
> (3) Create a standalone project that depends on the mahout-spark artifact.
> In this case it works pretty much like (2), except that if one writes
> closures to be used in any code (e.g. mapBlock or custom spark pipeline
> continuations), then the closure code must also be shipped to the backend.
> This becomes a bit more hairy -- you need to compile your application and
> add its jars to the call that creates the Mahout context, otherwise
> attempting to run one's code on the backend may generate ClassNotFounds.
>
> (4) What about a CLI?... -- So what about it? Option (1) should supersede
> the need for any CLI. As it stands, there is no CLI support, nor are there
> any plans to add it at this point.
>
> -d
>
> Basically, if you're writing a 3rd-party application to test, then you just
> need the mahout source compiled, with MAHOUT_HOME pointing to it. One's
> application should take care of its own classpath, which is done
> automatically if one uses maven. If you import the maven project into Idea,
> then you can use Idea's launcher to take care of the client classpath for
> you. The backend classpath is taken care of by mahout; but you still need
> to ship your application jars to the spark session, for which there's an
> "extra jars" parameter of the mahoutSparkContext call.
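[Editor's note: to make options (1) and (3) above concrete, here is a minimal sketch of a standalone driver in the Mahout Scala DSL. It creates a Mahout Spark context, ships the application jar via the "extra jars" parameter mentioned above, and runs a small pipeline including a mapBlock closure. The package names and mahoutSparkContext parameters follow the spark-bindings module; the master URL, app name, and jar path are illustrative assumptions, not prescribed values.]

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    object StandaloneDriver extends App {

      // Create the Mahout context. customJars ships this application's jar
      // so that closures (like the mapBlock below) can be loaded on the
      // workers; the jar path and master URL are placeholders.
      implicit val ctx = mahoutSparkContext(
        masterUrl = "spark://master:7077",
        appName = "mahout-dsl-example",
        customJars = Seq("target/my-app-1.0.jar"))

      // Distribute a small in-core matrix as a DRM (distributed row matrix).
      val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)

      // R-like DSL: compute A' %*% A on the cluster, collect in-core.
      val ata = (drmA.t %*% drmA).collect

      // A custom closure: add 1.0 to every element, block by block. This is
      // exactly the kind of code that must reach the backend via customJars,
      // or the workers will throw ClassNotFoundException.
      val drmB = drmA.mapBlock() { case (keys, block) =>
        keys -> (block += 1.0)
      }

      println(ata)
      ctx.close()
    }

[With option (2), the same pipeline body would live in a test inside the mahout project itself, and no customJars argument would be needed, since the code ships with the mahout jars.]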
> On Thu, Apr 10, 2014 at 1:00 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel <[email protected]> wrote:
> >
> >> What is the recommended Spark setup?
> >
> > Check out their docs. We don't have any special instructions for mahout.
> >
> > The main point behind the 0.9.0 release is that it now supports master HA
> > thru zookeeper, so for that reason alone you probably don't want to use
> > mesos.
> >
> > You may want to use mesos to have pre-allocated workers per spark session
> > (so-called "coarse-grained" mode). If you fire a lot of short-running
> > queries (1 sec or less), this is a significant win in QPS and response
> > time (fine-grained mode adds about 3 seconds of pipeline time to start
> > all the workers lazily).
> >
> > In our case we are dealing with stuff that runs over 3 seconds for the
> > most part, so assuming 0.9.0 HA is stable enough (which I haven't tried
> > yet), there's no reason for us to go with mesos; multi-master standalone
> > with zookeeper is good enough.
> >
> >> I imagine most of us will have HDFS configured (with either local files
> >> or an actual cluster).
> >
> > The Hadoop DFS API is pretty much the only persistence API supported by
> > Mahout Spark Bindings at this point. So yes, you would want to have an
> > HDFS-only cluster; whether it runs 1.x or 2.x doesn't matter. I use
> > CDH 4 distros.
> >
> >> Since most of Mahout is recommended to run on Hadoop 1.x, should we use
> >> Mesos? https://github.com/mesos/hadoop
> >>
> >> This would mean we'd need at least Hadoop 1.2.1 (in mesos and the
> >> current mahout pom). We'd use Mesos to manage hadoop and spark jobs,
> >> but HDFS would be controlled separately by hadoop itself.
> >
> > I think I addressed this. No, we are not bound by the MR part of mahout,
> > since Spark runs on whatever. Like I said, with the 0.9.0 + Mahout combo
> > I would forgo mesos -- unless it turns out to be meaningfully faster or
> > more stable.
> >
> >> Is this about right? Is there a setup doc I missed?
> >
> > I don't think one is needed.
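[Editor's note: since the thread states that the Hadoop DFS API is the only persistence route the Spark bindings support, here is a short sketch of round-tripping a DRM through HDFS. The drmDfsRead and dfsWrite names follow the spark-bindings module as of this writing, but treat them, the paths, and the master URL as assumptions for illustration.]

    import org.apache.mahout.math.drm._
    import org.apache.mahout.sparkbindings._

    object DfsPersistenceExample extends App {

      // Local context just for illustration; any Spark master URL works.
      implicit val ctx = mahoutSparkContext(
        masterUrl = "local[2]",
        appName = "drm-persistence")

      // Read a DRM previously written to HDFS; the path is a placeholder
      // for your namenode and data layout.
      val drmA = drmDfsRead("hdfs://namenode:8020/mahout/A")

      // ... transform drmA with the DSL here ...

      // Persist the result back to HDFS via the same DFS API.
      drmA.dfsWrite("hdfs://namenode:8020/mahout/A-out")

      ctx.close()
    }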
