PS. I see three ways to run (embed) this work.

(1) The ideal way is to write DSL scripts and run them through the Mahout spark shell. Once that is finished, we just need to download Mahout, compile it, and point MAHOUT_HOME to it. That's it. Then one can launch things either interactively through the shell, or just by passing a script to it, just like it happens with R. Unfortunately this is work in progress; move along, nothing to see here...

(2) Write a quick test within the mahout project, recompile mahout, and launch your code. In this case mahout will take care of shipping the mahout jars to the backend automatically, and since your code is included in them, nothing else is required.

(3) Create a standalone project that depends on the mahout-spark artifact. This works pretty much like (2), except that if one writes closures to be used in any code (e.g. mapBlock or custom Spark pipeline continuations), then the closure code must also be shipped to the backend. This becomes a bit more hairy: you need to compile your application and add its jars to the call that creates the Mahout context, otherwise an attempt to run one's code on the backend may generate ClassNotFounds.

(4) What about CLI? -- So what about it? Option (1) should supersede the need for any CLI. As it stands, there is no CLI support, nor are there any plans to support one at this point.

-d

Basically, if you are writing a 3rd-party application to test, then you just need the mahout source compiled, with MAHOUT_HOME pointing to it. One's application should take care of its own classpath, which is done automatically if one uses maven. If you import the maven project into Idea, then you can use Idea's launcher to take care of the client classpath for you. The backend classpath is taken care of by mahout; but you still need to ship your application jars to the spark session, for which there's an "extra jars" parameter of the mahoutSparkContext call.
On Thu, Apr 10, 2014 at 1:00 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> What is the recommended Spark setup?
>
> Check out their docs. We don't have any special instructions for mahout.
>
> The main point behind the 0.9.0 release is that it now supports master HA
> thru zookeeper, so for that reason alone you probably don't want to use
> mesos.
>
> You may want to use mesos to have pre-allocated workers per spark session
> (so-called "coarse grained" mode). If you shoot a lot of short-running
> queries (1 sec or less), this is a significant win in QPS and response
> time. (Fine grained mode will add about 3 seconds to pipeline time to
> start all the workers lazily.)
>
> In our case we are dealing with stuff that runs over 3 seconds for the
> most part, so assuming 0.9.0 HA is stable enough (which I haven't tried
> yet), there's no reason for us to go mesos; multi-master standalone with
> zookeeper is good enough.
>
>> I imagine most of us will have HDFS configured (with either local files
>> or an actual cluster).
>
> The Hadoop DFS API is pretty much the only persistence API supported by
> Mahout Spark Bindings at this point. So yes, you would want to have an
> hdfs-only cluster running; 1.x or 2 doesn't matter. I use cdh 4 distros.
>
>> Since most of Mahout is recommended to be run on Hadoop 1.x, should we
>> use Mesos? https://github.com/mesos/hadoop
>>
>> This would mean we'd need to have at least Hadoop 1.2.1 (in mesos and the
>> current mahout pom). We'd use Mesos to manage hadoop and spark jobs but
>> HDFS would be controlled separately by hadoop itself.
>
> I think I addressed this. No, we are not bound by the MR part of mahout,
> since Spark runs on whatever. Like I said, with the 0.9.0 + Mahout combo
> I would forgo mesos -- unless it turns out meaningfully faster or more
> stable.
>
>> Is this about right? Is there a setup doc I missed?
>
> I don't think one is needed.