@Pat, regarding your question on JIRA, this is Dmitriy's email thread about running Mahout on Spark.
Sent from my iPhone

> On Apr 11, 2014, at 7:52 PM, "Andrew Musselman" <[email protected]> wrote:
>
> We've used Mesos at a client to run both Hadoop and Spark jobs in the same
> setup. It's been a good experience so far.
>
> I haven't used YARN on any projects yet, but it looks like you need to
> rebuild Spark to run on it currently:
> https://spark.apache.org/docs/0.9.0/running-on-yarn.html
>
> Why not officially support Hadoop v2 and recommend YARN for that, as well
> as supporting Mesos?
>
> Another question is how long we will support Hadoop v1.
>
>> On Fri, Apr 11, 2014 at 1:43 PM, Ted Dunning <[email protected]> wrote:
>>
>> I am pretty sure that Mesos supports both MapReduce and Spark.
>>
>> In general, though, the biggest consideration in choosing a resource
>> manager is complying with local standards and traditions.
>>
>> For playing around, standalone Spark is fine.
>>
>> On Thu, Apr 10, 2014 at 4:29 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> On Thu, Apr 10, 2014 at 4:20 PM, Pat Ferrel <[email protected]> wrote:
>>>
>>>> Hmm, that leaves Spark and Hadoop to manage tasks independently. Not
>>>> ideal if you are running both Hadoop and Spark jobs simultaneously.
>>>
>>> I think the only resource manager that semi-officially supports both
>>> MapReduce and Spark is YARN. This sounds neat in theory, but in practice
>>> I think one discovers too many hoops to jump through. I am also
>>> inherently dubious about the quality and performance of YARN compared to
>>> the others.
>>>
>>>> If you have a single-user cluster or are running jobs in a pipeline, I
>>>> suppose you don't need Mesos.
>>>>
>>>> On Apr 10, 2014, at 1:00 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>
>>>> On Thu, Apr 10, 2014 at 12:00 PM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>>> What is the recommended Spark setup?
>>>>
>>>> Check out their docs. We don't have any special instructions for Mahout.
>>>>
>>>> The main point behind the 0.9.0 release is that it now supports master
>>>> HA through ZooKeeper, so for that reason alone you probably don't want
>>>> to use Mesos.
>>>>
>>>> You may want to use Mesos to have pre-allocated workers per Spark
>>>> session (the so-called "coarse-grained" mode). If you fire a lot of
>>>> short-running queries (1 second or less), this is a significant win in
>>>> QPS and response time. (Fine-grained mode will add about 3 seconds to
>>>> pipeline time while it starts all the workers lazily.)
>>>>
>>>> In our case we are dealing with stuff that runs over 3 seconds for the
>>>> most part, so assuming 0.9.0 HA is stable enough (which I haven't tried
>>>> yet), there's no reason for us to go Mesos; multi-master standalone with
>>>> ZooKeeper is good enough.
>>>>
>>>>> I imagine most of us will have HDFS configured (with either local files
>>>>> or an actual cluster).
>>>>
>>>> The Hadoop DFS API is pretty much the only persistence API supported by
>>>> the Mahout Spark bindings at this point. So yes, you would want an
>>>> HDFS-only cluster; whether it runs 1.x or 2 doesn't matter. I use CDH 4
>>>> distros.
>>>>
>>>>> Since most of Mahout is recommended to be run on Hadoop 1.x, should we
>>>>> use Mesos? https://github.com/mesos/hadoop
>>>>>
>>>>> This would mean we'd need at least Hadoop 1.2.1 (in the Mesos and
>>>>> current Mahout POMs). We'd use Mesos to manage Hadoop and Spark jobs,
>>>>> but HDFS would be controlled separately by Hadoop itself.
>>>>
>>>> I think I addressed this. No, we are not bound by the MR part of Mahout,
>>>> since Spark runs on whatever. Like I said, with the 0.9.0 + Mahout combo
>>>> I would forego Mesos -- unless it turns out meaningfully faster or more
>>>> stable.
>>>>
>>>>> Is this about right? Is there a setup doc I missed?
>>>>
>>>> I don't think one is needed.
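
To make the two deployment options Dmitriy compares a bit more concrete for
Pat, here is a minimal application-side sketch against the Spark 0.9.0 API.
Host names and the ZooKeeper quorum are made up; the properties
(spark.deploy.recoveryMode, spark.deploy.zookeeper.url, spark.mesos.coarse)
are the ones the 0.9.0 docs describe:

    import org.apache.spark.{SparkConf, SparkContext}

    // Multi-master standalone with ZooKeeper HA: list every master in the
    // URL; the driver fails over to whichever master is the current leader.
    // (The masters themselves are pointed at ZooKeeper via spark-env.sh:
    //   SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
    //     -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181")
    val conf = new SparkConf()
      .setMaster("spark://master1:7077,master2:7077")
      .setAppName("mahout-spark-test")
    val sc = new SparkContext(conf)

    // Alternatively, coarse-grained Mesos: pre-allocates one long-running
    // task per worker instead of launching a Mesos task per Spark task,
    // which is the QPS win Dmitriy mentions for short queries.
    val mesosConf = new SparkConf()
      .setMaster("mesos://mesos-master:5050")
      .setAppName("mahout-spark-test")
      .set("spark.mesos.coarse", "true")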

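And since the Hadoop DFS API is the only persistence route the Spark bindings
support, a read-compute-write round trip would look roughly like the
following. Treat it as a sketch, not gospel: the entry points
(mahoutSparkContext, drmDfsRead, dfsWrite) are my assumptions about where the
bindings API is headed, the paths are hypothetical, and any URI the Hadoop
FileSystem layer resolves (hdfs://, or file:// for local testing) should work:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    // A DistributedContext is implicit for all DRM I/O and operators.
    implicit val ctx = mahoutSparkContext(
      masterUrl = "spark://master1:7077,master2:7077",
      appName   = "mahout-dfs-roundtrip")

    // Read a distributed row matrix persisted through the Hadoop DFS API.
    // Path is hypothetical.
    val drmA = drmDfsRead("hdfs://namenode:8020/user/pat/A")

    // Logical plan only; nothing runs until the result is written.
    val drmAtA = drmA.t %*% drmA

    // Materializes the plan and persists it back via the Hadoop DFS API.
    drmAtA.dfsWrite("hdfs://namenode:8020/user/pat/AtA")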