I like the idea of using Scala to drive the workflow. Spark already comes with a scheduler, so why not write a plugin to schedule other types of tasks (copy file, send email, etc.)? Scala could handle any logic required by the pipeline, and passing objects (including RDDs) between tasks is also easier. I don't know if this is an overuse of the Spark scheduler, but it sounds like a good tool. The only issue would be releasing resources that are no longer needed after intermediate steps.
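As a rough illustration of what "Scala driving the workflow" could look like for the original question (run Job A, then B, then script C, then D; fall back to E if D fails; email on A's failure), here is a minimal sketch using plain `scala.util.Try` for control flow. All job bodies are hypothetical stubs; in practice A and B would be Spark jobs and the intermediate values could be RDDs rather than strings.

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical pipeline driver: names and stub bodies are illustrative only.
object Pipeline {
  def runJobA(): Try[String]          = Try { "resultA" }   // e.g. a Spark job
  def runJobB(in: String): Try[String] = Try { in + "-B" }  // consumes A's output directly
  def runScriptC(in: String): Try[String] = Try { in + "-C" }
  def runD(in: String): Try[String]   = Try { in + "-D" }
  def runE(in: String): String        = in + "-E"           // fallback if D fails
  def sendEmailF(err: Throwable): String = s"emailed failure: ${err.getMessage}"

  def run(): String =
    runJobA() match {
      case Failure(err) => sendEmailF(err)          // Job A failed: send email F
      case Success(a) =>
        val chained = for {
          b <- runJobB(a)                           // then B
          c <- runScriptC(b)                        // then script C
        } yield runD(c).getOrElse(runE(c))          // then D, or E if D fails
        chained.get
    }
}
```

The point of the sketch is that ordinary Scala combinators (`for`-comprehensions over `Try`, pattern matching) express the branching logic that would otherwise need an external tool like Oozie, while intermediate results stay in memory instead of round-tripping through HDFS.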
On Fri, Jul 11, 2014 at 12:05 PM, Wei Tan <w...@us.ibm.com> wrote:

> Just curious: how about using Scala to drive the workflow? I guess if you
> use other tools (Oozie, etc.) you lose the advantage of reading from RDD --
> you have to read from HDFS.
>
> Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>
> From: "k.tham" <kevins...@gmail.com>
> To: u...@spark.incubator.apache.org
> Date: 07/10/2014 01:20 PM
> Subject: Recommended pipeline automation tool? Oozie?
> ------------------------------
>
> I'm just wondering what's the general recommendation for data pipeline
> automation.
>
> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
> and if D fails, do E, and if Job A fails, send email F, etc...
>
> It looks like Oozie might be the best choice, but I'd like some
> advice/suggestions.
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Li
@vrilleup