I like the idea of using Scala to drive the workflow. Spark already comes
with a scheduler, so why not write a plugin to schedule other types of
tasks (copy a file, send an email, etc.)? Scala could handle any logic the
pipeline requires, and passing objects (including RDDs) between tasks is
also easier. I don't know whether this overuses the Spark scheduler, but it
sounds like a good tool. The only issue would be releasing resources that
are no longer needed after intermediate steps. There's a rough sketch of
the idea below.
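
A minimal sketch of what such a driver could look like, assuming plain
Scala plus the standard Spark API. The copyFile/sendEmail helpers, the
paths, and the addresses are hypothetical placeholders, not a real library:

import scala.util.{Failure, Success, Try}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// A single Scala driver that sequences Spark jobs and non-Spark tasks.
object PipelineDriver {

  // Hypothetical non-Spark tasks; in practice these might shell out,
  // use java.nio.file.Files.copy, or call a mail library.
  def copyFile(src: String, dst: String): Unit =
    println(s"copying $src -> $dst")

  def sendEmail(to: String, subject: String): Unit =
    println(s"emailing $to: $subject")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scala-pipeline"))

    // Job A: build an intermediate RDD and cache it, so the next job
    // reads it from memory instead of round-tripping through HDFS.
    val jobA: Try[RDD[String]] = Try {
      val rdd = sc.textFile("hdfs:///input/data").filter(_.nonEmpty).cache()
      rdd.count() // force materialization
      rdd
    }

    jobA match {
      case Failure(e) =>
        sendEmail("ops@example.com", s"Job A failed: ${e.getMessage}")
      case Success(rddA) =>
        // Job B consumes Job A's in-memory output directly.
        val jobB = Try(rddA.map(_.toUpperCase).saveAsTextFile("hdfs:///out/b"))
        // Release the intermediate RDD once no later step needs it.
        rddA.unpersist()
        jobB match {
          case Success(_) => copyFile("hdfs:///out/b", "/archive/b")
          case Failure(e) =>
            sendEmail("ops@example.com", s"Job B failed: ${e.getMessage}")
        }
    }
    sc.stop()
  }
}

Error handling stays ordinary Scala (Try and pattern matching), so the
"if D fails, do E; if A fails, send email F" branching from the original
question is just control flow in the driver.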

On Fri, Jul 11, 2014 at 12:05 PM, Wei Tan <w...@us.ibm.com> wrote:

> Just curious: how about using Scala to drive the workflow? I guess if you
> use other tools (Oozie, etc.) you lose the advantage of reading from an
> RDD: you have to read from HDFS instead.
>
> Best regards,
> Wei
>
> ---------------------------------
> Wei Tan, PhD
> Research Staff Member
> IBM T. J. Watson Research Center
> http://researcher.ibm.com/person/us-wtan
>
>
>
> From:        "k.tham" <kevins...@gmail.com>
> To:        u...@spark.incubator.apache.org,
> Date:        07/10/2014 01:20 PM
> Subject:        Recommended pipeline automation tool? Oozie?
> ------------------------------
>
>
>
> I'm just wondering what the general recommendation is for data pipeline
> automation.
>
> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
> and
> if D fails, do E, and if Job A fails, send email F, etc...
>
> It looks like Oozie might be the best choice, but I'd like some
> advice/suggestions.
>
> Thanks!


-- 
Li
@vrilleup
