Re: Recommended pipeline automation tool? Oozie?

2014-07-15 Thread Dean Wampler
If you're already using Scala for Spark programming and you hate Oozie XML
as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie:
https://github.com/klout/scoozie
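
To give a feel for the difference, here is a tiny, purely illustrative Scala
sketch (not the actual Scoozie API; all of the names below are made up) of
describing a workflow as plain Scala values instead of hand-written XML:

// Purely illustrative, NOT the actual Scoozie API. It only shows the appeal
// of modelling a workflow as Scala values that a DSL can translate into
// Oozie's workflow.xml, instead of writing the XML by hand.
sealed trait Node
case class SparkJob(name: String, mainClass: String, args: Seq[String] = Nil) extends Node
case class HiveQuery(name: String, script: String) extends Node

// Each node maps to the nodes it depends on; error transitions could be
// modelled the same way.
case class Workflow(name: String, dependsOn: Map[Node, Seq[Node]])

object DailyEtl {
  val jobA = SparkJob("job-a", "com.example.JobA")
  val jobB = SparkJob("job-b", "com.example.JobB")
  val load = HiveQuery("load-results", "load_results.hql")

  // job-b runs after job-a, and the Hive load runs after job-b.
  val workflow = Workflow("daily-etl", Map(jobB -> Seq(jobA), load -> Seq(jobB)))
}

A DSL like Scoozie can then generate the corresponding Oozie workflow XML from
definitions of this sort.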




-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Li Pu
I like the idea of using Scala to drive the workflow. Spark already comes
with a scheduler, so why not write a plugin to schedule other types of tasks
(copying files, sending email, etc.)? Scala could handle any logic required
by the pipeline, and passing objects (including RDDs) between tasks is also
easier. I don't know whether this is an overuse of the Spark scheduler, but
it sounds like a good tool. The only issue would be releasing resources that
are no longer needed after the intermediate steps.
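
As a minimal sketch of the idea (the paths, the sendEmail helper and the job
logic below are made up, and this is plain sequential orchestration in one
driver rather than a real scheduler plugin), a single driver program can act
as the workflow, so intermediate RDDs stay in memory between tasks and are
unpersisted once they are no longer needed:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: one driver acts as the "workflow", so intermediate RDDs are
// handed from task to task in memory rather than through HDFS, and non-Spark
// tasks are just ordinary Scala calls in between.
object DriverAsWorkflow {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-driver"))
    try {
      // Task 1: a Spark job whose result the next task reuses directly.
      val cleaned: RDD[String] = sc.textFile("hdfs:///data/input")
        .filter(_.nonEmpty)
        .persist(StorageLevel.MEMORY_AND_DISK)

      // Task 2: another Spark job consuming the cached RDD, no HDFS round trip.
      val counts = cleaned.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs:///data/output")

      // Release resources that intermediate steps no longer need.
      cleaned.unpersist()

      // Task 3: a non-Spark task (copy a file, send an email, ...).
      sendEmail("ops@example.com", "pipeline finished")
    } finally {
      sc.stop()
    }
  }

  // Stand-in for a real notification mechanism.
  def sendEmail(to: String, subject: String): Unit =
    println(s"EMAIL to=$to subject=$subject")
}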



-- 
Li
@vrilleup


Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Wei Tan
Just curious: how about using Scala to drive the workflow? I guess if you
use other tools (Oozie, etc.) you lose the advantage of reading from an RDD --
you have to read from HDFS.

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan







Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Nick Pentreath
Did you use the "old" Azkaban or Azkaban 2.5? It has been completely rewritten.

I'm not saying it is the best, but I found it much better than Oozie, for example.

Sent from my iPhone


Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread 明风
We used Azkaban for a short time and suffered a lot. In the end we almost
rewrote it completely. I really don't recommend it.


Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Nick Pentreath
You may look into the new Azkaban, which, while quite heavyweight, is
actually quite pleasant to use once set up.

You can run Spark jobs (spark-submit) using Azkaban shell commands and pass
parameters between jobs. It supports dependencies, simple DAGs and scheduling
with retries.

I'm digging deeper, and it may be worthwhile extending it with a Spark job
type...

It's probably best for mixed Hadoop / Spark clusters...
—
Sent from Mailbox


Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Andrei
I have used both Oozie and Luigi, but found them inflexible and still
overcomplicated, especially in the presence of Spark.

Oozie has a fixed list of building blocks, which is pretty limiting. For
example, you can launch a Hive query, but Impala, Shark/Spark SQL, etc. are
out of scope (of course, you can always write a wrapper as a Java or Shell
action, but does it really need to be so complicated?). Another issue with
Oozie is passing variables between actions. There's an Oozie context that is
suitable for passing key-value pairs (both strings) between actions, but for
more complex objects (say, a FileInputStream that should be closed only at
the last step) you have to do some advanced kung fu.

Luigi, on the other hand, has its niche: complicated dataflows with many
tasks that depend on each other. Basically, there are tasks (this is where
you define computations) and targets (something that can "exist" - a file on
disk, an entry in ZooKeeper, etc.). You ask Luigi to produce some target, and
it creates a plan for achieving it. Luigi really shines when your workflow
fits this model, but one step away and you are in trouble. For example,
consider a simple pipeline: run an MR job that outputs temporary data, run
another MR job that outputs the final data, then clean up the temporary data.
You can make a Clean target that depends on the MRJob2 target, which in turn
depends on MRJob1, right? Not so easy. How do you check that the Clean task
has been achieved? If you just test whether the temporary directory is empty,
you catch both cases - when all tasks are done and when they have not even
started yet. Luigi does allow you to specify all three actions - MRJob1,
MRJob2, Clean - in a single run() method, but that ruins the entire idea.

And of course, both of these frameworks are optimized for standard
MapReduce jobs, which is probably not what you want on the Spark mailing
list :)

Experience with these frameworks, however, gave me some insights into
typical data pipelines.

1. Pipelines are mostly linear. Oozie, Luigi and a number of other frameworks
allow branching, but most pipelines actually consist of moving data from a
source to a destination, possibly with some transformations in between (I'd
be glad to hear about use cases where you really need branching).
2. Transactional logic is important: either everything happens, or nothing
does. Otherwise it's really easy to end up in an inconsistent state.
3. Extensibility is important. You never know what you will need in a week or
two.

So eventually I decided that it is much easier to create your own pipeline
than to try to adapt your code to existing frameworks. My latest pipeline
incarnation simply consists of a list of steps that are run sequentially.
Each step is a class with at least these methods:

 * run() - launch this step
 * fail() - what to do if step fails
 * finalize() - (optional) what to do when all steps are done

For example, if you want to add the ability to run Spark jobs, you just
create a SparkStep and configure it with the required code. If you want a
Hive query, you just create a HiveStep and configure it with the Hive
connection settings. I use a YAML file to configure the steps and a Context
(basically, a Map[String, Any]) to pass variables between them. I also use a
configurable Reporter, available to all steps, to report progress.
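
A rough sketch of what such a step-based pipeline might look like in Scala
(illustrative only, not my actual code: the names are placeholders, the YAML
loading is omitted, and finalize() is renamed finalizeStep() to avoid
clashing with java.lang.Object#finalize):

import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Illustrative only: placeholder names, no YAML loading.
object MiniPipeline {

  type Context = mutable.Map[String, Any]          // variables shared between steps

  trait Reporter { def report(msg: String): Unit }

  trait Step {
    def name: String
    def run(ctx: Context, reporter: Reporter): Unit  // launch this step
    def fail(ctx: Context): Unit = ()                // undo/clean up if something failed
    def finalizeStep(ctx: Context): Unit = ()        // called once, after all steps succeed
  }

  // Example: a step that runs some Spark code supplied via configuration.
  // (A real SparkStep would be handed a SparkContext, job class, arguments, ...)
  class SparkStep(val name: String, job: Context => Unit) extends Step {
    def run(ctx: Context, reporter: Reporter): Unit = {
      reporter.report(s"running Spark step '$name'")
      job(ctx)
    }
  }

  // Run steps sequentially. If one fails, call fail() on it and on every step
  // that already completed (most recent first), which gives the "everything
  // or nothing" behaviour. If all succeed, let every step finalize.
  def runAll(steps: Seq[Step], ctx: Context, reporter: Reporter): Boolean = {
    var completed: List[Step] = Nil
    var ok = true
    val it = steps.iterator
    while (ok && it.hasNext) {
      val step = it.next()
      Try(step.run(ctx, reporter)) match {
        case Success(_) => completed = step :: completed
        case Failure(e) =>
          reporter.report(s"step '${step.name}' failed: ${e.getMessage}")
          (step :: completed).foreach(s => Try(s.fail(ctx)))
          ok = false
      }
    }
    if (ok) steps.foreach(s => Try(s.finalizeStep(ctx)))
    ok
  }
}

Running a pipeline is then just
MiniPipeline.runAll(steps, mutable.Map.empty[String, Any], reporter).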

Hopefully, this will give you some insight into the best pipeline for your
specific case.





Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Paul Brown
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR),
backed by S3, and use combinations of Python jobs, non-Spark Java/Scala, and
Spark. We run Spark jobs by connecting drivers/clients to the master, and
those drivers are what Luigi invokes.)
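
For the Spark side of that setup, a minimal sketch (the master URL, S3 path
and app name are placeholder assumptions): a driver that connects to the
master and exits non-zero on failure, so whatever invokes it (a Luigi task,
in our case) can tell whether it succeeded.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch with assumed details: the orchestrator shells out to this driver
// and uses the process exit code to decide success or failure.
object LuigiInvokedJob {
  def main(args: Array[String]): Unit = {
    val master = if (args.nonEmpty) args(0) else "spark://master:7077"
    val sc = new SparkContext(
      new SparkConf().setMaster(master).setAppName("luigi-invoked-job"))
    val exitCode =
      try {
        val n = sc.textFile("s3n://my-bucket/input/*").count()
        println(s"processed $n records")
        0
      } catch {
        case e: Exception =>
          System.err.println(s"job failed: ${e.getMessage}")
          1
      } finally {
        sc.stop()
      }
    sys.exit(exitCode)  // non-zero exit lets the caller mark the task as failed
  }
}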

—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/


On Thu, Jul 10, 2014 at 10:20 AM, k.tham  wrote:

> I'm just wondering what's the general recommendation for data pipeline
> automation.
>
> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
> and
> if D fails, do E, and if Job A fails, send email F, etc...
>
> It looks like Oozie might be the best choice. But I'd like some
> advice/suggestions.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>