Did you use the "old" Azkaban or Azkaban 2.5? It has been completely rewritten.

Not saying it is the best, but I found it way better than Oozie, for example.

Sent from my iPhone

> On 11 Jul 2014, at 09:24, "明风" <mingf...@taobao.com> wrote:
> 
> We used Azkaban for a short time and suffered a lot. In the end we almost 
> rewrote it completely. I really don't recommend it.
> 
> From: Nick Pentreath <nick.pentre...@gmail.com>
> Reply-To: <user@spark.apache.org>
> Date: Friday, 11 July 2014 at 3:18 PM
> To: <user@spark.apache.org>
> Subject: Re: Recommended pipeline automation tool? Oozie?
> 
> You may want to look into the new Azkaban, which, while quite heavyweight, is 
> actually quite pleasant to use once set up.
> 
> You can run Spark jobs (spark-submit) using Azkaban shell commands and pass 
> parameters between jobs. It supports dependencies, simple DAGs and scheduling 
> with retries.
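> 
> To make that concrete, here is a rough sketch of what a pair of Azkaban job 
> files can look like (the "command" job type and the "dependencies" property 
> are standard Azkaban; the class and jar names and the ${input.path} parameter 
> are made-up placeholders, and master/deploy flags are omitted):
> 
>     # etl.job
>     type=command
>     command=spark-submit --class com.example.Etl etl.jar ${input.path}
> 
>     # report.job -- runs only after etl.job succeeds
>     type=command
>     dependencies=etl
>     command=spark-submit --class com.example.Report report.jar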
> 
> I'm digging deeper and it may be worthwhile extending it with a Spark job 
> type...
> 
> It's probably best for mixed Hadoop / Spark clusters...
> —
> Sent from Mailbox
> 
> 
>> On Fri, Jul 11, 2014 at 12:52 AM, Andrei <faithlessfri...@gmail.com> wrote:
>> I have used both Oozie and Luigi, but found them inflexible and still 
>> overcomplicated, especially in the presence of Spark. 
>> 
>> Oozie has a fixed list of building blocks, which is pretty limiting. For 
>> example, you can launch a Hive query, but Impala, Shark/Spark SQL, etc. are 
>> out of scope (of course, you can always write a wrapper as a Java or Shell 
>> action, but does it really need to be so complicated?). Another issue with 
>> Oozie is passing variables between actions. There's an Oozie context that is 
>> suitable for passing key-value pairs (both strings) between actions, but for 
>> more complex objects (say, a FileInputStream that should be closed only at 
>> the last step) you have to do some advanced kung fu. 
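>> 
>> For reference, the shell-action workaround looks roughly like this (a 
>> trimmed sketch, not a complete workflow.xml; the script names and the 
>> output.path key are placeholders). The first action prints key=value lines 
>> to stdout and declares <capture-output/>, and the next action reads them 
>> back via wf:actionData:
>> 
>>     <action name="step1">
>>       <shell xmlns="uri:oozie:shell-action:0.2">
>>         <job-tracker>${jobTracker}</job-tracker>
>>         <name-node>${nameNode}</name-node>
>>         <exec>step1.sh</exec>
>>         <file>step1.sh</file>
>>         <capture-output/>
>>       </shell>
>>       <ok to="step2"/>
>>       <error to="fail"/>
>>     </action>
>> 
>>     <action name="step2">
>>       <shell xmlns="uri:oozie:shell-action:0.2">
>>         <job-tracker>${jobTracker}</job-tracker>
>>         <name-node>${nameNode}</name-node>
>>         <exec>step2.sh</exec>
>>         <argument>${wf:actionData('step1')['output.path']}</argument>
>>         <file>step2.sh</file>
>>       </shell>
>>       <ok to="end"/>
>>       <error to="fail"/>
>>     </action>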
>> 
>> Luigi, on the other hand, has its niche: complicated dataflows with many 
>> tasks that depend on each other. Basically, there are tasks (this is where 
>> you define computations) and targets (something that can "exist": a file on 
>> disk, an entry in ZooKeeper, etc.). You ask Luigi to produce some target, and 
>> it creates a plan for achieving it. Luigi really shines when your workflow 
>> fits this model, but one step away and you are in trouble. For example, 
>> consider a simple pipeline: run an MR job and output temporary data, run 
>> another MR job and output final data, then clean up the temporary data. You 
>> could make a target Clean that depends on target MRJob2, which in turn 
>> depends on MRJob1, right? Not so easy. How do you check that the Clean task 
>> has been achieved? If you just test whether the temporary directory is empty, 
>> you catch both cases: when all tasks are done and when they have not even 
>> been started yet. Luigi does let you specify all three actions (MRJob1, 
>> MRJob2, Clean) in a single run() method, but that ruins the entire idea. 
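>> 
>> A minimal Luigi sketch of the tasks/targets model (the paths are 
>> placeholders and the actual job submission is elided):
>> 
>>     import luigi
>> 
>>     class MRJob1(luigi.Task):
>>         def output(self):
>>             return luigi.LocalTarget("/tmp/pipeline/intermediate")
>>         def run(self):
>>             pass  # run the first job, write to self.output().path
>> 
>>     class MRJob2(luigi.Task):
>>         def requires(self):
>>             return MRJob1()
>>         def output(self):
>>             return luigi.LocalTarget("/data/final")
>>         def run(self):
>>             pass  # read the intermediate data, write the final output
>> 
>>     # The awkward part: a "Clean" task has no natural target -- an empty
>>     # temporary directory looks the same before the pipeline starts and
>>     # after it finishes, so Luigi cannot tell whether Clean is "done".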
>> 
>> And of course, both of these frameworks are optimized for standard MapReduce 
>> jobs, which is probably not what you want on the Spark mailing list :) 
>> 
>> Experience with these frameworks, however, gave me some insights about 
>> typical data pipelines. 
>> 
>> 1. Pipelines are mostly linear. Oozie, Luigi and a number of other frameworks 
>> allow branching, but most pipelines actually consist of moving data from a 
>> source to a destination with possibly some transformations in between (I'd be 
>> glad if somebody shared use cases where you really need branching). 
>> 2. Transactional logic is important. Either everything happens, or nothing 
>> does. Otherwise it's really easy to get into an inconsistent state. 
>> 3. Extensibility is important. You never know what you will need in a week or 
>> two. 
>> 
>> So eventually I decided that it is much easier to create your own pipeline 
>> than to try to adapt your code to existing frameworks. My latest pipeline 
>> incarnation simply consists of a list of steps that are started sequentially. 
>> Each step is a class with at least these methods (a rough sketch follows the 
>> list): 
>> 
>>  * run() - launch this step
>>  * fail() - what to do if step fails
>>  * finalize() - (optional) what to do when all steps are done
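>> 
>> A Python-flavoured sketch of that skeleton (illustrative only; the names and 
>> the dict-based context are not meant to match any particular implementation):
>> 
>>     class Step:
>>         def run(self, context):        # launch this step
>>             raise NotImplementedError
>>         def fail(self, context, err):  # what to do if the step fails
>>             pass
>>         def finalize(self, context):   # optional: runs after all steps
>>             pass
>> 
>>     def run_pipeline(steps):
>>         context = {}                   # shared key/value context
>>         try:
>>             for step in steps:
>>                 try:
>>                     step.run(context)
>>                 except Exception as err:
>>                     step.fail(context, err)
>>                     raise
>>         finally:
>>             for step in steps:
>>                 step.finalize(context)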
>> 
>> For example, if you want to add the ability to run Spark jobs, you just 
>> create a SparkStep and configure it with the required code. If you want a 
>> Hive query, just create a HiveStep and configure it with Hive connection 
>> settings. I use a YAML file to configure the steps and a Context (basically 
>> a Map[String, Any]) to pass variables between them. I also use a configurable 
>> Reporter, available to all steps, to report progress. 
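>> 
>> The YAML config for such a pipeline can be as simple as a list of step 
>> definitions (all keys and values below are hypothetical, just to show the 
>> shape):
>> 
>>     steps:
>>       - type: SparkStep
>>         class: com.example.EtlJob
>>         master: spark://master:7077
>>       - type: HiveStep
>>         query: "INSERT OVERWRITE TABLE report SELECT ..."
>>         connection: jdbc:hive2://hive-host:10000/default
>>     reporter:
>>       type: email
>>       to: team@example.com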
>> 
>> Hopefully this will give you some insight into choosing the best pipeline 
>> for your specific case. 
>> 
>> 
>> 
>>> On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown <p...@mult.ifario.us> wrote:
>>> 
>>> We use Luigi for this purpose.  (Our pipelines are typically on AWS (no 
>>> EMR), backed by S3, and use combinations of Python jobs, non-Spark 
>>> Java/Scala, and Spark.  We run Spark jobs by connecting drivers/clients to 
>>> the master, and those are what gets invoked from Luigi.)
>>> 
>>> —
>>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>> 
>>> 
>>>> On Thu, Jul 10, 2014 at 10:20 AM, k.tham <kevins...@gmail.com> wrote:
>>>> I'm just wondering what the general recommendation is for data pipeline
>>>> automation.
>>>> 
>>>> Say, I want to run Spark Job A, then B, then invoke script C, then do D, 
>>>> and if D fails, do E, and if Job A fails, send email F, etc...
>>>> 
>>>> It looks like Oozie might be the best choice. But I'd like some
>>>> advice/suggestions.
>>>> 
>>>> Thanks!
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context: 
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
