We used Azkaban for a short time and ran into a lot of problems; in the end we
had to rewrite almost all of it. I really wouldn't recommend it.

From:  Nick Pentreath <nick.pentre...@gmail.com>
Reply-To:  <user@spark.apache.org>
Date:  Friday, July 11, 2014 at 3:18 PM
To:  <user@spark.apache.org>
Subject:  Re: Recommended pipeline automation tool? Oozie?

You may want to look into the new Azkaban, which, while quite heavyweight, is
actually quite pleasant to use once set up.

You can run Spark jobs (spark-submit) using Azkaban shell commands and pass
parameters between jobs. It supports dependencies, simple DAGs and
scheduling with retries.
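
As a rough sketch (the class names, jar paths and parameter are made up, and
I'm going from memory on the retry properties), a flow of two "command"-type
jobs chained by a dependency, each just shelling out to spark-submit, looks
something like this:

    # etl.job
    type=command
    command=spark-submit --class com.example.Etl /opt/jobs/etl.jar ${input.path}
    retries=3
    retry.backoff=30000

    # report.job - runs only after etl succeeds
    type=command
    command=spark-submit --class com.example.Report /opt/jobs/report.jar
    dependencies=etl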

I'm digging deeper and it may be worthwhile extending it with a Spark job
type...

It's probably best for mixed Hadoop / Spark clusters...


On Fri, Jul 11, 2014 at 12:52 AM, Andrei <faithlessfri...@gmail.com> wrote:
> I used both Oozie and Luigi, but found them inflexible and still
> overcomplicated, especially in the presence of Spark.
> 
> Oozie has a fixed list of building blocks, which is pretty limiting. For
> example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are out
> of scope (of course, you can always write a wrapper as a Java or shell action,
> but does it really need to be so complicated?). Another issue with Oozie is
> passing variables between actions. There is an Oozie context suitable for
> passing key-value pairs (both strings) between actions, but for more complex
> objects (say, a FileInputStream that should be closed only in the last step)
> you have to do some advanced kung fu.
> 
> Luigi, on the other hand, has its niche: complicated dataflows with many tasks
> that depend on each other. Basically, there are tasks (this is where you
> define computations) and targets (something that can "exist" - a file on disk,
> an entry in ZooKeeper, etc.). You ask Luigi to produce some target, and it
> creates a plan for achieving it. Luigi really shines when your workflow fits
> this model, but one step outside it and you are in trouble. For example,
> consider a simple pipeline: run an MR job and output temporary data, run
> another MR job and output final data, then clean up the temporary data. You
> could make a Clean target that depends on target MRJob2, which in turn depends
> on MRJob1, right? Not so fast. How do you check that the Clean target has been
> achieved? If you just test whether the temporary directory is empty, you catch
> both cases: when all tasks are done, and when they have not even started yet.
> Luigi does allow you to specify all three actions - MRJob1, MRJob2, Clean - in
> a single "run()" method, but that ruins the entire idea.
> 
> And of course, both of these frameworks are optimized for standard MapReduce
> jobs, which is probably not what you want on a Spark mailing list :)
> 
> Experience with these frameworks, however, gave me some insight into typical
> data pipelines.
> 
> 1. Pipelines are mostly linear. Oozie, Luigi and a number of other frameworks
> allow branching, but most pipelines actually consist of moving data from a
> source to a destination, possibly with some transformations in between (I'd be
> glad if somebody shared use cases where you really need branching).
> 2. Transactional logic is important. Either everything runs, or nothing does.
> Otherwise it's really easy to get into an inconsistent state.
> 3. Extensibility is important. You never know what you will need in a week or
> two.
> 
> So eventually I decided that it is much easier to create your own pipeline
> than to try to adapt your code to an existing framework. My latest pipeline
> incarnation simply consists of a list of steps that are started sequentially.
> Each step is a class with at least these methods:
> 
>  * run() - launch this step
>  * fail() - what to do if the step fails
>  * finalize() - (optional) what to do when all steps are done
> 
> For example, if you want the ability to run Spark jobs, you just create a
> SparkStep and configure it with the required code. If you want a Hive query,
> you just create a HiveStep and configure it with Hive connection settings. I
> use a YAML file to configure the steps and a Context (basically, a
> Map[String, Any]) to pass variables between them. I also use a configurable
> Reporter, available to all steps, to report progress.
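> 
> To make this concrete, here is a minimal sketch of the idea in Scala (the
> Pipeline object and the spark-submit invocation are only illustrative; the
> real version also loads the step list from the YAML file and wires in the
> Reporter):
> 
>     import scala.collection.mutable
> 
>     // A step knows how to run, how to clean up if something fails, and
>     // (optionally) what to do once the whole pipeline has succeeded.
>     trait Step {
>       def run(context: mutable.Map[String, Any]): Unit
>       def fail(context: mutable.Map[String, Any]): Unit = ()
>       def finalize(context: mutable.Map[String, Any]): Unit = ()
>     }
> 
>     // Hypothetical step that shells out to spark-submit; a real one would
>     // be configured from the YAML file and could read/write the context.
>     class SparkStep(mainClass: String, jar: String) extends Step {
>       def run(context: mutable.Map[String, Any]): Unit = {
>         import scala.sys.process._
>         val exitCode = Seq("spark-submit", "--class", mainClass, jar).!
>         if (exitCode != 0) sys.error(s"spark-submit failed for $mainClass")
>       }
>     }
> 
>     // Sequential, all-or-nothing runner (see point 2 above).
>     object Pipeline {
>       def run(steps: Seq[Step]): Unit = {
>         val context = mutable.Map.empty[String, Any]
>         try {
>           steps.foreach(_.run(context))
>           steps.foreach(_.finalize(context))
>         } catch {
>           case e: Exception =>
>             steps.foreach(_.fail(context))  // give every step a chance to clean up
>             throw e
>         }
>       }
>     }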
> 
> Hopefully, this will give you some insight into the best pipeline for your
> specific case.
> 
> 
> 
> On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown <p...@mult.ifario.us> wrote:
>> 
>> We use Luigi for this purpose.  (Our pipelines are typically on AWS (no EMR),
>> backed by S3, and use combinations of Python jobs, non-Spark Java/Scala, and
>> Spark.  We run Spark jobs by connecting drivers/clients to the master, and
>> those are what get invoked from Luigi.)
>> 
>> ―
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>> 
>> 
>> On Thu, Jul 10, 2014 at 10:20 AM, k.tham <kevins...@gmail.com> wrote:
>>> I'm just wondering what's the general recommendation for data pipeline
>>> automation.
>>> 
>>> Say, I want to run Spark Job A, then B, then invoke script C, then do D, and
>>> if D fails, do E, and if Job A fails, send email F, etc...
>>> 
>>> It looks like Oozie might be the best choice. But I'd like some
>>> advice/suggestions.
>>> 
>>> Thanks!
>>> 
>>> 
>>> 
>> 
> 


