If you're already using Scala for Spark programming and you hate Oozie XML
as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie:
https://github.com/klout/scoozie


On Thu, Jul 10, 2014 at 5:52 PM, Andrei <faithlessfri...@gmail.com> wrote:

> I used both - Oozie and Luigi - but found them inflexible and still
> overcomplicated, especially in the presence of Spark.
>
> Oozie has a fixed list of building blocks, which is pretty limiting. For
> example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are
> out of scope (of course, you can always write a wrapper as a Java or Shell
> action, but does it really need to be so complicated?). Another issue with
> Oozie is passing variables between actions. There's an Oozie context that
> is suitable for passing key-value pairs (both strings) between actions,
> but for more complex objects (say, a FileInputStream that should be closed
> only at the last step) you have to do some advanced kung fu.
>
> Luigi, on the other hand, has its niche: complicated dataflows with many
> tasks that depend on each other. Basically, there are tasks (this is where
> you define computations) and targets (something that can "exist" - a file
> on disk, an entry in ZooKeeper, etc.). You ask Luigi to get some target,
> and it creates a plan for achieving it. Luigi really shines when your
> workflow fits this model, but one step away and you are in trouble. For
> example, consider a simple pipeline: run an MR job and output temporary
> data, run another MR job and output final data, then clean up the
> temporary data. You can make a target Clean that depends on target MRJob2,
> which in turn depends on MRJob1, right? Not so easy. How do you check that
> the Clean target is achieved? If you just test whether the temporary
> directory is empty, you catch both cases - when all tasks are done and
> when they haven't even started yet. Luigi allows you to specify all 3
> actions - MRJob1, MRJob2, Clean - in a single "run()" method, but that
> ruins the entire idea.
>
> And of course, both of these frameworks are optimized for standard
> MapReduce jobs, which is probably not what you want on a Spark mailing
> list :)
>
> Experience with these frameworks, however, gave me some insight into
> typical data pipelines.
>
> 1. Pipelines are mostly linear. Oozie, Luigi and a number of other
> frameworks allow branching, but most pipelines actually consist of moving
> data from a source to a destination, with possibly some transformations in
> between (I'd be glad if somebody shared use cases where you really need
> branching).
> 2. Transactional logic is important. Either everything succeeds, or
> nothing does. Otherwise it's really easy to get into an inconsistent state.
> 3. Extensibility is important. You never know what you will need in a week
> or two.
>
> So eventually I decided that it is much easier to create your own pipeline
> than to try to adapt your code to existing frameworks. My latest pipeline
> incarnation simply consists of a list of steps that are started
> sequentially. Each step is a class with at least these methods:
>
>  * run() - launch this step
>  * fail() - what to do if step fails
>  * finalize() - (optional) what to do when all steps are done
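>
> In Scala-ish terms, such a step could look roughly like this (a simplified
> sketch, not the actual code - the Context and Step definitions below are
> just illustrative):
>
>     import scala.collection.mutable
>
>     // Shared state passed between steps (basically a Map[String, Any]).
>     class Context {
>       private val data = mutable.Map.empty[String, Any]
>       def put(key: String, value: Any): Unit = data(key) = value
>       def get[T](key: String): Option[T] = data.get(key).map(_.asInstanceOf[T])
>     }
>
>     trait Step {
>       def run(ctx: Context): Unit                      // launch this step
>       def fail(ctx: Context, e: Throwable): Unit = ()  // what to do if the step fails
>       def finalize(ctx: Context): Unit = ()            // optional: what to do when all steps are done
>     }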
>
> For example, if you want to add the possibility to run Spark jobs, you just
> create a SparkStep and configure it with the required code. If you want a
> Hive query, you just create a HiveStep and configure it with Hive connection
> settings. I use a YAML file to configure the steps and a Context (basically,
> a Map[String, Any]) to pass variables between them. I also use a
> configurable Reporter, available to all steps, to report progress.
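>
> A rough sketch of how the steps could be driven (again just illustrative,
> building on the Step/Context sketch above; the SparkStep body, Reporter and
> wiring are simplified assumptions):
>
>     // A step that runs some Spark code; here the job body is just a function.
>     class SparkStep(name: String, job: Context => Unit) extends Step {
>       def run(ctx: Context): Unit = job(ctx)
>       override def fail(ctx: Context, e: Throwable): Unit =
>         println(s"$name failed: ${e.getMessage}")
>     }
>
>     trait Reporter { def report(msg: String): Unit }
>
>     object Pipeline {
>       // Start the steps sequentially; stop at the first failure (all or
>       // nothing), then give every step a chance to clean up via finalize().
>       def runPipeline(steps: Seq[Step], ctx: Context, reporter: Reporter): Unit =
>         try {
>           steps.foreach { step =>
>             reporter.report(s"running ${step.getClass.getSimpleName}")
>             try step.run(ctx)
>             catch { case e: Throwable => step.fail(ctx, e); throw e }
>           }
>         } finally {
>           steps.foreach(_.finalize(ctx))
>         }
>     }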
>
> Hopefully, this will give you some insight into the best pipeline for your
> specific case.
>
>
>
> On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown <p...@mult.ifario.us> wrote:
>
>>
>> We use Luigi for this purpose.  (Our pipelines are typically on AWS (no
>> EMR), backed by S3, and use combinations of Python jobs, non-Spark
>> Java/Scala, and Spark.  We run Spark jobs by connecting drivers/clients to
>> the master, and those are what gets invoked from Luigi.)
>>
>> —
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>
>>
>> On Thu, Jul 10, 2014 at 10:20 AM, k.tham <kevins...@gmail.com> wrote:
>>
>>> I'm just wondering what the general recommendation is for data pipeline
>>> automation.
>>>
>>> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
>>> and
>>> if D fails, do E, and if Job A fails, send email F, etc...
>>>
>>> It looks like Oozie might be the best choice. But I'd like some
>>> advice/suggestions.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
