I have used both Oozie and Luigi, but found them inflexible and at the
same time overcomplicated, especially in the presence of Spark.

Oozie has a fixed list of building blocks, which is pretty limiting. For
example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are
out of scope (of course, you can always write a wrapper as a Java or Shell
action, but does it really need to be so complicated?). Another issue with
Oozie is passing variables between actions. There's an Oozie context that
is suitable for passing key-value pairs (both strings) between actions,
but for more complex objects (say, a FileInputStream that should be closed
only at the last step) you have to do some advanced kung fu.

Luigi, on the other hand, has its niche: complicated dataflows with many
tasks that depend on each other. Basically, there are tasks (where you
define computations) and targets (something that can "exist": a file on
disk, an entry in ZooKeeper, etc.). You ask Luigi to produce some target,
and it creates a plan for achieving it. Luigi really shines when your
workflow fits this model, but one step away and you are in trouble. For
example, consider a simple pipeline: run an MR job that outputs temporary
data, run another MR job that outputs the final data, then clean up the
temporary data. You can make a target Clean that depends on target MRJob2,
which in turn depends on MRJob1, right? Not so easy. How do you check that
the Clean task is achieved? If you just test whether the temporary
directory is empty, you catch both cases: when all tasks are done and when
they have not even started yet. Luigi does let you put all three actions
(MRJob1, MRJob2, Clean) into a single run() method, but that ruins the
entire idea.

And of course, both of these frameworks are optimized for standard
MapReduce jobs, which is probably not what you want on the Spark mailing
list :)

Experience with these frameworks, however, gave me some insights into
typical data pipelines.

1. Pipelines are mostly linear. Oozie, Luigi, and a number of other
frameworks allow branching, but most pipelines actually consist of moving
data from a source to a destination, possibly with some transformations in
between (I'd be glad if somebody shared use cases where branching is
really needed).
2. Transactional logic is important. Either everything happens, or nothing
does. Otherwise it's really easy to get into an inconsistent state.
3. Extensibility is important. You never know what you will need in a week
or two.

So eventually I decided that it is much easier to create your own pipeline
than to try to adapt your code to an existing framework. My latest
pipeline incarnation simply consists of a list of steps that are executed
sequentially. Each step is a class with at least these methods:

 * run() - launch this step
 * fail() - what to do if the step fails
 * finalize() - (optional) what to do when all steps are done
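
In case a sketch helps, here is one possible shape for that interface and
a sequential runner (in Python for brevity; the names are illustrative,
this is not my actual code, and the rollback-on-failure policy is just one
way to get the "everything or nothing" behavior from point 2 above):

    class Step:
        def run(self, context):
            # launch this step; may read/write the shared context
            raise NotImplementedError

        def fail(self, context, error):
            # what to do if a step fails; default is to do nothing
            pass

        def finalize(self, context):
            # optional: what to do when all steps are done
            pass

    def run_pipeline(steps, context):
        completed = []
        for step in steps:
            try:
                step.run(context)
            except Exception as error:
                # the failing step and every completed step get a
                # chance to undo their work, newest first
                for s in [step] + list(reversed(completed)):
                    s.fail(context, error)
                raise
            completed.append(step)
        # all steps succeeded: run the optional finalizers
        for step in completed:
            step.finalize(context)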

For example, if you want the ability to run Spark jobs, you just create a
SparkStep and configure it with the required code. If you want a Hive
query, you just create a HiveStep and configure it with Hive connection
settings. I use a YAML file to configure the steps and a Context
(basically, a Map[String, Any]) to pass variables between them. I also use
a configurable Reporter, available to all steps, for progress reporting.
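
For instance, a hypothetical SparkStep might launch the job through
spark-submit and publish its output location into the shared context for
later steps to pick up (the jar, class, and paths below are made up):

    import subprocess

    class SparkStep(Step):
        def __init__(self, app_jar, main_class, output_key, output_path):
            self.app_jar = app_jar
            self.main_class = main_class
            self.output_key = output_key
            self.output_path = output_path

        def run(self, context):
            # hand the job off to the cluster via spark-submit
            subprocess.check_call([
                'spark-submit',
                '--class', self.main_class,
                self.app_jar,
                self.output_path,
            ])
            # downstream steps read the location from the context
            context[self.output_key] = self.output_path

        def fail(self, context, error):
            # e.g. remove partial output; omitted here
            pass

A YAML config then only needs to map each entry to a step class and its
constructor arguments.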

Hopefully, this will give you some insight into the best pipeline for your
specific case.



On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown <p...@mult.ifario.us> wrote:

>
> We use Luigi for this purpose.  (Our pipelines are typically on AWS (no
> EMR) backed by S3 and using combinations of Python jobs, non-Spark
> Java/Scala, and Spark.  We run Spark jobs by connecting drivers/clients to
> the master, and those are what is invoked from Luigi.)
>
> —
> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>
>
> On Thu, Jul 10, 2014 at 10:20 AM, k.tham <kevins...@gmail.com> wrote:
>
>> I'm just wondering what's the general recommendation for data pipeline
>> automation.
>>
>> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
>> and
>> if D fails, do E, and if Job A fails, send email F, etc...
>>
>> It looks like Oozie might be the best choice. But I'd like some
>> advice/suggestions.
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
