You may look into the new Azkaban, which, while quite heavyweight, is actually quite pleasant to use once set up.

You can run Spark jobs (spark-submit) using Azkaban shell commands and pass parameters between jobs. It supports dependencies, simple DAGs and scheduling with retries. I'm digging deeper, and it may be worthwhile extending it with a Spark job type... It's probably best for mixed Hadoop / Spark clusters...
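For a rough idea of how such a flow is wired up, here is a minimal sketch using Azkaban's command job type (the class names, jar paths and property value are made up for illustration, and the retry property names may vary between Azkaban versions):

    # job-a.job -- first Spark job, submitted through the shell
    type=command
    command=spark-submit --class com.example.JobA --master yarn /jobs/job-a.jar ${input.dir}

    # job-b.job -- runs only after job-a succeeds, with a few retries
    type=command
    command=spark-submit --class com.example.JobB --master yarn /jobs/job-b.jar
    dependencies=job-a
    retries=3
    retry.backoff=60000

Values such as ${input.dir} are resolved from the project's properties files, and the dependencies line is what builds the simple DAG mentioned above.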
— Sent from Mailbox

On Fri, Jul 11, 2014 at 12:52 AM, Andrei <faithlessfri...@gmail.com> wrote:
> I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in the presence of Spark.
>
> Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can always write a wrapper as a Java or Shell action, but does it really need to be so complicated?). Another issue with Oozie is passing variables between actions. There's the Oozie context, which is suitable for passing key-value pairs (both strings) between actions, but for more complex objects (say, a FileInputStream that should be closed at the last step only) you have to do some advanced kung fu.
>
> Luigi, on the other hand, has its niche - complicated dataflows with many tasks that depend on each other. Basically, there are tasks (this is where you define computations) and targets (something that can "exist" - a file on disk, an entry in ZooKeeper, etc.). You ask Luigi to get some target, and it creates a plan for achieving this. Luigi really shines when your workflow fits this model, but one step away and you are in trouble. For example, consider a simple pipeline: run an MR job and output temporary data, run another MR job and output final data, clean up the temporary data. You can make a target Clean that depends on target MRJob2, which, in its turn, depends on MRJob1, right? Not so easy. How do you check that the Clean task is achieved? If you just test whether the temporary directory is empty or not, you catch both cases - when all tasks are done and when they haven't even started yet. Luigi allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single "run()" method, but that ruins the entire idea.
>
> And of course, both of these frameworks are optimized for standard MapReduce jobs, which is probably not what you want on the Spark mailing list :)
>
> Experience with these frameworks, however, gave me some insights about typical data pipelines.
>
> 1. Pipelines are mostly linear. Oozie, Luigi and a number of other frameworks allow branching, but most pipelines actually consist of moving data from source to destination with possibly some transformations in between (I'll be glad if somebody shares use cases where you really need branching).
> 2. Transactional logic is important. Either everything, or nothing; otherwise it's really easy to get into an inconsistent state.
> 3. Extensibility is important. You never know what you will need in a week or two.
>
> So eventually I decided that it is much easier to create your own pipeline than to try to adapt your code to existing frameworks. My latest pipeline incarnation simply consists of a list of steps that are started sequentially. Each step is a class with at least these methods:
>
> * run() - launch this step
> * fail() - what to do if the step fails
> * finalize() - (optional) what to do when all steps are done
>
> For example, if you want to add the possibility to run Spark jobs, you just create a SparkStep and configure it with the required code. If you want a Hive query - just create a HiveStep and configure it with Hive connection settings. I use a YAML file to configure the steps and a Context (basically, a Map[String, Any]) to pass variables between them. I also use a configurable Reporter, available to all steps, to report progress.
>
> Hopefully, this will give you some insights about the best pipeline for your specific case.
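To make the shape of that approach a bit more concrete, here is a minimal Scala sketch of such a sequential step runner. Only the method names run()/fail()/finalize() and the idea of a Context as a Map[String, Any] come from the description above; every other name, signature and detail is hypothetical, not actual code from that pipeline:

    import scala.collection.mutable
    import scala.util.control.NonFatal
    import scala.sys.process._

    object MiniPipeline {
      // Shared mutable state passed between steps (the "Context" from the description above).
      type Context = mutable.Map[String, Any]

      // Minimal step contract: run the step, react to a failure, optionally clean up at the end.
      trait Step {
        def run(ctx: Context): Unit
        def fail(ctx: Context, error: Throwable): Unit
        def finalize(ctx: Context): Unit = ()  // optional: called once all steps have succeeded
      }

      // Run the steps one after another; on the first failure, let that step handle it and stop.
      def runAll(steps: Seq[Step], ctx: Context): Unit = {
        var failed = false
        val remaining = steps.iterator
        while (remaining.hasNext && !failed) {
          val step = remaining.next()
          try step.run(ctx)
          catch {
            case NonFatal(e) =>
              step.fail(ctx, e)  // the step decides what failure means: alert, rollback, etc.
              failed = true
          }
        }
        if (!failed) steps.foreach(_.finalize(ctx))  // cleanup phase after a fully successful run
      }

      // Illustrative step that shells out to spark-submit and records the exit code in the Context.
      class SparkStep(mainClass: String, appJar: String) extends Step {
        def run(ctx: Context): Unit = {
          val exit = Seq("spark-submit", "--class", mainClass, appJar).!
          if (exit != 0) sys.error(s"spark-submit exited with code $exit")
          ctx("lastSparkExit") = exit
        }
        def fail(ctx: Context, error: Throwable): Unit =
          println(s"Spark step failed: ${error.getMessage}")  // e.g. send that alert email here
      }
    }

A YAML file would then only need to say which Step classes to build and with what settings, and something like the Reporter mentioned above could be handed to the steps alongside the Context.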
> On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown <p...@mult.ifario.us> wrote:
>>
>> We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR), backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.)
>>
>> —
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>
>>
>> On Thu, Jul 10, 2014 at 10:20 AM, k.tham <kevins...@gmail.com> wrote:
>>
>>> I'm just wondering what's the general recommendation for data pipeline automation.
>>>
>>> Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc...
>>>
>>> It looks like Oozie might be the best choice. But I'd like some advice/suggestions.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.