We are still evaluating Hadoop for use in our main-line analysis systems, but we already have the problem of workflow scheduling.
Our solution for that was to implement a simpler version of Amazon's Simple Queue Service. This lets us run multiple redundant workers for some tasks, or throttle others down. The basic idea is that queues contain tasks encoded as XML. Tasks are read from the queue by workers, but are kept in a holding pen for a queue-specific time period after they are read. If the task completes normally, the worker deletes it; if the timeout expires before the worker completes the task, it is added back to the queue.

Workers are structured as a triple of scripts executed by a manager process: a pre-condition that determines whether any work should be done at all (usually a check for available local disk space or CPU cycles), an item qualification that is run against a particular item in case the work is subject to resource reservation, and the worker script itself. Even this tiny framework suffices for quite complex workflows and work constraints. It would be very easy to schedule map-reduce tasks via a similar mechanism.

On 6/23/07 5:34 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> James Kennedy wrote:
>> But back to my original question... Doug suggests that dependence on a
>> driver process is acceptable. But has anyone needed true MapReduce
>> chaining or tried it successfully? Or is it generally accepted that a
>> multi-MapReduce algorithm should always be driven by a single process?
>
> I would argue that this functionality is outside the scope of Hadoop. As
> far as I understand your question, you need orchestration, which
> involves the ability to record the state of previously executed
> map-reduce jobs and to start the next map-reduce jobs based on that
> state, possibly a long time after the first job completes and from a
> different process.
>
> I'm frequently facing this problem, and so far I've been using a
> poor-man's workflow system consisting of a bunch of cron jobs, shell
> scripts, and simple marker files to record the current state of the
> data. In a similar way you can implement advisory application-level
> locking, using lock files.
>
> Example: adding a new batch of pages to a Nutch index involves many
> steps, starting with fetchlist generation, fetching, parsing, updating
> the db, extraction of link information, and indexing. Each of these
> steps consists of one (or several) map-reduce jobs, and the input to the
> next jobs depends on the output of previous jobs. What you referred to
> in your previous email was a single-app driver for this workflow, called
> Crawl. But I'm using slightly modified individual tools, which on
> successful completion create marker files (e.g. fetching.done). Other
> tools check for the existence of these files and either perform their
> function or exit (e.g. updatedb exits if the segment has been fetched
> but not yet parsed).
>
> To summarize this long answer - I think that this functionality belongs
> in the application layer built on top of Hadoop, and IMHO we are better
> off not implementing it in Hadoop proper.
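
For what it's worth, the manager loop in our little framework boils down to something like the following (a rough Python sketch, not our actual code; the queue client's receive/delete calls, its visibility timeout, and the script names are all stand-ins):

# Sketch of the worker/manager loop described above. The queue client
# (receive/delete with a visibility timeout) and the three script paths
# are hypothetical placeholders.

import subprocess
import time

PRECONDITION  = "./precondition.sh"    # e.g. checks free disk / CPU
QUALIFICATION = "./qualify_item.sh"    # per-item check (resource reservation)
WORKER        = "./do_work.sh"         # the actual task

def run(script, *args):
    """Run a script; a zero exit status means 'go ahead' / 'succeeded'."""
    return subprocess.call([script, *args]) == 0

def main(queue, poll_interval=30):
    while True:
        if not run(PRECONDITION):              # can this node do any work now?
            time.sleep(poll_interval)
            continue
        task = queue.receive()                 # task becomes invisible for the
        if task is None:                       # queue-specific timeout period
            time.sleep(poll_interval)
            continue
        if not run(QUALIFICATION, task.body):  # is this particular item ours?
            continue                           # leave it; it reappears on timeout
        if run(WORKER, task.body):             # task body is the XML payload
            queue.delete(task)                 # success: remove it for good
        # on failure we do nothing: the timeout puts the task back in the queue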
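
Andrzej's marker-file gating would look something like this as a standalone wrapper (again only a sketch; the marker locations and the nutch invocation are illustrative):

# Sketch of the marker-file convention: a step runs only if the previous
# step's marker exists and its own does not. Paths/commands are illustrative.

import os
import subprocess
import sys

SEGMENT = sys.argv[1]              # e.g. a crawl segment directory

def done(step):
    return os.path.exists(os.path.join(SEGMENT, step + ".done"))

def mark(step):
    open(os.path.join(SEGMENT, step + ".done"), "w").close()

# parse only if fetching finished and parsing has not been done yet
if done("fetching") and not done("parsing"):
    if subprocess.call(["bin/nutch", "parse", SEGMENT]) == 0:
        mark("parsing")
else:
    sys.exit(0)                    # nothing to do for this segment yet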