I see it this way.

You can glue a bunch of discrete command line apps together, apps that may or 
may not have dependencies on one another, in a new syntax. Which is darn nice 
if you already have a bunch of discrete, ready-to-run command line apps sitting 
around that need to be strung together but can't be used as libraries and 
instantiated through their APIs.

Or, you can string all your work together through the APIs with a 
Turing-complete language and run it all from a single command line interface 
(and hand that to cron, or some other tool).

In this case you can use Java, or easier languages like JRuby, Groovy, Jython, 
Clojure, etc., which were designed for this kind of purpose. (They don't run 
on the cluster; they only run on the Hadoop client side.)

Think Ant vs. Gradle (or any other build tool that uses a scripting language 
and not a configuration file) if you want a concrete example.

Cascading itself is a query API (and query planner). But it also exposes to the 
user the ability to run discrete 'processes' in dependency order for you: 
either Cascading (Hadoop) Flows or Riffle-annotated process objects. They can 
all be intermingled and managed by the same dependency scheduler. Cascading 
has one, and Riffle has one.
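
A Riffle process object is just a POJO with lifecycle annotations. A rough 
sketch, going from my reading of the Riffle README (verify the annotation 
names against the repo linked below before relying on this):

import java.util.Arrays;
import java.util.Collection;

import riffle.process.DependencyIncoming;
import riffle.process.DependencyOutgoing;
import riffle.process.Process;
import riffle.process.ProcessComplete;
import riffle.process.ProcessStop;

@Process
public class MahoutStep { // hypothetical wrapper around a Mahout driver
  private final String input;
  private final String output;

  public MahoutStep(String input, String output) {
    this.input = input;
    this.output = output;
  }

  @DependencyIncoming
  public Collection<String> getIncoming() {
    return Arrays.asList(input); // resources this step consumes
  }

  @DependencyOutgoing
  public Collection<String> getOutgoing() {
    return Arrays.asList(output); // resources this step produces
  }

  @ProcessComplete
  public void complete() {
    // drive Mahout through its Java API here
  }

  @ProcessStop
  public void stop() {
    // cancel any in-flight work
  }
}

As I understand it, the scheduler works out ordering by matching one process's 
outgoing resources against another's incoming ones.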

So you can run Flow -> Mahout -> Pig -> Mahout -> Flow -> shell -> 
what-the-heck-ever from the same application.

Cascading also has the ability to only run 'stale' processes. Think of a 
'make' file. When re-running a job where only one of many input files has 
changed, this is a big win.
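
In Cascading that looks roughly like this (a sketch; the flows are assumed to 
be built elsewhere with a FlowConnector). The Cascade works out ordering from 
the taps the flows share, and skips any flow whose sinks are newer than its 
sources:

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

public class RunAll {
  public static void run(Flow flowOne, Flow flowTwo, Flow flowThree) {
    // ordering is inferred from shared sources and sinks
    Cascade cascade = new CascadeConnector().connect(flowOne, flowTwo, flowThree);
    cascade.complete(); // runs only the stale flows, in dependency order
  }
}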

I personally like parameterizing my applications via the command line and 
letting my CLI options drive the workflows. For example, my testing, 
integration, and production environments are very different, so it's very easy 
to drive specific runs of the jobs by changing a CLI arg. (args4j makes this 
darn simple.)
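
Something along these lines, with hypothetical option names:

import org.kohsuke.args4j.CmdLineParser;
import org.kohsuke.args4j.Option;

public class WorkflowMain {
  // hypothetical options; one binary covers test, integration, and prod
  @Option(name = "-env", usage = "one of: test, integration, prod")
  private String env = "test";

  @Option(name = "-input", usage = "input path (or JDBC URL)")
  private String input = "data/sample.txt";

  public static void main(String[] args) throws Exception {
    WorkflowMain main = new WorkflowMain();
    new CmdLineParser(main).parseArgument(args);
    // pick taps, paths, and which flows to run based on main.env here
  }
}

Switching environments is then just a matter of passing -env prod to the same 
jar.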

If I were chaining multiple CLI apps into a bigger production app, I suspect 
parameterizing that would be error prone, especially if the input/output data 
endpoints (JDBC vs. file) differ in different contexts.

You can find Riffle here: https://github.com/cwensel/riffle (it's Apache 
licensed, contributions welcome).

ckw

On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote:

> Ed,
> 
> Actually Oozie is quite different from Cascading.
> 
> * Cascading allows you to write 'queries' using a Java API and they get
> translated into MR jobs.
> * Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG
> (workflow jobs) and has timer+data dependency triggers (coordinator jobs).
> 
> Regards.
> 
> Alejandro
> 
> On Tue, Dec 14, 2010 at 1:26 PM, edward choi <mp2...@gmail.com> wrote:
> 
>> Thanks for the tip. I took a look at it.
>> Looks similar to Cascading I guess...?
>> Anyway thanks for the info!!
>> 
>> Ed
>> 
>> 2010/12/8 Alejandro Abdelnur <t...@cloudera.com>
>> 
>>> Or, if you want to do it in a reliable way you could use an Oozie
>>> coordinator job.
>>> 
>>> On Wed, Dec 8, 2010 at 1:53 PM, edward choi <mp2...@gmail.com> wrote:
>>>> My mistake. Come to think about it, you are right, I can just make an
>>>> infinite loop inside the Hadoop application.
>>>> Thanks for the reply.
>>>> 
>>>> 2010/12/7 Harsh J <qwertyman...@gmail.com>
>>>> 
>>>>> Hi,
>>>>> 
>>>>> On Tue, Dec 7, 2010 at 2:25 PM, edward choi <mp2...@gmail.com> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I'm planning to crawl a certain web site every 30 minutes.
>>>>>> How would I get it done in Hadoop?
>>>>>> 
>>>>>> In pure Java, I used Thread.sleep() method, but I guess this won't
>>> work
>>>>> in
>>>>>> Hadoop.
>>>>> 
>>>>> Why wouldn't it? You need to manage your post-job logic mostly, but
>>>>> sleep and resubmission should work just fine.
>>>>> 
>>>>>> Or if it could work, could anyone show me an example?
>>>>>> 
>>>>>> Ed.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Harsh J
>>>>> www.harshj.com
>>>>> 
>>>> 
>>> 
>> 

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
