Hi,

Thanks! The self-referencing sub-workflow is a good idea; it never occurred to me.
However, I'm still hoping for something more lightweight, with no Oozie or other
external tools.

My best idea now is simply wrapping the exec call in the script that submits the
job (hadoop jar hadoop-streaming.jar ...), extracting the JobId from its output,
then wrapping another exec (hadoop job -counter ...) which can give me the
counter values. Is this the best option?
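A minimal sketch of that approach in Python. Note this is an assumption-laden sketch, not a tested tool: the "Running job: ..." line format is the Hadoop 1.x streaming client's log output, and `hadoop job -counter <job-id> <group> <name>` printing just the value is how that subcommand behaved in that era; both helper names are made up for illustration.

```python
import re
import subprocess

def parse_job_id(output):
    """Extract the JobId from the streaming client's log output.

    Assumes a line like 'Running job: job_201211270001_0042'
    (Hadoop 1.x streaming format).
    """
    match = re.search(r"Running job: (job_\S+)", output)
    if match is None:
        raise RuntimeError("could not find a JobId in the submit output")
    return match.group(1)

def submit_streaming_job(streaming_args):
    """Submit a streaming job and return its JobId."""
    output = subprocess.check_output(
        ["hadoop", "jar", "hadoop-streaming.jar"] + streaming_args,
        stderr=subprocess.STDOUT).decode()
    return parse_job_id(output)

def get_counter(job_id, group, counter):
    """Read one counter value; 'hadoop job -counter' prints the bare number."""
    output = subprocess.check_output(
        ["hadoop", "job", "-counter", job_id, group, counter]).decode()
    return int(output.strip())
```

The recursion then becomes an ordinary loop in the controller: submit a phase, read the counter, and resubmit with the previous phase's output directory as input until the counter says you are done.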

Thanks!

________________________________
From: Alejandro Abdelnur [t...@cloudera.com]
Sent: Tuesday, November 27, 2012 6:10 PM
To: common-u...@hadoop.apache.org
Subject: Re: Complex MapReduce applications with the streaming API

> Using Oozie seems to be overkill for this application; besides, it doesn't
> support "loops", so the recursion can't really be implemented.

Correct, Oozie does not support loops; this is a restriction by design (early
prototypes supported them). The idea was to avoid never-ending workflows.
Instead, Coordinator jobs address the recurrent execution of workflow jobs.

Still, if you want to do recursion in Oozie, you certainly can: a workflow can
invoke itself as a sub-workflow. Just make sure you properly define your exit
condition.
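For reference, a self-invoking workflow looks roughly like this in workflow XML. This is a hand-drawn sketch, not a working definition: the app name, the `continue` property, and the action bodies are invented for illustration; `wf:appPath()` is the Oozie EL function that returns the running workflow's own application path.

```xml
<workflow-app name="recursive-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="mr-phase"/>
    <action name="mr-phase">
        <map-reduce><!-- the actual MapReduce phase, elided --></map-reduce>
        <ok to="check-done"/>
        <error to="fail"/>
    </action>
    <!-- exit condition: only recurse while the flag says to continue -->
    <decision name="check-done">
        <switch>
            <case to="recurse">${wf:conf('continue') eq 'true'}</case>
            <default to="end"/>
        </switch>
    </decision>
    <action name="recurse">
        <sub-workflow>
            <!-- points back at this same workflow application -->
            <app-path>${wf:appPath()}</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail"><message>workflow failed</message></kill>
    <end name="end"/>
</workflow-app>
```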

If you have additional questions, please move this thread to the 
u...@oozie.apache.org<mailto:u...@oozie.apache.org> alias.


Thx


On Tue, Nov 27, 2012 at 4:03 AM, Zoltán Tóth-Czifra 
<zoltan.tothczi...@softonic.com<mailto:zoltan.tothczi...@softonic.com>> wrote:
Hi everyone,

Thanks in advance for the support. My problem is the following:

I'm trying to develop a fairly complex MapReduce application using the
streaming API (for demonstration purposes, so unfortunately the "use Java"
answer doesn't work :-( ). I can get a single MapReduce phase running from the
command line with no problem. The problem is when I want to add more MapReduce
phases that use each other's output, and I may even want to do a recursion
(feed the output back into the same phase again) conditioned on a counter.

The solution in Java MapReduce is trivial (i.e. creating multiple Job instances
and monitoring counters), but with the streaming API it is not quite so simple.
What is the correct way to manage my application in its native code (Python,
PHP, Perl...)? Calling shell commands from a "controller" script? How should I
obtain the counters?

Using Oozie seems to be overkill for this application; besides, it doesn't
support "loops", so the recursion can't really be implemented.

Thanks a lot!
Zoltan



--
Alejandro
