Re: Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Rick Cox
On Wed, Oct 22, 2008 at 18:55, Steve Gao <[EMAIL PROTECTED]> wrote:
> I am using Hadoop Streaming. The input consists of multiple files.
> Is there a way to get the current filename in mapper?
>

Streaming map tasks should have a "map_input_file" environment
variable like the following:

map_input_file=hdfs://HOST/path/to/file
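
For example, a mapper could use it to tag each output line with its
source file. A minimal sketch in Perl (my assumption here is a plain
line-oriented text input):

  #!/usr/bin/perl
  # Streaming exports the current task's input file path in the
  # map_input_file environment variable.
  my $file = $ENV{'map_input_file'};
  while (<STDIN>) {
      chomp;
      print "$file\t$_\n";    # prefix each line with its source file
  }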

rick

> For example:
> $HADOOP_HOME/bin/hadoop \
>   jar $HADOOP_HOME/hadoop-streaming.jar \
>   -input file1 \
>   -input file2 \
>   -output myOutputDir \
>   -mapper mapper \
>   -reducer reducer
>
> In mapper:
> while (<STDIN>) {
>   # how to tell whether the current line is from file1 or file2?
> }


Re: process limits for streaming jar

2008-06-27 Thread Rick Cox
On Fri, Jun 27, 2008 at 08:57, Chris Anderson <[EMAIL PROTECTED]> wrote:

> The problem is that when there are a large number of map tasks to
> complete, Hadoop doesn't seem to obey the map.tasks.maximum. Instead,
> it is spawning 8 map tasks per tasktracker (even when I change the
> mapred.tasktracker.map.tasks.maximum in hadoop-site.xml to 2, on the
> master). The cluster was booted with the setting at 8. Do I need to
> change hadoop-site.xml on all the slaves, and restart the task
> trackers, in order to make the limit apply? That seems unlikely - I'd
> really like to manage this parameter on a per-job level.
>

Yes, mapred.tasktracker.map.tasks.maximum is configured per
tasktracker on startup. It can't be configured per job because it's
not a job-scope parameter (if there are multiple concurrent jobs, they
have to share the task limit).
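
To change it, something like the following would go in hadoop-site.xml
on each tasktracker (using the limit of 2 from your example), followed
by a tasktracker restart:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>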

rick


Re: Streaming and subprocess error code

2008-05-14 Thread Rick Cox
Does the syslog output from a should-have-failed task contain
something like this?

java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
subprocess failed with code 1

(In particular, I'm curious if it mentions the RuntimeException.)

Tasks that consume all their input and then exit non-zero are
definitely supposed to be counted as failed, so there's either a
problem with the setup or a bug somewhere.
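
For reference, a should-fail mapper is easy to fake; this minimal Perl
sketch consumes all of its input and then exits non-zero, so the task
should be counted as failed:

  #!/usr/bin/perl
  # Echo all input through, then simulate a post-processing error.
  while (<STDIN>) { print; }
  exit 1;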

rick

On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov <[EMAIL PROTECTED]> wrote:
> Hi,
>
>  I've tested this new option "-jobconf
> stream.non.zero.exit.status.is.failure=true". It seems to work, but it's
> still not enough for me. When the mapper/reducer program has read all of
> its input successfully and fails only after that, the streaming job still
> finishes successfully, so there is no way to learn about post-processing
> errors in the subprocesses :(
>
>
>
>  Andrey Pankov wrote:
>
> > Hi Rick,
> >
> > Thank you for the quick response! I see this feature is in trunk and
> > not available in the last stable release. Anyway, I'll try it from
> > trunk and also check whether it catches segmentation faults.
> >
> > Rick Cox wrote:
> >
> > > Try "-jobconf stream.non.zero.exit.status.is.failure=true".
> > >
> > > That will tell streaming that a non-zero exit is a task failure. To
> > > turn that into an immediate whole job failure, I think configuring 0
> > > task retries (mapred.map.max.attempts=1 and
> > > mapred.reduce.max.attempts=1) will be sufficient.
> > >
> > > rick
> > >
>
>
>  --
>  Andrey Pankov


Re: Streaming and subprocess error code

2008-05-13 Thread Rick Cox
Try "-jobconf stream.non.zero.exit.status.is.failure=true".

That will tell streaming that a non-zero exit is a task failure. To
turn that into an immediate whole job failure, I think configuring 0
task retries (mapred.map.max.attempts=1 and
mapred.reduce.max.attempts=1) will be sufficient.
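
Put together, the streaming invocation would look something like this
sketch (the input/output paths and program names are placeholders):

  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDir \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer \
    -jobconf stream.non.zero.exit.status.is.failure=true \
    -jobconf mapred.map.max.attempts=1 \
    -jobconf mapred.reduce.max.attempts=1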

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>  I'm looking for a way to force Streaming to shut down the whole job when
> one of its subprocesses exits with a non-zero error code.
>
>  We have the following situation: sometimes either the mapper or the
> reducer crashes, and as a rule it returns a non-zero exit code. In this
> case the entire streaming job still finishes successfully, but that's
> wrong. Almost the same thing happens when a subprocess dies with a
> segmentation fault.
>
>  The only way to check automatically whether a subprocess crashed is via
> the logs, but that means parsing tons of outputs/logs/dirs/etc.
>  To find the logs of your job you have to know its jobid, e.g.
> job_200805130853_0016. I don't know an easy way to determine it other than
> scanning stdout for that pattern. Then you have to find the logs of each
> mapper and each reducer, find a way to parse them, etc, etc...
>
>  So, is there an easier way to get the correct status of the whole
> streaming job, or do I still have to build a rather fragile parsing system
> for such purposes?
>
>  Thanks in advance.
>
>  --
>  Andrey Pankov
>
>


Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)

2008-04-08 Thread Rick Cox
On Tue, Apr 8, 2008 at 12:36 PM, Ian Tegebo <[EMAIL PROTECTED]> wrote:

>
>  My original question was about specifying MaxMapTaskFailuresPercent as a
>  job conf parameter on the command line for streaming jobs. Is there a conf
>  setting like the following?
>
>  mapred.taskfailure.percent

The job conf settings to control this are:

mapred.max.map.failures.percent
mapred.max.reduce.failures.percent

Both have a default of 0, meaning any failed task makes for a failed
job (according to JobConf.java).

rick