Re: Is there a way to know the input filename at Hadoop Streaming?
On Wed, Oct 22, 2008 at 18:55, Steve Gao <[EMAIL PROTECTED]> wrote:
> I am using Hadoop Streaming. The inputs are multiple files.
> Is there a way to get the current filename in the mapper?

Streaming map tasks should have a "map_input_file" environment variable, like the following:

map_input_file=hdfs://HOST/path/to/file

rick

> For example:
> $HADOOP_HOME/bin/hadoop \
>   jar $HADOOP_HOME/hadoop-streaming.jar \
>   -input file1 \
>   -input file2 \
>   -output myOutputDir \
>   -mapper mapper \
>   -reducer reducer
>
> In mapper:
> while (){
>   // how to tell whether the current line is from file1 or file2?
> }
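For illustration, a minimal Python mapper could read that environment variable and tag each output line with its source file. This is a sketch, not part of the original thread; the `tag_lines` helper name is hypothetical, and it assumes the mapper is written in Python:

```python
#!/usr/bin/env python
# Sketch: tag each output line with the basename of the input file
# it came from, using the "map_input_file" variable that streaming
# exports (e.g. map_input_file=hdfs://HOST/path/to/file).
import os
import sys

def tag_lines(lines, env=os.environ):
    # Fall back to "unknown" when run outside of streaming.
    source = os.path.basename(env.get("map_input_file", "unknown"))
    for line in lines:
        yield "%s\t%s" % (source, line.rstrip("\n"))

if __name__ == "__main__":
    for out in tag_lines(sys.stdin):
        print(out)
```

Launched with the `-input file1 -input file2` command above, each emitted record would then carry `file1` or `file2` as its key.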
Re: process limits for streaming jar
On Fri, Jun 27, 2008 at 08:57, Chris Anderson <[EMAIL PROTECTED]> wrote:
> The problem is that when there are a large number of map tasks to
> complete, Hadoop doesn't seem to obey the map.tasks.maximum. Instead,
> it is spawning 8 map tasks per tasktracker (even when I change the
> mapred.tasktracker.map.tasks.maximum in hadoop-site.xml to 2, on the
> master). The cluster was booted with the setting at 8. Do I need to
> change hadoop-site.xml on all the slaves, and restart the task
> trackers, in order to make the limit apply? That seems unlikely - I'd
> really like to manage this parameter on a per-job level.

Yes, mapred.tasktracker.map.tasks.maximum is configured per tasktracker on startup. It can't be configured per job because it's not a job-scope parameter (if there are multiple concurrent jobs, they have to share the task limit).

rick
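Concretely, per the answer above, the setting would go in hadoop-site.xml on each slave, followed by a tasktracker restart. A sketch of the relevant fragment (the value 2 matches the example in the question):

```xml
<!-- hadoop-site.xml on each tasktracker node; restart the
     tasktracker for the change to take effect -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
```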
Re: Streaming and subprocess error code
Does the syslog output from a should-have-failed task contain something like this?

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

(In particular, I'm curious whether it mentions the RuntimeException.)

Tasks that consume all their input and then exit non-zero are definitely supposed to be counted as failed, so there's either a problem with the setup or a bug somewhere.

rick

On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I've tested this new option "-jobconf
> stream.non.zero.exit.status.is.failure=true". It seems to work, but it's still
> not good enough for me. When the mapper/reducer program has read all input
> data successfully and fails after that, streaming still finishes successfully,
> so there is no way to find out about post-processing errors in the
> subprocesses :(
>
> Andrey Pankov wrote:
> > Hi Rick,
> >
> > Thank you for the quick response! I see this feature is in trunk and not
> > available in the last stable release. Anyway, I'll try the trunk version
> > and see whether it catches segmentation faults too.
> >
> > Rick Cox wrote:
> > > Try "-jobconf stream.non.zero.exit.status.is.failure=true".
> > >
> > > That will tell streaming that a non-zero exit is a task failure. To
> > > turn that into an immediate whole-job failure, I think configuring 0
> > > task retries (mapred.map.max.attempts=1 and
> > > mapred.reduce.max.attempts=1) will be sufficient.
> > >
> > > rick
> > >
> > > On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <[EMAIL PROTECTED]> wrote:
> > > > Hi all,
> > > >
> > > > I'm looking for a way to force Streaming to fail the whole job
> > > > when one of its subprocesses exits with a non-zero error code.
> > > >
> > > > Our situation is this: sometimes either the mapper or the reducer
> > > > crashes, and as a rule it returns a non-zero exit code. In this case
> > > > the entire streaming job finishes successfully, but that's wrong.
> > > > It's much the same when a subprocess dies with a segmentation fault.
> > > >
> > > > It's only possible to check automatically whether a subprocess
> > > > crashed via the logs, but that means you need to parse tons of
> > > > outputs/logs/dirs/etc. To find the logs for your job you have to
> > > > know its jobid, e.g. job_200805130853_0016. I don't know an easy way
> > > > to determine it other than scanning stdout for the pattern. Then you
> > > > have to find the logs of each mapper and each reducer, find a way to
> > > > parse them, etc., etc...
> > > >
> > > > So, is there any easier way to get the correct status of the whole
> > > > streaming job, or do I still have to build a rather fragile parsing
> > > > system for this purpose?
> > > >
> > > > Thanks in advance.
> > > >
> > > > --
> > > > Andrey Pankov
>
> --
> Andrey Pankov
Re: Streaming and subprocess error code
Try "-jobconf stream.non.zero.exit.status.is.failure=true".

That will tell streaming that a non-zero exit is a task failure. To turn that into an immediate whole-job failure, I think configuring 0 task retries (mapred.map.max.attempts=1 and mapred.reduce.max.attempts=1) will be sufficient.

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm looking for a way to force Streaming to fail the whole job when one of
> its subprocesses exits with a non-zero error code.
>
> Our situation is this: sometimes either the mapper or the reducer crashes,
> and as a rule it returns a non-zero exit code. In this case the entire
> streaming job finishes successfully, but that's wrong. It's much the same
> when a subprocess dies with a segmentation fault.
>
> It's only possible to check automatically whether a subprocess crashed via
> the logs, but that means you need to parse tons of outputs/logs/dirs/etc.
> To find the logs for your job you have to know its jobid, e.g.
> job_200805130853_0016. I don't know an easy way to determine it other than
> scanning stdout for the pattern. Then you have to find the logs of each
> mapper and each reducer, find a way to parse them, etc., etc...
>
> So, is there any easier way to get the correct status of the whole
> streaming job, or do I still have to build a rather fragile parsing system
> for this purpose?
>
> Thanks in advance.
>
> --
> Andrey Pankov
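To make the advice above concrete: with stream.non.zero.exit.status.is.failure=true set, the mapper itself only needs to exit non-zero when something goes wrong. This is an illustrative sketch, not code from the thread; the record format (tab-separated key/value) and the `process` helper are hypothetical:

```python
#!/usr/bin/env python
# Sketch of a "fail fast" streaming mapper: it exits with a non-zero
# status on malformed input, so that streaming (with the jobconf
# option above, plus mapred.map.max.attempts=1) fails the whole job.
import sys

def process(lines):
    # Hypothetical format: each record is "key<TAB>value".
    results = []
    for n, line in enumerate(lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError("malformed record on line %d" % n)
        results.append("%s\t%s" % (fields[0], fields[1]))
    return results

if __name__ == "__main__":
    try:
        for out in process(sys.stdin):
            print(out)
    except ValueError as e:
        sys.stderr.write(str(e) + "\n")
        sys.exit(1)  # the non-zero exit is what streaming detects
```

Note this only covers failures while records are being processed; as the follow-up in this thread points out, errors after all input has been consumed may still go unnoticed.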
Re: New user, several questions/comments (MaxMapTaskFailuresPercent in particular)
On Tue, Apr 8, 2008 at 12:36 PM, Ian Tegebo <[EMAIL PROTECTED]> wrote:
> My original question was about specifying MaxMapTaskFailuresPercent as a
> job conf parameter on the command line for streaming jobs. Is there a conf
> setting like the following?
>
> mapred.taskfailure.percent

The job conf settings that control this are:

mapred.max.map.failures.percent
mapred.max.reduce.failures.percent

Both have a default of 0, meaning any failed task makes for a failed job (according to JobConf.java).

rick
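To illustrate the semantics of those settings (this is a sketch of the behavior described above, not Hadoop's actual implementation): a job fails once the fraction of failed tasks exceeds the configured percentage, so the default of 0 means a single failed task fails the job.

```python
# Sketch of the mapred.max.map.failures.percent semantics: the job is
# failed when the percentage of failed map tasks exceeds the limit.
def job_failed(failed_tasks, total_tasks, max_failures_percent=0):
    # Compare failed/total against percent/100 without floating point.
    # With the default of 0, any failed task fails the job.
    return failed_tasks * 100 > max_failures_percent * total_tasks
```

For example, with the default of 0, one failure out of 100 map tasks fails the job, while a limit of 10 would tolerate up to 10 such failures.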