You have to modify wordcount-mapper-t1.py to just ignore the bad line. In the worst case, you should be able to do something like:

import sys

for line in sys.stdin:
  try:
    # Insert processing code here
    pass
  except Exception:
    # Error processing record, ignore it
    pass
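
For reference, here is a minimal sketch of what the whole mapper could look like with that guard in place. It assumes a plain word count that emits one "word<TAB>1" pair per token; the tokenization and counter names are illustrative, not your actual script:

#!/usr/bin/env python
# Minimal word-count mapper sketch -- illustrative only,
# showing where the try/except fits around the per-line work.
import sys

for line in sys.stdin:
  try:
    # Emit "word<TAB>1" for every whitespace-separated token
    for word in line.strip().split():
      print("%s\t%d" % (word, 1))
  except Exception:
    # Bad record: report it via a streaming counter and move on
    sys.stderr.write("reporter:counter:WordCount,BadRecords,1\n")

The reporter:counter line is optional; Hadoop streaming picks up "reporter:counter:<group>,<counter>,<amount>" lines on stderr, so skipped records show up in the job counters instead of disappearing silently.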

Daniel

On 4/13/17 1:33 PM, Pillis W wrote:
Hello,
I am using 'hadoop-streaming.jar' to do a simple word count and want to
skip records that fail during processing. Below is the actual command I run;
the mapper always fails on one record, which fails the whole job. The input
file has 3 lines, 1 of which is bad.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -D mapred.job.name=SkipTest \
  -D mapreduce.task.skip.start.attempts=1 \
  -D mapreduce.map.skip.maxrecords=1 \
  -D mapreduce.reduce.skip.maxgroups=1 \
  -D mapreduce.map.skip.proc.count.autoincr=false \
  -D mapreduce.reduce.skip.proc.count.autoincr=false \
  -D mapred.reduce.tasks=1 \
  -D mapred.map.tasks=1 \
  -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
  -input /user/hadoop/data/test1 \
  -output /user/hadoop/data/output-test-5 \
  -mapper "python wordcount-mapper-t1.py" \
  -reducer "python wordcount-reducer-t1.py"


Is skipping of bad records supported when MapReduce is used in streaming
mode?

Thanks in advance.
PW


