Hello, I am using 'hadoop-streaming.jar' to run a simple word count, and I want to skip records that fail during execution. The input file has 3 lines, 1 of which is bad. The mapper always fails on that one record, which in turn fails the whole job. Below is the actual command I run.
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D mapred.job.name=SkipTest \
    -Dmapreduce.task.skip.start.attempts=1 \
    -Dmapreduce.map.skip.maxrecords=1 \
    -Dmapreduce.reduce.skip.maxgroups=1 \
    -Dmapreduce.map.skip.proc.count.autoincr=false \
    -Dmapreduce.reduce.skip.proc.count.autoincr=false \
    -D mapred.reduce.tasks=1 \
    -D mapred.map.tasks=1 \
    -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
    -input /user/hadoop/data/test1 \
    -output /user/hadoop/data/output-test-5 \
    -mapper "python wordcount-mapper-t1.py" \
    -reducer "python wordcount-reducer-t1.py"

Is skipping of records supported when MapReduce is used in streaming mode?

Thanks in advance.
PW
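
For context, here is a minimal sketch of what a streaming mapper might need to do when the autoincr options are set to false as in the command above. It is not the actual wordcount-mapper-t1.py; the function names (map_line, report_processed, run) are hypothetical. The assumption is that, with auto-increment off, the script itself has to bump the processed-record counter after each record by writing a "reporter:counter:<group>,<counter>,<amount>" line to stderr, and that the group and counter names are the ones defined in Hadoop's SkipBadRecords class.

```python
#!/usr/bin/env python
import sys


def map_line(line):
    """Return (word, 1) pairs for one input line."""
    return [(word, 1) for word in line.split()]


def report_processed(stream, amount=1):
    # Streaming scripts update counters by writing a line of the form
    # reporter:counter:<group>,<counter>,<amount> to stderr.
    # Assumption: group/counter names match SkipBadRecords
    # (SkippingTaskCounters / MapProcessedRecords).
    stream.write(
        "reporter:counter:SkippingTaskCounters,MapProcessedRecords,%d\n" % amount
    )
    stream.flush()


def run(stdin, stdout, stderr):
    for line in stdin:
        # Emit tab-separated key/value pairs, one per word.
        for word, count in map_line(line):
            stdout.write("%s\t%d\n" % (word, count))
        # Tell the framework this record is done, so that a later crash
        # can be attributed to the record that follows it.
        report_processed(stderr)


if __name__ == "__main__":
    run(sys.stdin, sys.stdout, sys.stderr)
```

Whether the framework then narrows down and skips the bad record in streaming mode is exactly what I am unsure about; the sketch only shows the counter reporting side.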
