You have to modify wordcount-mapper-t1.py to just ignore the bad line. In the worst case, you should be able to do something like:

import sys

for line in sys.stdin:
  try:
    # Insert processing code here
    pass
  except Exception:
    # Error processing record, ignore it
    pass
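
For reference, here is a minimal sketch of what the whole mapper could look like with that guard in place. It assumes a plain word count that emits one "word<TAB>1" pair per token; the tokenization and counter names are illustrative, not your actual script:

#!/usr/bin/env python
# Minimal word-count mapper sketch -- illustrative only,
# showing where the try/except fits around the per-line work.
import sys

for line in sys.stdin:
  try:
    # Emit "word<TAB>1" for every whitespace-separated token
    for word in line.strip().split():
      print("%s\t%d" % (word, 1))
  except Exception:
    # Bad record: report it via a streaming counter and move on
    sys.stderr.write("reporter:counter:WordCount,BadRecords,1\n")

The reporter:counter line is optional; Hadoop streaming picks up "reporter:counter:<group>,<counter>,<amount>" lines on stderr, so skipped records show up in the job counters instead of disappearing silently.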

Daniel

On 4/13/17 1:33 PM, Pillis W wrote:
Hello,
I am using 'hadoop-streaming.jar' to do a simple word count and want to
skip records that fail during processing. Below is the actual command I run;
the mapper always fails on one record, which fails the whole job. The input
file has 3 lines, 1 of which is bad.

hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -D mapred.job.name=SkipTest \
  -D mapreduce.task.skip.start.attempts=1 \
  -D mapreduce.map.skip.maxrecords=1 \
  -D mapreduce.reduce.skip.maxgroups=1 \
  -D mapreduce.map.skip.proc.count.autoincr=false \
  -D mapreduce.reduce.skip.proc.count.autoincr=false \
  -D mapred.reduce.tasks=1 \
  -D mapred.map.tasks=1 \
  -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
  -input /user/hadoop/data/test1 \
  -output /user/hadoop/data/output-test-5 \
  -mapper "python wordcount-mapper-t1.py" \
  -reducer "python wordcount-reducer-t1.py"


Is skipping of bad records supported when MapReduce is used in streaming
mode?

Thanks in advance.
PW


