To quote the docs:
---
This feature can be used when map/reduce tasks crash deterministically
on certain input. This happens due to bugs in the map/reduce function.
The usual course would be to fix these bugs. But sometimes this is not
possible; perhaps the bug is in third-party libraries for which the
source code is not available. Because of this, the task never reaches
completion even with multiple attempts, and the complete data for that
task is lost.
With this feature, only a small portion of data surrounding the bad
record is lost, which may be acceptable for some user applications. See
setMapperMaxSkipRecords(Configuration, long).
---
Basically, it's a heavy-handed approach that you should only use as a
last resort.
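One thing to watch in streaming specifically: your command sets
mapreduce.map.skip.proc.count.autoincr=false, which tells the framework
that the application will advance the processed-record counter itself.
In a streaming job that reporting is done by writing reporter lines to
stderr. A rough sketch of what that would look like, assuming the
counter group and name match the constants on the SkipBadRecords class
("SkipBadRecords" / "MapProcessedRecords"):

import sys

for line in sys.stdin:
    # ... process the record and emit key/value pairs on stdout ...
    # Then tell the framework the record was processed, so it can
    # narrow down the bad record once skipping mode kicks in.
    sys.stderr.write("reporter:counter:SkipBadRecords,MapProcessedRecords,1\n")
    sys.stderr.flush()

This is a sketch, not a drop-in fix; the point is that without the
counter updates the framework has nothing to narrow the skip range with.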
Daniel
On 4/13/17 3:24 PM, Pillis W wrote:
Thanks Daniel.
Please correct me if I have understood this incorrectly, but according
to the documentation at
http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records
, it seems the sole purpose of this functionality is to tolerate
unknown failures/exceptions in mappers/reducers. If I were able to
catch all failures, I would not need to use this ability at all - is
that not true?
If I have understood it incorrectly, when would one use the feature to
skip bad records?
Regards,
PW
On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <[email protected]> wrote:
You have to modify wordcount-mapper-t1.py to just ignore the bad line.
In the worst case, you should be able to do something like:

import sys

for line in sys.stdin:
    try:
        # Insert processing code here
        pass
    except Exception:
        # Error processing record; ignore it
        pass
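
For example, a skip-tolerant word-count mapper might look like the
sketch below (hypothetical, since wordcount-mapper-t1.py wasn't posted;
it assumes the usual "word<TAB>1" streaming output convention):

import sys

for line in sys.stdin:
    try:
        # Emit "word<TAB>1" for each whitespace-separated token.
        for word in line.split():
            print("%s\t%d" % (word, 1))
    except Exception:
        # Couldn't process this record; drop it and keep going.
        pass

In a mapper this simple the guard will rarely fire, but it marks where
the protection belongs once real processing logic can throw.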
Daniel
On 4/13/17 1:33 PM, Pillis W wrote:
Hello,
I am using 'hadoop-streaming.jar' to do a simple word count,
and want to
skip records that fail execution. Below is the actual command
I run, and
the mapper always fails on one record, and hence fails the
job. The input
file is 3 lines with 1 bad line.
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -D mapred.job.name=SkipTest \
  -D mapreduce.task.skip.start.attempts=1 \
  -D mapreduce.map.skip.maxrecords=1 \
  -D mapreduce.reduce.skip.maxgroups=1 \
  -D mapreduce.map.skip.proc.count.autoincr=false \
  -D mapreduce.reduce.skip.proc.count.autoincr=false \
  -D mapred.reduce.tasks=1 \
  -D mapred.map.tasks=1 \
  -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
  -input /user/hadoop/data/test1 \
  -output /user/hadoop/data/output-test-5 \
  -mapper "python wordcount-mapper-t1.py" \
  -reducer "python wordcount-reducer-t1.py"
I was wondering: is record skipping supported when MapReduce is used in
streaming mode?
Thanks in advance.
PW