[ https://issues.apache.org/jira/browse/HADOOP-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574404#action_12574404 ]

Devaraj Das commented on HADOOP-153:
------------------------------------

Here is a proposal that tries to address the scenarios discussed in this jira 
(a rough sketch of the proposed hooks follows the list):
0) Define the concept of a _failed record number_ that is set by Tasks and 
propagated to the JobTracker on task failures. This becomes part of the TIP 
object at the JobTracker.
1) Define an API in RecordReader for locating the record boundary. On getting 
an exception in RecordReader.next, the task starts from the beginning of the 
last successfully read record, scans forward till the boundary, and reads the 
next record from that point (ignoring the boundary bytes). This applies to maps.
2) Define an API in RecordWriter for writing a record boundary along with 
every write(k,v). The record boundary can default to the sync bytes. Tasks fail 
when they get an exception while writing a bad record. With (0), those records 
can be skipped in the subsequent retries. This applies to the outputs of maps 
and reduces.
3) Define an API in RecordReader for indicating whether recovery should be 
attempted while reading records or not (useful, for example, if the 
RecordReader's next() method has side effects that would affect the reading of 
the subsequent record after an exception on the current record). 
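
To make (1)-(3) a bit more concrete, here is a minimal sketch of what such 
reader/writer hooks could look like. The interface and method names below are 
made up for illustration; they are not existing Hadoop APIs.

{code:java}
import java.io.IOException;

import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.RecordWriter;

// Illustrative only: hypothetical hooks, not existing Hadoop interfaces.
public interface RecoverableRecordReader<K, V> extends RecordReader<K, V> {

  // (3) Whether this reader can safely recover after an exception in next();
  // readers whose next() has side effects would return false.
  boolean isRecoverySupported();

  // (1) Scan forward from the last successfully read record to the next
  // record boundary (e.g. the sync bytes), ignore the boundary bytes, and
  // position the reader so the following next() starts on a clean record.
  void skipToNextRecordBoundary() throws IOException;
}

// (2) A writer that emits a record boundary (defaulting to the sync bytes)
// along with every write(k, v), so a re-executed task can recognize and skip
// the partial output of a bad record.
interface BoundaryWritingRecordWriter<K, V> extends RecordWriter<K, V> {
  void writeRecordBoundary() throws IOException;
}
{code}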

When an application throws an exception in the map/reduce method, the 
exception is caught by the Task code that invoked the map/reduce method. The 
task attempt is killed after noting the record number of the failed 
invocation. With point (0) above, these records are not fed to the m/r methods 
in the subsequent retries.
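
A rough sketch of the task-side map loop under this scheme, assuming the set 
of previously failed record numbers has been shipped to the attempt; the class 
name, the failedRecords set and reportFailedRecord() are illustrative only, 
not existing Hadoop code:

{code:java}
import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical runner sketch, not existing Hadoop code.
public class SkippingMapRunner<K1, V1, K2, V2> {

  // (0) Record numbers that failed in earlier attempts of this TIP.
  private final Set<Long> failedRecords;

  public SkippingMapRunner(Set<Long> failedRecords) {
    this.failedRecords = failedRecords;
  }

  public void run(RecordReader<K1, V1> reader, Mapper<K1, V1, K2, V2> mapper,
                  OutputCollector<K2, V2> collector, Reporter reporter)
      throws IOException {
    K1 key = reader.createKey();
    V1 value = reader.createValue();
    long recordNum = 0;
    while (reader.next(key, value)) {
      recordNum++;
      if (failedRecords.contains(recordNum)) {
        continue;  // skip a record that failed in an earlier attempt
      }
      try {
        mapper.map(key, value, collector, reporter);
      } catch (Exception e) {
        // Note the failed record number, report it so the JobTracker can
        // attach it to the TIP, and fail this attempt; the next attempt
        // will skip this record.
        reportFailedRecord(recordNum, e);
        throw new RuntimeException("Failing attempt at record " + recordNum, e);
      }
    }
  }

  // Hypothetical hook: a real implementation would report this through the
  // task's status updates to the TaskTracker/JobTracker.
  private void reportFailedRecord(long recordNum, Exception cause) {
    // left unimplemented in this sketch
  }
}
{code}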

The recovery/skip above is done on a best-effort basis. That is, in the worst 
case, tasks simply fail!

The above strategies should at least allow configuring the maximum % of 
records that will keep the recovery/skip cycle going before the job is 
declared failed. Also, the "skip records on re-execution" behavior should 
probably be configurable on a per-job basis by the user (since the exceptions 
could have been caused by reasons other than incorrect input).
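
For instance, the per-job knobs could be ordinary job configuration entries 
along these lines; the property names are placeholders, not committed 
configuration keys:

{code:java}
import org.apache.hadoop.mapred.JobConf;

public class SkipConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SkipConfigExample.class);
    // Hypothetical property names, for illustration only.
    conf.setBoolean("mapred.skip.records.on.reexecution", true); // opt in per job
    conf.setInt("mapred.skip.records.max.percent", 5);           // declare the job failed past 5% skipped
  }
}
{code}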

Thoughts?

> skip records that throw exceptions
> ----------------------------------
>
>                 Key: HADOOP-153
>                 URL: https://issues.apache.org/jira/browse/HADOOP-153
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>            Assignee: Devaraj Das
>             Fix For: 0.17.0
>
>
> MapReduce should skip records that throw exceptions.
> If the exception is thrown under RecordReader.next() then RecordReader 
> implementations should automatically skip to the start of a subsequent record.
> Exceptions in map and reduce implementations can simply be logged, unless 
> they happen under RecordWriter.write().  Cancelling partial output could be 
> hard.  So such output errors will still result in task failure.
> This behaviour should be optional, but enabled by default.  A count of errors 
> per task and job should be maintained and displayed in the web ui.  Perhaps 
> if some percentage of records (>50%?) result in exceptions then the task 
> should fail.  This would stop jobs early that are misconfigured or have buggy 
> code.
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
