[ https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772300#action_12772300 ]

Christian Kunz commented on MAPREDUCE-1171:
-------------------------------------------

The algorithm in hadoop-0.20 that decides when to mark a map task as having 
too many fetch failures seems to have serious problems with jobs that have a 
few outliers in map task runtimes.

The job mentioned in the description has a couple of map tasks requiring about 
3 hrs of runtime (1 hr of application execution time and a 2 hr merge phase), 
while the overall average map runtime is much smaller.
The consequence is that all reduces just wait for these few map tasks to 
finish, and those maps get declared 'failed' very quickly because of many 
socket timeout exceptions on the reduce side, although there is nothing wrong 
with the map tasks themselves.

The first attempt to run the job failed after 28 hrs (we allow up to 10 map 
task failures): only about 200 reduces managed to fetch map output each time 
the long-running map tasks completed, and those map tasks were declared failed 
within less than 1 minute of completing.

We deployed an emergency fix in the ReduceTask, commenting out the portion 
where a single read error results in a report:
{noformat}
  
              if (//cr.getError().equals(CopyOutputErrorType.READ_ERROR) ||
                 ((noFailedFetches >= fetchRetriesPerMap)
                  && ((noFailedFetches - fetchRetriesPerMap) % 2) == 0)) {
                synchronized (ReduceTask.this) {
                  taskStatus.addFetchFailedMap(mapTaskId);
{noformat}
and it helped somewhat.
The job finished after 6 re-executions of the long-running map tasks 
(fortunately, we allow a high number of map task failures), but it still took 
18 hrs instead of an optimal time of less than 4 hrs. About 600 reduces got map 
output successfully before the map tasks were declared failed, always because 
of 5+ socket timeout exceptions (the hadoop-0.20 improvement of retrying a 
fetch within a few seconds was counter-productive here).
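
As a rough, stand-alone model of what the emergency fix changes (the names 
mirror the snippet above, but this is an illustration, not the actual 
ReduceTask code): in stock 0.20 a single READ_ERROR reports immediately, while 
the patched condition reports only after the retry threshold is reached.

```java
// Illustrative model of the report-to-JobTracker condition; field and
// enum names follow the snippet above, but this is not Hadoop source.
public class ReportCondition {
    enum CopyOutputErrorType { NO_ERROR, READ_ERROR, OTHER_ERROR }

    // Stock 0.20: any READ_ERROR reports at once; otherwise a report is
    // sent at the retry limit and on every second failure after it.
    static boolean stock(CopyOutputErrorType err, int noFailedFetches,
                         int fetchRetriesPerMap) {
        return err == CopyOutputErrorType.READ_ERROR
                || (noFailedFetches >= fetchRetriesPerMap
                    && (noFailedFetches - fetchRetriesPerMap) % 2 == 0);
    }

    // Emergency fix: READ_ERROR no longer short-circuits; only the
    // retry count matters.
    static boolean patched(CopyOutputErrorType err, int noFailedFetches,
                           int fetchRetriesPerMap) {
        return noFailedFetches >= fetchRetriesPerMap
                && (noFailedFetches - fetchRetriesPerMap) % 2 == 0;
    }
}
```

With 3000 reduces hitting one trailing map, the stock condition means a single 
slow response per reduce is enough to flood the JobTracker with failure 
reports; the patched condition at least requires the full retry budget first.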

With hadoop-0.18.3 this job needed less than 9 hrs (not optimal, but we could 
live with it).

Take away:
The current algorithm is not general enough to cover all corner cases, in 
particular jobs with a handful of much longer-running map tasks.
It should be improved (there should be an allowance for socket timeout 
exceptions when a high number of reduces fetch output from a single map task 
in parallel), or there should be a configuration option allowing the threshold 
to be overridden on a per-job basis.
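
The kind of allowance suggested above could be sketched as a stand-alone 
policy (a hypothetical illustration, not a proposed patch; the class, fields, 
and scaling rule are all made up): the report threshold grows with the number 
of reduces fetching from the same map in parallel, and the base value could be 
made per-job configurable.

```java
// Hypothetical sketch of a more forgiving fetch-failure policy.
public class FetchFailurePolicy {
    private final int baseRetriesPerMap;        // per-job overridable base threshold
    private final int allowancePer100Fetchers;  // extra retries per 100 concurrent reduces

    public FetchFailurePolicy(int baseRetriesPerMap, int allowancePer100Fetchers) {
        this.baseRetriesPerMap = baseRetriesPerMap;
        this.allowancePer100Fetchers = allowancePer100Fetchers;
    }

    // Report to the JobTracker only once the failure count reaches a
    // threshold that scales with how many reduces hit this map at once,
    // and then only on every second failure, to damp the report rate.
    public boolean shouldReport(int failedFetches, int concurrentFetchers) {
        int threshold = baseRetriesPerMap
                + (concurrentFetchers / 100) * allowancePer100Fetchers;
        return failedFetches >= threshold
                && (failedFetches - threshold) % 2 == 0;
    }
}
```

With 3000 reduces all fetching one trailing map, a base of 5 and an allowance 
of 2 would raise the threshold to 5 + 30*2 = 65 failures before the first 
report, instead of 5.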

> Lots of fetch failures
> ----------------------
>
>                 Key: MAPREDUCE-1171
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1171
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.20.1
>            Reporter: Christian Kunz
>
> Since we upgraded to hadoop-0.20.1 from hadoop-0.18.3, we see a lot more map 
> task failures because of 'Too many fetch-failures'.
> One of our jobs makes hardly any progress, because of 3000 reduces not able 
> to get map output of 2 trailing maps (with about 80GB output each), which 
> repeatedly are marked as failures because of reduces not being able to get 
> their map output.
> One difference from hadoop-0.18.3 seems to be that reduce tasks report a 
> failed map-output fetch after even a single try when it was a read error 
> (cr.getError().equals(CopyOutputErrorType.READ_ERROR)). I do not think this 
> is a good idea, as trailing map tasks will be attacked by all reduces 
> simultaneously.
> Here is a log output of a reduce task:
> {noformat}
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: 
> attempt_200910281903_0028_r_000000_0 copy failed: 
> attempt_200910281903_0028_m_002781_1 from some host
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: 
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>         at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
>         at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
>         at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
>         at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
>         at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task 
> attempt_200910281903_0028_r_000000_0: Failed fetch #1 from 
> attempt_200910281903_0028_m_002781_1
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to 
> fetch map-output from attempt_200910281903_0028_m_002781_1 even after 
> MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to 
> the JobTracker.
> {noformat}
> Also I saw a few log messages which look suspicious as if successfully 
> fetched map output is discarded because of the map being marked as failed 
> (because of too many fetch failures). This would make the situation even 
> worse.
> {noformat}
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: header: 
> attempt_200910281903_0028_m_001076_0, compressed len: 21882555, decompressed 
> len: 23967845
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
> 23967845 bytes (21882555 raw bytes) into RAM from 
> attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Read 
> 23967845 bytes from map-output for attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from 
> attempt_200910281903_0028_m_001076_0 -> (20, 39772) from some host
> ...
> 2009-10-29 22:10:07,220 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring 
> obsolete output of FAILED map-task: 'attempt_200910281903_0028_m_001076_0'
> {noformat}

-- 
This message is automatically generated by JIRA.