[ https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772424#action_12772424 ]
Amareshwari Sriramadasu commented on MAPREDUCE-1171:
----------------------------------------------------

Christian, are you using the Yahoo! distribution of 0.20? In branch 0.21, MAPREDUCE-353 makes the connect and read timeouts configurable per job. Moreover, the shuffle is simplified by MAPREDUCE-318; essentially, HADOOP-3327 is no longer there.

Christian, making the connect and read timeouts configurable should address this issue, right?

> Lots of fetch failures
> ----------------------
>
>                 Key: MAPREDUCE-1171
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1171
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.21.0
>            Reporter: Christian Kunz
>
> Since we upgraded from hadoop-0.18.3 to hadoop-0.20.1, we see many more map task failures because of 'Too many fetch-failures'.
>
> One of our jobs makes hardly any progress because 3000 reduces are not able to get the map output of 2 trailing maps (with about 80GB of output each), which are repeatedly marked as failed because reduces cannot get their map output.
>
> One difference from hadoop-0.18.3 seems to be that reduce tasks report a failed map-output fetch even after a single try when it was a read error (cr.getError().equals(CopyOutputErrorType.READ_ERROR)). I do not think this is a good idea, as trailing map tasks will be attacked by all reduces simultaneously.
> Here is a log output of a reduce task:
> {noformat}
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200910281903_0028_r_000000_0 copy failed: attempt_200910281903_0028_m_002781_1 from some host
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200910281903_0028_r_000000_0: Failed fetch #1 from attempt_200910281903_0028_m_002781_1
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200910281903_0028_m_002781_1 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker.
> {noformat}
>
> Also, I saw a few log messages which look suspicious, as if successfully fetched map output is discarded because the map is marked as failed (because of too many fetch failures). This would make the situation even worse.
> {noformat}
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200910281903_0028_m_001076_0, compressed len: 21882555, decompressed len: 23967845
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 23967845 bytes (21882555 raw bytes) into RAM from attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Read 23967845 bytes from map-output for attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200910281903_0028_m_001076_0 -> (20, 39772) from some host
> ...
> 2009-10-29 22:10:07,220 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of FAILED map-task: 'attempt_200910281903_0028_m_001076_0'
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
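For context, the shuffle copier in the stack trace above fetches map output over java.net.HttpURLConnection, so the timeouts MAPREDUCE-353 makes job-configurable ultimately land on that connection. A minimal sketch of that mechanism follows; the constant values and the URL are illustrative assumptions, not taken from the actual patch:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShuffleTimeoutSketch {
    // Assumed example values; MAPREDUCE-353 makes the real timeouts configurable per job.
    static final int CONNECT_TIMEOUT_MS = 60_000;
    static final int READ_TIMEOUT_MS = 180_000;

    // Open a map-output URL with explicit timeouts instead of the JVM's
    // unbounded defaults, so a stuck tasktracker surfaces as a
    // java.net.SocketTimeoutException rather than a hung copier thread.
    static HttpURLConnection openWithTimeouts(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        // The "Read timed out" in the log above is raised when a read blocks
        // longer than this limit.
        conn.setReadTimeout(READ_TIMEOUT_MS);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // openConnection() performs no network I/O, so this runs offline.
        HttpURLConnection conn =
                openWithTimeouts(new URL("http://tasktracker:50060/mapOutput"));
        System.out.println(conn.getConnectTimeout() + "/" + conn.getReadTimeout());
    }
}
```

Raising the read timeout for jobs with very large map outputs (such as the 80GB maps here) would give trailing maps more headroom before a READ_ERROR is reported to the JobTracker.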