[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391516#comment-14391516
 ] 

Jason Lowe commented on MAPREDUCE-6303:
---------------------------------------

Sample reduce log snippet showing the issue:

{noformat}
2015-03-28 00:31:54,393 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#7
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:150)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:633)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:579)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1322)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.verifyConnection(Fetcher.java:427)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.setupConnectionsWithRetry(Fetcher.java:392)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:338)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)

2015-03-28 00:31:54,511 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
{noformat}

The problem is that the code caught an IOException while trying to shuffle, and 
then code within the catch block threw _again_.  That second exception leaks up 
to the top of the Fetcher thread and kills the task.
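
To illustrate the failure pattern in isolation, here is a minimal sketch.  It 
is not the Hadoop code: the class FetcherRetrySketch, the Connector interface, 
and both methods are hypothetical names invented for this example.  It only 
shows how an exception thrown from inside a catch block escapes that block, 
and one possible way to contain it by guarding the retry with its own 
try/catch.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch; none of these names come from the Hadoop source.
class FetcherRetrySketch {

    /** Stand-in for a connection attempt that may fail with an IOException. */
    interface Connector {
        void connect() throws IOException;
    }

    /**
     * Leaky pattern: the retry runs inside the catch block, so a second
     * IOException thrown there is not handled and propagates to the caller.
     */
    static String leakyFetch(Connector first, Connector retry) throws IOException {
        try {
            first.connect();
            return "fetched";
        } catch (IOException e) {
            // An exception thrown here is NOT caught by this catch block.
            retry.connect();
            return "fetched after retry";
        }
    }

    /**
     * Guarded pattern: the retry has its own try/catch, so a failure is
     * reported as a result instead of escaping to the caller.
     */
    static String guardedFetch(Connector first, Connector retry) {
        try {
            first.connect();
            return "fetched";
        } catch (IOException e) {
            try {
                retry.connect();
                return "fetched after retry";
            } catch (IOException retryFailure) {
                return "host failed: " + retryFailure.getMessage();
            }
        }
    }

    public static void main(String[] args) {
        Connector reset = () -> { throw new IOException("connection reset"); };
        Connector timeout = () -> { throw new SocketTimeoutException("Read timed out"); };
        try {
            leakyFetch(reset, timeout);
        } catch (IOException e) {
            System.out.println("leaky: escaped with " + e.getMessage());
        }
        System.out.println("guarded: " + guardedFetch(reset, timeout));
    }
}
```

In the leaky variant the caller sees the SocketTimeoutException from the retry 
rather than the original error, which matches the shape of the trace above. 
In the guarded variant the same failure stays local and can be treated as a 
problem with that one host.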

> Read timeout when retrying a fetch error can be fatal to a reducer
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6303
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6303
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> If a reducer encounters an error trying to fetch from a node and then 
> encounters a read timeout when trying to re-establish the connection, the 
> reducer can fail.  The read timeout exception can leak to the top of the 
> Fetcher thread, which will cause the reduce task to tear down.  This type of 
> error can repeat across reducer attempts, causing jobs to fail due to a 
> single bad node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
