[ https://issues.apache.org/jira/browse/MAPREDUCE-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748984#action_12748984 ]
Steve Loughran commented on MAPREDUCE-935:
------------------------------------------

Looking at the thread dump, it also looks like the exponential backoff in {{Client.handleConnectFailure()}} is interfering with heartbeats. A failure to connect to the server triggers the backoff, which stops progress from being reported.

> There's little to be gained by putting a host into the penaltybox at reduce
> time, if its the only host you have
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-935
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-935
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.21.0
>            Reporter: Steve Loughran
>
> Exponential backoff may be good for dealing with troublesome hosts, but not
> if you only have one host in the entire system. From the log of
> {{TestNodeRefresh}}, which for some reason is blocking in the reduce phase, I
> can see it doesn't take much for the backoff to kick in so rapidly that the
> reducer is waiting for longer than the test
> {code}
> 2009-08-28 21:39:16,788 WARN mapred.ReduceTask (ReduceTask.java:fetchOutputs(2192)) - attempt_20090828213826033_0001_r_000000_0 adding host localhost to penalty box, next contact in 150 seconds
> {code}
> The result of this backoff process is that the reduce process ends up
> appearing to hang, getting killed from above.
> Note that this isn't the root cause of the problem, but it certainly
> amplifies things.
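
To illustrate the point about single-host clusters, here is a minimal sketch of a penalty box that caps its backoff when there is only one host to fetch from. This is not the actual {{ReduceTask}} code: the class name {{PenaltyBoxSketch}}, the doubling policy, and the constants (10 s base penalty, 15 s single-host cap) are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; not Hadoop's ReduceTask penalty box. */
final class PenaltyBoxSketch {
    // Assumed values, picked for illustration.
    static final long BASE_PENALTY_MS = 10_000L;
    static final long SINGLE_HOST_CAP_MS = 15_000L;
    // Bounds the shift so the doubling cannot overflow a long.
    static final int MAX_DOUBLINGS = 20;

    private final Map<String, Integer> failures = new HashMap<>();
    private final int knownHosts;

    PenaltyBoxSketch(int knownHosts) {
        this.knownHosts = knownHosts;
    }

    /** Records a fetch failure and returns the wait (ms) before retrying the host. */
    long recordFailure(String host) {
        int n = failures.merge(host, 1, Integer::sum);
        // Plain exponential backoff: double the penalty on each failure.
        long penalty = BASE_PENALTY_MS << Math.min(n - 1, MAX_DOUBLINGS);
        // With a single source host there is nowhere else to fetch from,
        // so a long penalty only stalls the reducer; cap it instead.
        if (knownHosts <= 1) {
            penalty = Math.min(penalty, SINGLE_HOST_CAP_MS);
        }
        return penalty;
    }

    /** Clears the failure count after a successful fetch. */
    void recordSuccess(String host) {
        failures.remove(host);
    }
}
```

With several hosts, repeated failures against one host give waits of 10 s, 20 s, 40 s and so on; with {{knownHosts == 1}} the wait never exceeds the cap, so the reducer keeps retrying (and reporting progress) rather than sitting idle for minutes.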