[ https://issues.apache.org/jira/browse/MAPREDUCE-6156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208162#comment-14208162 ]
Jason Lowe commented on MAPREDUCE-6156: --------------------------------------- Thanks for the patch, Junping! Sorry for missing this in the review of MAPREDUCE-5891. Rather than throwing the original IOException when we stop retrying, it might be more informative to wrap it with a message stating how many attempts and how long we waited total to try to connect. Otherwise we could throw a socket timeout exception indicated we only waited 1ms depending upon how the attempts lined up with the retries. Speaking of trying up to the retry timeout, we're computing a potentially lower connection timeout value based on the amount of time left in total retry time but then we could end up sleeping most of that time away just below. Arguably we should be computing the socket timeout time just before we try to connect rather than just after we received the error. I don't think we should always ignore InterruptedException. That's probably what we'll get when the Fetcher threads are trying to be shut down, so I think we should return if {{stopped==true}} when that's received. > Fetcher - connect() doesn't handle connection refused correctly > ---------------------------------------------------------------- > > Key: MAPREDUCE-6156 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6156 > Project: Hadoop Map/Reduce > Issue Type: Bug > Reporter: sidharta seethana > Assignee: Junping Du > Priority: Blocker > Attachments: MAPREDUCE-6156.patch > > > The connect() function in the fetcher assumes that whenever an IOException is > thrown, the amount of time passed equals "connectionTimeout" ( see code > snippet below ). This is incorrect. For example, in case the NM is down, an > ConnectException is thrown immediately - and the catch block assumes a minute > has passed when it is not the case. > {code} > if (connectionTimeout < 0) { > throw new IOException("Invalid timeout " > + "[timeout = " + connectionTimeout + " ms]"); > } else if (connectionTimeout > 0) { > unit = Math.min(UNIT_CONNECT_TIMEOUT, connectionTimeout); > } > // set the connect timeout to the unit-connect-timeout > connection.setConnectTimeout(unit); > while (true) { > try { > connection.connect(); > break; > } catch (IOException ioe) { > // update the total remaining connect-timeout > connectionTimeout -= unit; > // throw an exception if we have waited for timeout amount of time > // note that the updated value if timeout is used here > if (connectionTimeout == 0) { > throw ioe; > } > // reset the connect timeout for the last try > if (connectionTimeout < unit) { > unit = connectionTimeout; > // reset the connect time out for the final connect > connection.setConnectTimeout(unit); > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)