[ https://issues.apache.org/jira/browse/HADOOP-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated HADOOP-12622: -------------------------------- Status: Patch Available (was: Open) > RetryPolicies (other than FailoverOnNetworkExceptionRetry) should put on > retry failed reason or the log from RMProxy's retry could be very misleading. > ------------------------------------------------------------------------------------------------------------------------------------------------------ > > Key: HADOOP-12622 > URL: https://issues.apache.org/jira/browse/HADOOP-12622 > Project: Hadoop Common > Issue Type: Bug > Components: auto-failover > Affects Versions: 2.7.0, 2.6.0 > Reporter: Junping Du > Assignee: Junping Du > Attachments: HADOOP-12622.patch > > > In debugging a NM retry connection to RM (non-HA), the NM log during RM down > time is very misleading: > {noformat} > 2015-12-07 11:37:14,098 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:15,099 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:16,101 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:17,103 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:18,105 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:19,107 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:20,109 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:21,112 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 7 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:22,113 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:23,115 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:54,120 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:55,121 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:56,123 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:57,125 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 3 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:58,126 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 4 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:37:59,128 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 5 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 2015-12-07 11:38:00,130 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: 0.0.0.0/0.0.0.0:8031. Already tried 6 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > {noformat} > It actually only log client side retry on NetworkConnection failure but not > include any info on RetryInvocationHandler where the real retry policy works. > From the code below in RetryInvocationHandler.java, even the retry ends, we > don't put warn messages to include how much/many time/ counts we spent on > retry logic that make it harder to debug. > {code} > if (failAction != null) { > if (failAction.reason != null) { > LOG.warn("Exception while invoking " + > currentProxy.proxy.getClass() > + "." + method.getName() + " over " + currentProxy.proxyInfo > + ". Not retrying because " + failAction.reason, ex); > } > throw ex; > } > {code} > We should add failAction.reason as much as we can in multiple retry policies. > In addition, we should keep consistent in log level for message during the > retry attempts: now the ipc.client is INFO, but RetryInvocationHandler is > DEBUG (if not fail_over). We should keep them consistent or it could be very > confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)