[ https://issues.apache.org/jira/browse/HADOOP-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016808#comment-14016808 ]
Hudson commented on HADOOP-10630:
---------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1790 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1790/])
HADOOP-10630. Possible race condition in RetryInvocationHandler. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1599366)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java

> Possible race condition in RetryInvocationHandler
> -------------------------------------------------
>
>                 Key: HADOOP-10630
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10630
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>             Fix For: 2.5.0
>
>         Attachments: HADOOP-10630.000.patch
>
>
> In one of our system tests with a NameNode HA setup, we ran 300 threads in
> LoadGenerator. Although one of the NameNodes was already in the active state
> and had started to serve requests, we still saw one of the client threads
> fail all of its retries within a 20-second window. In the meantime, we saw
> many occurrences of the following warning message in the log:
> {noformat}
> WARN retry.RetryInvocationHandler: A failover has occurred since the start of this method invocation attempt.
> {noformat}
> After checking the code, we see the following in RetryInvocationHandler:
> {code}
> while (true) {
>   // The number of times this invocation handler has ever been failed over,
>   // before this method invocation attempt. Used to prevent concurrent
>   // failed method invocations from triggering multiple failover attempts.
>   long invocationAttemptFailoverCount;
>   synchronized (proxyProvider) {
>     invocationAttemptFailoverCount = proxyProviderFailoverCount;
>   }
>   ......
>   if (action.action == RetryAction.RetryDecision.FAILOVER_AND_RETRY) {
>     // Make sure that concurrent failed method invocations only cause a
>     // single actual fail over.
>     synchronized (proxyProvider) {
>       if (invocationAttemptFailoverCount == proxyProviderFailoverCount) {
>         proxyProvider.performFailover(currentProxy.proxy);
>         proxyProviderFailoverCount++;
>         currentProxy = proxyProvider.getProxy();
>       } else {
>         LOG.warn("A failover has occurred since the start of this method"
>             + " invocation attempt.");
>       }
>     }
>     invocationFailoverCount++;
>   }
>   ......
> {code}
> We can see that the value of currentProxy is refreshed only by the thread
> that performs the failover (while holding the monitor of the proxyProvider).
> Because "currentProxy" is not volatile, a thread that does not perform the
> failover (in which case it logs the warning message) may fail to get the new
> value of currentProxy.
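
For illustration only, here is a minimal sketch of one way to close the gap described above: re-read the proxy under the proxyProvider monitor on both branches, so a thread that loses the failover race still picks up the new proxy before it retries. This is not taken from the attached patch. SketchProxyProvider, SketchRetryHandler and their methods are hypothetical stand-ins, and the "proxy" is reduced to a String name so the example stays self-contained.

```java
/** Hypothetical stand-in for a failover proxy provider; not Hadoop's API. */
class SketchProxyProvider {
    private int activeIndex = 0;

    /** Switch to the next target. */
    synchronized void performFailover() {
        activeIndex++;
    }

    /** Name of the proxy that should currently be used. */
    synchronized String getProxy() {
        return "nn" + activeIndex;
    }
}

/** Hypothetical, stripped-down version of the retry handler's failover step. */
class SketchRetryHandler {
    private final SketchProxyProvider proxyProvider = new SketchProxyProvider();

    // Both fields are only read or written while holding proxyProvider's monitor.
    private long proxyProviderFailoverCount = 0;
    private String currentProxy = proxyProvider.getProxy();

    /** Taken at the start of an invocation attempt. */
    long snapshotFailoverCount() {
        synchronized (proxyProvider) {
            return proxyProviderFailoverCount;
        }
    }

    long failoverCount() {
        synchronized (proxyProvider) {
            return proxyProviderFailoverCount;
        }
    }

    /**
     * Called when an attempt decides to fail over and retry. Only the first
     * caller holding a matching snapshot performs the failover, but every
     * caller refreshes currentProxy under the lock and gets it back.
     */
    String failoverIfNeeded(long invocationAttemptFailoverCount) {
        synchronized (proxyProvider) {
            if (invocationAttemptFailoverCount == proxyProviderFailoverCount) {
                proxyProvider.performFailover();
                proxyProviderFailoverCount++;
            }
            // Refresh on both paths, not only on the path that failed over.
            currentProxy = proxyProvider.getProxy();
            return currentProxy;
        }
    }
}
```

With this shape, two callers holding the same stale snapshot both end up with the same post-failover proxy, and only one failover is performed. Declaring currentProxy volatile would be another option; which of the two fits RetryInvocationHandler better is a judgment call that this sketch does not try to settle.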