Kevin Wikant created YARN-11249:
-----------------------------------

             Summary: YARN Client Retry Policy inadvertently changed several 
years ago
                 Key: YARN-11249
                 URL: https://issues.apache.org/jira/browse/YARN-11249
             Project: Hadoop YARN
          Issue Type: Bug
          Components: client
            Reporter: Kevin Wikant


A non-HA & non-Federation YARN client should be using 
[OtherThanRemoteAndSaslExceptionDependentRetry|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L303]]
 (previously named OtherThanRemoteExceptionDependentRetry).

An HA or Federation YARN client should be using 
[FailoverOnNetworkExceptionRetry|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L267]].

However, the logic surrounding which Retry Policy gets used (by YARN client) 
has been inconsistent/unexpected due to past changes in the code.

[This 
PR|[https://github.com/apache/hadoop/commit/bdfad4523f1a5a776e20773c371ce99d0c538ac1]]
 introduced what is arguably a bug where FailoverOnNetworkExceptionRetry would 
get used for non-HA & non-Federation YARN clusters because of the default 
yarn-site configuration 
["yarn.federation.failover.enabled=true"|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102]].
 This was arguably a bug because "yarn.federation.failover.enabled=true" even 
if YARN federation is not enabled at all. The consequence of this change is 
that all over the world non-HA & non-Federation YARN clusters using a Hadoop 
release from after 2017-09 were unexpectedly using 
FailoverOnNetworkExceptionRetry when in reality they should be using 
OtherThanRemoteAndSaslExceptionDependentRetry. The only YARN clusters (using a 
Hadoop release from after 2017-09) that would not be impacted by the retry 
policy change would be those with a custom yarn-site configuration of 
"yarn.federation.failover.enabled=false"; I assume this is probably a fairly 
small fraction of YARN clusters around the world and so I suspect this retry 
policy change affected the majority of clusters around the world using an 
affected Hadoop release.

The code was recently corrected to change the Retry Policy used for non-HA & 
non-Federation YARN clusters; in 2022-07 this 
[change|[https://github.com/apache/hadoop/commit/213ea037589d8a69c34e1f98413ddbc06dfd42bf]]
 was committed which causes the correct Retry Policy of 
OtherThanRemoteAndSaslExceptionDependentRetry to be used for non-HA & 
non-Federation YARN clusters. The change was to return that [federation 
failover is not enabled unless federation is 
enabled|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java#L70]].

Hadoop clusters based on Hadoop releases created between the following 
approximate time ranges will have a different "default" YARN client retry 
policy (i.e. "default" retry policy in the sense that you will get it with a 
default yarn-site file containing "yarn.federation.failover.enabled=true"):
 - before 2017-09 ; YARN uses OtherThanRemoteAndSaslExceptionDependentRetry by 
default
 - 2017-09 to 2022-07 ; YARN uses FailoverOnNetworkExceptionRetry by default
 - 2022-07 into future ; YARN uses 
OtherThanRemoteAndSaslExceptionDependentRetry by default

Part of the problem here is that 
[FailoverOnNetworkExceptionRetry|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L698]]
 is arguably a superior retry policy than 
[OtherThanRemoteAndSaslExceptionDependentRetry|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L610]]
 because:
 - FailoverOnNetworkExceptionRetry will not retry on certain non-retryable 
exceptions such as 
[InvalidToken|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L729]]
 & 
[AccessControlException|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L732]]
 - OtherThanRemoteAndSaslExceptionDependentRetry will retry for 15 minutes on 
certain non-retryable exceptions such as [InvalidToken & 
AccessControlException|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L291]]

My concern is that YARN users who were previously seeing their YARN client not 
retry on InvalidToken or AccessControlException will now start to see the YARN 
client retry for 15 minutes on these exceptions because the Retry Policy for 
non-HA & non-Federation YARN client has been reverted back to 
OtherThanRemoteAndSaslExceptionDependentRetry.

I see a few options to correct this:
 - remove OtherThanRemoteAndSaslExceptionDependentRetry entirely & always use 
FailoverOnNetworkExceptionRetry
 -- the benefit of this approach is that it reduces lines of code & helps avoid 
cases of exception handling code divergence by maintaining multiple different 
retry policies
 -- I don't see any major risk to doing this because all Hadoop releases since 
after 2017-09 have been using FailoverOnNetworkExceptionRetry without the 
sysadmins even being aware of it & without any issues reported
 - bring OtherThanRemoteAndSaslExceptionDependentRetry to parity with 
FailoverOnNetworkExceptionRetry in terms of not retrying on InvalidToken or 
AccessControlException
 -- downside is that the exception handling in these retry policies may diverge 
again in the future
 - refactor so that both retry policies share the same exception handling code 
but may differ in other areas such as the RetryAction being returned
 -- may be as simple as using FailoverOnNetworkExceptionRetry's exception 
handling logic in a parent class but to return different RetryAction depending 
on subclass (i.e. FAILOVER_AND_RETRY for HA/federation vs RETRY for 
non-HA/non-federation)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org

Reply via email to