[ https://issues.apache.org/jira/browse/YARN-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554318#comment-17554318 ]
Steven Rand commented on YARN-11184:
------------------------------------

Possibly [ZOOKEEPER-2251|https://issues.apache.org/jira/browse/ZOOKEEPER-2251] is related? The thread dump is different, but it appears to be a similar problem of the {{StandByTransitionThread}} waiting indefinitely for a response. The ZK version used client side by Hadoop does not include the fix for that issue.

> fenced active RM not failing over correctly in HA setup
> -------------------------------------------------------
>
>                 Key: YARN-11184
>                 URL: https://issues.apache.org/jira/browse/YARN-11184
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.3
>            Reporter: Steven Rand
>            Priority: Major
>         Attachments: image-2022-06-14-16-38-00-336.png,
> image-2022-06-14-16-39-50-278.png, image-2022-06-14-16-41-39-742.png,
> image-2022-06-14-16-44-45-101.png
>
>
> We've observed an issue recently on a production cluster running 3.2.3 in
> which a fenced Resource Manager remains active, but does not communicate with
> the ZK state store, and therefore cannot function correctly. This did not
> occur while running 3.2.2 on the same cluster.
> In more detail, what seems to happen is:
> 1. The active RM gets a {{NodeExists}} error from ZK while storing an app in
> the state store. I suspect that this is caused by some transient connection
> issue that causes the first node creation request to succeed, but for the
> response to not reach the RM, triggering a duplicate request which fails with
> this error.
> !image-2022-06-14-16-38-00-336.png!
> 2. Because of this error, the active RM is fenced.
> !image-2022-06-14-16-39-50-278.png!
> 3. Because it is fenced, the active RM starts to transition to standby.
> !image-2022-06-14-16-41-39-742.png!
> 4. However, the RM never fully transitions to standby. It never logs
> {{Transitioning RM to Standby mode}} from the run method of
> {{StandByTransitionRunnable}}:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java#L1195].
> Related to this, a jstack of the RM shows that thread being {{RUNNABLE}},
> but evidently not making progress:
> !image-2022-06-14-16-44-45-101.png!
> So the RM doesn't work because it is fenced, but remains active, which causes
> an outage until a failover is manually initiated.
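
The sequence suspected in step 1 of the description (the first create succeeds, its response is lost, and the retry fails with {{NodeExists}}) can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions: {{FakeStore}}, {{ResponseLostException}}, {{createWithNaiveRetry}}, and the path {{/rmstore/app_1}} are hypothetical names invented here, and none of this is the ZooKeeper client API or the RM state store code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a ZK-like store; NOT the ZooKeeper client API.
public class LostResponseSketch {

    /** Hypothetical type mirroring the idea of a NodeExists error. */
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) { super("NodeExists: " + path); }
    }

    /** Hypothetical type: the request was applied but its response never arrived. */
    static class ResponseLostException extends Exception {
        ResponseLostException(String path) { super("response lost: " + path); }
    }

    static class FakeStore {
        private final Map<String, String> nodes = new HashMap<>();
        private boolean dropNextResponse;

        /** Arrange for the next successful create to lose its response. */
        void dropNextResponse() { dropNextResponse = true; }

        void create(String path, String data)
                throws NodeExistsException, ResponseLostException {
            if (nodes.containsKey(path)) {
                throw new NodeExistsException(path);
            }
            nodes.put(path, data); // the write is applied on the "server" side
            if (dropNextResponse) {
                dropNextResponse = false;
                // The caller never learns that the write succeeded.
                throw new ResponseLostException(path);
            }
        }

        String get(String path) { return nodes.get(path); }
    }

    /** Retries once after a lost response and reports the final outcome. */
    static String createWithNaiveRetry(FakeStore store, String path, String data) {
        try {
            store.create(path, data);
            return "created";
        } catch (ResponseLostException lost) {
            try {
                store.create(path, data); // duplicate request
                return "created";
            } catch (NodeExistsException e) {
                return "NodeExists";      // the node from the first attempt is already there
            } catch (ResponseLostException e) {
                return "lost";
            }
        } catch (NodeExistsException e) {
            return "NodeExists";
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.dropNextResponse();
        // Hypothetical path, chosen only for the example.
        System.out.println(createWithNaiveRetry(store, "/rmstore/app_1", "state")); // prints "NodeExists"
        System.out.println(store.get("/rmstore/app_1"));                            // prints "state"
    }
}
```

In the sketch the data is stored correctly even though the caller sees {{NodeExists}}, which matches the suspicion that the error is a side effect of a duplicate request rather than a real conflict.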