[ https://issues.apache.org/jira/browse/YARN-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wangzhihui updated YARN-11626:
------------------------------
    Priority: Minor  (was: Major)

> Optimization of the safeDelete operation in ZKRMStateStore
> ----------------------------------------------------------
>
>                 Key: YARN-11626
>                 URL: https://issues.apache.org/jira/browse/YARN-11626
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>            Reporter: wangzhihui
>            Priority: Minor
>
> h1. Description
> * The log below shows that removal of the app info started at 06:17:20, but
> the NoNodeException was not received until 06:17:35.
> * During this 15s interval, Curator was retrying the metadata operation.
> Because the ZooKeeper delete operation is not idempotent, one of the retry
> attempts succeeded on the server but its response never reached the client.
> The next retry then failed with a NoNodeException, which triggered the
> STATE_STORE_FENCED event and ultimately caused the current ResourceManager
> to switch to standby.
> {code:java}
> 2023-10-28 06:17:20,359 INFO  recovery.RMStateStore (RMStateStore.java:transition(333)) - Removing info for app: application_1697410508608_140368
> 2023-10-28 06:17:20,359 INFO  resourcemanager.RMAppManager (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1697410508608_140368 from memory:
> 2023-10-28 06:17:35,665 ERROR recovery.RMStateStore (RMStateStore.java:transition(337)) - Error removing app: application_1697410508608_140368
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 2023-10-28 06:17:35,666 INFO  recovery.RMStateStore (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from ACTIVE to FENCED
> 2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 2023-10-28 06:17:35,666 INFO  resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby state
> {code}
> h1. Solution
> The NoNodeException clearly indicates that the znode no longer exists, which
> is exactly the state the delete was meant to reach. We can therefore safely
> ignore this exception in safeDelete and avoid the much larger impact on the
> cluster of an unnecessary ResourceManager failover.
> h1. Other
> We also need to discuss and optimize the same issue in safeCreate.
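The idea in the Solution section can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual ZKRMStateStore or Curator code: `FlakyStore`, `NoNodeException`, `dropNextResponse` and the `safeDelete` signature are all hypothetical stand-ins, written only to reproduce the "delete succeeded but the response was lost" sequence from the Description and to show a retry loop that treats a missing node as success.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SafeDeleteSketch {

    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    /** Hypothetical in-memory store that can lose the response of one delete. */
    static class FlakyStore {
        private final Set<String> nodes = new HashSet<>();
        private boolean dropNextResponse;

        void create(String path) {
            nodes.add(path);
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }

        /** The next successful delete is applied, but its response is "lost". */
        void dropNextResponse() {
            dropNextResponse = true;
        }

        void delete(String path) throws NoNodeException, IOException {
            if (!nodes.remove(path)) {
                throw new NoNodeException(path);
            }
            if (dropNextResponse) {
                dropNextResponse = false;
                // The node is already gone, but the caller only sees a failure.
                throw new IOException("connection loss");
            }
        }
    }

    /**
     * Deletes path, retrying on connection loss. A NoNodeException is treated
     * as "already deleted" instead of being propagated as a fatal error.
     *
     * @return true if a delete call was acknowledged, false if the node was
     *         already absent when the (re)tried delete ran
     */
    static boolean safeDelete(FlakyStore store, String path, int maxRetries) {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                store.delete(path);
                return true;
            } catch (NoNodeException e) {
                // Desired end state already reached, possibly by an earlier
                // attempt whose response was lost: ignore and stop retrying.
                return false;
            } catch (IOException e) {
                last = e; // transient failure: retry
            }
        }
        throw new IllegalStateException("delete failed after retries", last);
    }
}
```

In this sketch, creating `/app`, calling `dropNextResponse()` and then `safeDelete(store, "/app", 3)` goes through one lost response and one NoNode retry, and returns normally with the node removed. Whether a real fix should also log the ignored exception, or check that the node was expected to exist before the first attempt, is left open here.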