[ https://issues.apache.org/jira/browse/YARN-9237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752086#comment-16752086 ]
Weiwei Yang commented on YARN-9237:
-----------------------------------

Hi [~yangjiandan]

This looks good to me. Even when finished apps are reported back to the RM, it only wastes the RM's time processing them, as shown below:
{code:java}
private static void handleRunningAppOnNode(RMNodeImpl rmNode,
    RMContext context, ApplicationId appId, NodeId nodeId) {
  ...
  // if we failed getting app by appId, maybe something wrong happened, just
  // add the app to the finishedApplications list so that the app can be
  // cleaned up on the NM
  if (null == app) {
    LOG.warn("Cannot get RMApp by appId=" + appId
        + ", just added it to finishedApplications list for cleanup");
    rmNode.finishedApplications.add(appId);
    rmNode.runningApplications.remove(appId);
    return;
  }
{code}
So I agree to ignore them in the report. It might be OK to skip applications in any of the FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP, and FINISHED states, but I agree with the safer option of ignoring only FINISHED here.

Regarding the patch, could you change
{code:java}
+        if (!appEntry.getValue().getApplicationState()
+            .equals(ApplicationState.FINISHED)) {
+          runningApplications.add(appEntry.getKey());
+        }
+      }
{code}
to
{code:java}
ApplicationState.FINISHED != appEntry.getValue().getApplicationState()
{code}
to guard against an unexpected null state?

Second, could you please add a unit test for this?
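
To illustrate the null-safety point, here is a minimal standalone sketch, not the actual patch. The class name FinishedAppFilter, the method runningApps, and the String app ids are hypothetical, and the nested ApplicationState enum is only a stand-in for the NM's real one. It shows that comparing with != keeps working when a state is null, whereas calling .equals() on that null state would throw a NullPointerException.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FinishedAppFilter {
    // Hypothetical stand-in for the NM's ApplicationState enum (assumption:
    // only the states mentioned in this thread, plus RUNNING, are listed).
    enum ApplicationState {
        RUNNING, FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP, FINISHED
    }

    // Returns the ids of all apps whose state is not FINISHED.
    // Enum constants are singletons, so != is a valid comparison, and it
    // does not dereference the state, so a null state cannot cause an NPE.
    static List<String> runningApps(Map<String, ApplicationState> apps) {
        List<String> running = new ArrayList<>();
        for (Map.Entry<String, ApplicationState> entry : apps.entrySet()) {
            if (ApplicationState.FINISHED != entry.getValue()) {
                running.add(entry.getKey());
            }
        }
        return running;
    }

    public static void main(String[] args) {
        Map<String, ApplicationState> apps = new LinkedHashMap<>();
        apps.put("app_1", ApplicationState.RUNNING);
        apps.put("app_2", ApplicationState.FINISHED);
        apps.put("app_3", null); // unexpected null state

        // Prints [app_1, app_3]; state.equals(FINISHED) would throw on app_3.
        System.out.println(runningApps(apps));
    }
}
```

Note that in this sketch an app with a null state is still reported as running, which matches the behaviour of the suggested != comparison; whether that is the desired handling is a separate question for the patch.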

Thanks

> RM prints a lot of "Cannot get RMApp by appId" log when RM failover
> -------------------------------------------------------------------
>
>                 Key: YARN-9237
>                 URL: https://issues.apache.org/jira/browse/YARN-9237
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Jiandan Yang
>            Assignee: Jiandan Yang
>            Priority: Major
>         Attachments: YARN-9237.001.patch
>
>
> I found many log entries like the following in the active RM's log file after an RM failover:
> {code:java}
> 2019-01-24 15:43:58,999 WARN
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Cannot get
> RMApp by appId=application_1542178952162_34746156, just added it to
> finishedApplications list for cleanup
> .....
> {code}
> I looked back through the RM logs and found that this app had finished hours earlier:
> {code:java}
> 2019-01-23 21:49:55,683 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1542178952162_34746156_000001 State change from FINAL_SAVING to
> FINISHING
> {code}
> The reason the RM prints "Cannot get RMApp by appId" is as follows:
> 1. The RM fails over.
> 2. Each NM reports all of its running apps to the RM in its register request.
> 3. The running apps are taken from NMContext, and some of them may already have finished.
> 4. In my cluster, yarn.log-aggregation-enable=false and
> yarn.nodemanager.log.retain-seconds=86400 (1 day), so an app is kept in NMContext
> for 24 hours after it has finished.
> 5. My YARN cluster has 7k nodes and runs 50k apps per day, so the NMs report
> many finished apps to the RM.