[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282726#comment-15282726 ]
Jason Lowe commented on YARN-4325:
----------------------------------

Appears Jenkins is having difficulty posting to JIRA.  Overall was +1 from https://builds.apache.org/job/PreCommit-YARN-Build/11448/console.

Patch is looking better, but there's still an issue in the NonAggregatingLogHandler.  First the added code seems redundant, since just a few lines earlier it sent the same event:
{code}
      // Inform the application before the actual delete itself, so that links
      // to logs will no longer be there on NM web-UI.
      NonAggregatingLogHandler.this.dispatcher.getEventHandler().handle(
          new ApplicationEvent(this.applicationId,
              ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED));
      if (localAppLogDirs.size() > 0) {
        NonAggregatingLogHandler.this.delService.delete(user, null,
            (Path[]) localAppLogDirs.toArray(new Path[localAppLogDirs.size()]));
      }
      try {
        NonAggregatingLogHandler.this.stateStore.removeLogDeleter(
            this.applicationId);
      } catch (IOException e) {
        LOG.error("Error removing log deletion state", e);
      } finally {
        NonAggregatingLogHandler.this.dispatcher.getEventHandler().handle(
            new ApplicationEvent(this.applicationId,
                ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED));
      }
{code}
It looks to me that once we get the LogDeleterRunnable going we're always sending the necessary event without any additional changes.  What I meant by my previous comment was fixing the early out from this code where we initially receive the finished event:
{code}
      case APPLICATION_FINISHED:
        LogHandlerAppFinishedEvent appFinishedEvent =
            (LogHandlerAppFinishedEvent) event;
        ApplicationId appId = appFinishedEvent.getApplicationId();
        // Schedule - so that logs are available on the UI till they're deleted.
        LOG.info("Scheduling Log Deletion for application: " + appId
            + ", with delay of " + this.deleteDelaySeconds + " seconds");
        String user = appOwners.remove(appId);
        if (user == null) {
          LOG.error("Unable to locate user for " + appId);
          break;
        }
{code}
In the unlikely event that we can't look up the user for an appID, we need to send a failed event so ApplicationImpl can clean up the app from the state store, since there won't be a LogDeleterRunnable to do it.

> Purge app state from NM state-store should cover more LOG_HANDLING cases
> ------------------------------------------------------------------------
>
>                 Key: YARN-4325
>                 URL: https://issues.apache.org/jira/browse/YARN-4325
>             Project: Hadoop YARN
>          Issue Type: Bug
>   Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: ApplicationImpl.PNG, YARN-4325-v1.1.patch, YARN-4325-v1.patch, YARN-4325-v2.patch, YARN-4325-v3.1.patch, YARN-4325-v3.patch, YARN-4325.patch
>
>
> On a long-running cluster, we found tens of thousands of stale apps still being recovered during NM restart recovery.
> After investigating, we found three issues that cause app state to leak in the NM state-store:
> 1. APPLICATION_LOG_HANDLING_FAILED is not handled by removing the app from the NMStateStore.
> 2. The APPLICATION_LOG_HANDLING_FAILED event is not sent when the aggregator's doAppLogAggregation() hits an exception.
> 3. Only an Application in FINISHED state that receives APPLICATION_LOG_FINISHED has a transition to remove the app from the NM state store. An Application in another state, such as APPLICATION_RESOURCES_CLEANUP, ignores the event and never removes the app from the NM state store, even after the app finishes.
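
For reference, the early-out fix requested in the comment (send a failed event when the user lookup for a finished app comes back empty) might look roughly like the sketch below. It is not the actual patch. EarlyOutSketch, RecordingDispatcher, LogHandler, the String app IDs and the scheduledDeletions list are hypothetical stand-ins for the real NodeManager classes; only the two event names are taken from the text above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EarlyOutSketch {
    // Event names as they appear in the comment and issue description.
    enum AppEventType {
        APPLICATION_LOG_HANDLING_FINISHED,
        APPLICATION_LOG_HANDLING_FAILED
    }

    // Hypothetical stand-in for the NM dispatcher: it only records what was sent.
    static final class RecordingDispatcher {
        final List<String> sent = new ArrayList<>();

        void handle(String appId, AppEventType type) {
            sent.add(appId + ":" + type);
        }
    }

    // Hypothetical, heavily simplified stand-in for the log handler.
    static final class LogHandler {
        private final Map<String, String> appOwners = new HashMap<>();
        private final RecordingDispatcher dispatcher;
        // Stands in for scheduling a LogDeleterRunnable, which would later
        // send the FINISHED event itself.
        final List<String> scheduledDeletions = new ArrayList<>();

        LogHandler(RecordingDispatcher dispatcher) {
            this.dispatcher = dispatcher;
        }

        void appStarted(String appId, String user) {
            appOwners.put(appId, user);
        }

        void appFinished(String appId) {
            String user = appOwners.remove(appId);
            if (user == null) {
                // Early out: no deleter will ever run for this app, so report
                // the failure now instead of silently dropping the app.
                dispatcher.handle(appId,
                    AppEventType.APPLICATION_LOG_HANDLING_FAILED);
                return;
            }
            scheduledDeletions.add(appId);
        }
    }

    public static void main(String[] args) {
        RecordingDispatcher dispatcher = new RecordingDispatcher();
        LogHandler handler = new LogHandler(dispatcher);

        handler.appFinished("app_1");          // owner unknown: failed event
        handler.appStarted("app_2", "alice");
        handler.appFinished("app_2");          // owner known: deletion scheduled

        System.out.println(dispatcher.sent);
        System.out.println(handler.scheduledDeletions);
    }
}
```

The point of the sketch is only that the unknown-user path emits an event at all; whether ApplicationImpl then removes the app from the state store depends on the FAILED-event handling listed as item 1 in the issue description.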