[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432723#comment-16432723 ]
Vrushali C commented on YARN-8130:
----------------------------------

Yes, I agree, we need a configurable delay like the collectorLingerPeriod in PerNodeTimelineCollectorsAuxService#removeApplicationCollector. We also need to check whether there are other places where we remove the app id from a map.

Relevant JIRAs for collectorLingerPeriod: YARN-3995 and YARN-7835.

> Race condition when container events are published for KILLED applications
> --------------------------------------------------------------------------
>
>                 Key: YARN-8130
>                 URL: https://issues.apache.org/jira/browse/YARN-8130
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>            Reporter: Charan Hebri
>            Priority: Major
>
> There seems to be a race condition happening when an application is KILLED
> and the corresponding container event information is being published. For
> completed containers, a YARN_CONTAINER_FINISHED event is generated but for
> some containers in a KILLED application this information is missing.
> Below is a node manager log snippet,
> {code:java}
> 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application application_1523259757659_0003 removed, cleanupLocalDirs = false
> 2018-04-09 08:44:54,478 INFO application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_1523259757659_0003 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
> 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been removed before the entity could be published for TimelineEntity[type='YARN_CONTAINER', id='container_1523259757659_0003_01_000002']
> 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just finished : application_1523259757659_0003
> 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs for container container_1523259757659_0003_01_000001. Current good log dirs are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs for container container_1523259757659_0003_01_000002. Current good log dirs are /grid/0/hadoop/yarn/log
> 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:remove(192)) - The collector service for application_1523259757659_0003 was removed
> 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1572)) - couldn't find application application_1523259757659_0003 while processing FINISH_APPS event. The ResourceManager allocated resources for this application to the NodeManager but no active containers were found to process
> {code}
> The container id specified in the log, *container_1523259757659_0003_01_000002* is the one that has the finished event missing.
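
For illustration only, the configurable delay suggested in the comment could look roughly like the sketch below: instead of dropping the per-application client from the map as soon as the application finishes, removal is scheduled after a linger period so that late container events can still be published. This is not the Hadoop code. The class name LingeringClientRegistry, its methods, and the lingerMillis parameter are hypothetical, and the sketch assumes the clients are kept in a simple map keyed by application id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch (not a Hadoop class): keeps a per-application client
 * available for a configurable linger period after the application finishes.
 */
class LingeringClientRegistry<C> {
  // Clients keyed by application id (assumed layout, for illustration).
  private final Map<String, C> clients = new ConcurrentHashMap<>();

  // Single daemon thread that performs the delayed removals.
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "linger-cleanup");
        t.setDaemon(true);
        return t;
      });

  // Configurable delay between "application finished" and actual removal.
  private final long lingerMillis;

  LingeringClientRegistry(long lingerMillis) {
    this.lingerMillis = lingerMillis;
  }

  void register(String appId, C client) {
    clients.put(appId, client);
  }

  /** Returns the client, or null if none is registered or it was already removed. */
  C get(String appId) {
    return clients.get(appId);
  }

  /** Schedules removal after the linger period instead of removing immediately. */
  void applicationFinished(String appId) {
    scheduler.schedule(() -> {
      clients.remove(appId);
    }, lingerMillis, TimeUnit.MILLISECONDS);
  }

  void shutdown() {
    scheduler.shutdownNow();
  }
}
```

With a registry like this, a publisher that looks up the client shortly after the application transitions to FINISHED would still find it, as long as the event arrives within the linger period. The trade-off is that finished applications hold their client a little longer, so the delay should stay small and configurable.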