Charan Hebri created YARN-8130:
----------------------------------

             Summary: Race condition when container events are published for KILLED applications
                 Key: YARN-8130
                 URL: https://issues.apache.org/jira/browse/YARN-8130
             Project: Hadoop YARN
          Issue Type: Bug
          Components: ATSv2
            Reporter: Charan Hebri

There seems to be a race condition when an application is KILLED while the corresponding container event information is being published. For completed containers, a YARN_CONTAINER_FINISHED event is generated, but for some containers in a KILLED application this event is missing. Below is a NodeManager log snippet:
{code:java}
2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application application_1523259757659_0003 removed, cleanupLocalDirs = false
2018-04-09 08:44:54,478 INFO application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_1523259757659_0003 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been removed before the entity could be published for TimelineEntity[type='YARN_CONTAINER', id='container_1523259757659_0003_01_000002']
2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just finished : application_1523259757659_0003
2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs for container container_1523259757659_0003_01_000001. Current good log dirs are /grid/0/hadoop/yarn/log
2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs for container container_1523259757659_0003_01_000002. Current good log dirs are /grid/0/hadoop/yarn/log
2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:remove(192)) - The collector service for application_1523259757659_0003 was removed
2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1572)) - couldn't find application application_1523259757659_0003 while processing FINISH_APPS event. The ResourceManager allocated resources for this application to the NodeManager but no active containers were found to process
{code}
The container ID in the ERROR line of the log, *container_1523259757659_0003_01_000002*, is the one whose finished event is missing.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
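
For illustration only, the ordering suggested by the ERROR line in the log (the per-application client is removed, then a container entity arrives for publishing) can be sketched as below. This is a minimal, single-threaded model and not the actual NMTimelinePublisher code: the class PublishRaceSketch, the nested Publisher, and the methods startClient, stopClient and publish are all hypothetical names made up for this sketch, and the real race may involve more steps than shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PublishRaceSketch {

    /** Hypothetical stand-in for a publisher that keeps one client per application. */
    static class Publisher {
        // appId -> entities accepted by that application's (hypothetical) client
        private final Map<String, List<String>> clients = new ConcurrentHashMap<>();

        void startClient(String appId) {
            clients.put(appId, new ArrayList<>());
        }

        // Models the application reaching FINISHED: its client is removed.
        void stopClient(String appId) {
            clients.remove(appId);
        }

        // Returns false when the client is already gone, i.e. the entity is dropped.
        boolean publish(String appId, String entity) {
            List<String> client = clients.get(appId);
            if (client == null) {
                return false;
            }
            client.add(entity);
            return true;
        }
    }

    // Ordering that works: the container entity is published, then the app finishes.
    static boolean publishBeforeStop() {
        Publisher publisher = new Publisher();
        publisher.startClient("application_1523259757659_0003");
        boolean accepted = publisher.publish(
            "application_1523259757659_0003",
            "container_1523259757659_0003_01_000002");
        publisher.stopClient("application_1523259757659_0003");
        return accepted;
    }

    // Ordering seen in the log: the app finishes first, so the entity is dropped.
    static boolean publishAfterStop() {
        Publisher publisher = new Publisher();
        publisher.startClient("application_1523259757659_0003");
        publisher.stopClient("application_1523259757659_0003");
        return publisher.publish(
            "application_1523259757659_0003",
            "container_1523259757659_0003_01_000002");
    }

    public static void main(String[] args) {
        System.out.println("publish before stop accepted: " + publishBeforeStop());
        System.out.println("publish after stop accepted: " + publishAfterStop());
    }
}
```

In this model the outcome depends only on whether stopClient runs before or after publish for the same application, which is the kind of ordering dependency the report describes for KILLED applications.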