[ https://issues.apache.org/jira/browse/YARN-8130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470741#comment-16470741 ]
Rohith Sharma K S commented on YARN-8130: ----------------------------------------- Events are dispatched in FIFO but NMTimelinePublisher has internal dispatcher for processing timeline events. This internal dispatcher might also follow FIFO order which could be delayed. > Race condition when container events are published for KILLED applications > -------------------------------------------------------------------------- > > Key: YARN-8130 > URL: https://issues.apache.org/jira/browse/YARN-8130 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 > Reporter: Charan Hebri > Assignee: Rohith Sharma K S > Priority: Major > Attachments: YARN-8130.01.patch, YARN-8130.02.patch > > > There seems to be a race condition happening when an application is KILLED > and the corresponding container event information is being published. For > completed containers, a YARN_CONTAINER_FINISHED event is generated but for > some containers in a KILLED application this information is missing. Below is > a node manager log snippet, > {code:java} > 2018-04-09 08:44:54,474 INFO shuffle.ExternalShuffleBlockResolver > (ExternalShuffleBlockResolver.java:applicationRemoved(186)) - Application > application_1523259757659_0003 removed, cleanupLocalDirs = false > 2018-04-09 08:44:54,478 INFO application.ApplicationImpl > (ApplicationImpl.java:handle(632)) - Application > application_1523259757659_0003 transitioned from > APPLICATION_RESOURCES_CLEANINGUP to FINISHED > 2018-04-09 08:44:54,478 ERROR timelineservice.NMTimelinePublisher > (NMTimelinePublisher.java:putEntity(298)) - Seems like client has been > removed before the entity could be published for > TimelineEntity[type='YARN_CONTAINER', > id='container_1523259757659_0003_01_000002'] > 2018-04-09 08:44:54,478 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:finishLogAggregation(520)) - Application just > finished : application_1523259757659_0003 > 2018-04-09 08:44:54,488 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_000001. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:54,492 INFO logaggregation.AppLogAggregatorImpl > (AppLogAggregatorImpl.java:doContainerLogAggregation(576)) - Uploading logs > for container container_1523259757659_0003_01_000002. Current good log dirs > are /grid/0/hadoop/yarn/log > 2018-04-09 08:44:55,470 INFO collector.TimelineCollectorManager > (TimelineCollectorManager.java:remove(192)) - The collector service for > application_1523259757659_0003 was removed > 2018-04-09 08:44:55,472 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:handle(1572)) - couldn't find application > application_1523259757659_0003 while processing FINISH_APPS event. The > ResourceManager allocated resources for this application to the NodeManager > but no active containers were found to process{code} > The container id specified in the log, > *container_1523259757659_0003_01_000002* is the one that has the finished > event missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org