[ https://issues.apache.org/jira/browse/YARN-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341991#comment-16341991 ]
Rohith Sharma K S commented on YARN-7835:
-----------------------------------------

The log trace below shows that the master container of the 2nd attempt was launched on the same NodeManager, and no new timeline client was added for it because one already existed. But when the completed container of the 1st attempt was received, NMTimelinePublisher removed the timeline client.

{code}
2018-01-27 04:55:35,193 INFO application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding container_e22_1516990344374_0007_02_000001 to application application_1516990344374_0007
2018-01-27 04:55:35,195 INFO container.ContainerImpl (ContainerImpl.java:handle(2108)) - Container container_e22_1516990344374_0007_02_000001 transitioned from NEW to LOCALIZING
2018-01-27 04:55:35,195 INFO containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_INIT for appId application_1516990344374_0007
2018-01-27 04:55:35,196 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:putIfAbsent(149)) - the collector for application_1516990344374_0007 already exists!
...
...
2018-01-27 04:55:36,109 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(682)) - Removed completed containers from NM context: [container_e22_1516990344374_0007_01_000001]
2018-01-27 04:55:36,112 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:remove(192)) - The collector service for application_1516990344374_0007 was removed
2018-01-27 04:55:36,430 ERROR collector.TimelineCollectorWebService (TimelineCollectorWebService.java:putEntities(165)) - Application: application_1516990344374_0007 is not found
2018-01-27 04:55:36,430 ERROR collector.TimelineCollectorWebService (TimelineCollectorWebService.java:putEntities(179)) - Error putting entities
org.apache.hadoop.yarn.webapp.NotFoundException
	at org.apache.hadoop.yarn.server.timelineservice.co
{code}

> [Atsv2] Race condition in NM while publishing events if second attempt launched in same node
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-7835
>                 URL: https://issues.apache.org/jira/browse/YARN-7835
>             Project: Hadoop YARN
>          Issue Type: Bug
>         Environment: A race condition has been observed: if the master container
> is killed for some reason and the next attempt is launched on the same node,
> NMTimelinePublisher does not add a timelineClient for it. But once the completed
> container of the 1st attempt arrives, NMTimelinePublisher removes the timelineClient.
> As a result, all subsequent event publishing from other clients fails with the
> exception "Application is not found".
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
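The sequence in the log can be reproduced with a small model. What follows is a hypothetical sketch only. The class name `CollectorRegistrySketch` and its methods are invented for illustration and are not the NodeManager or TimelineCollectorManager API. The sketch also does not necessarily match how YARN-7835 was resolved.

The model works as follows:
* It records which application attempt currently owns the per-application collector.
* It lets a completion event remove the collector only when the event comes from that owning attempt.
* It assumes attempt numbers grow monotonically.
* It relies on `ConcurrentHashMap.merge` and the atomic `remove(key, value)`, so the ownership check and the removal cannot interleave with a concurrent init.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of a per-node collector registry; NOT the real YARN API.
public class CollectorRegistrySketch {

    // appId -> attempt number of the AM container that owns the collector
    private final ConcurrentMap<String, Integer> owners = new ConcurrentHashMap<>();

    /** Unguarded init: a later attempt finds the entry and leaves it unchanged. */
    public void initBuggy(String appId, int attempt) {
        owners.putIfAbsent(appId, attempt);
    }

    /** Unguarded removal: any completed AM container removes the collector (attempt is ignored). */
    public void completeBuggy(String appId, int attempt) {
        owners.remove(appId);
    }

    /** Guarded init: the newest attempt takes ownership of an existing collector. */
    public void init(String appId, int attempt) {
        owners.merge(appId, attempt, Integer::max);
    }

    /** Guarded removal: succeeds only if appId still maps to this attempt (atomic). */
    public void complete(String appId, int attempt) {
        owners.remove(appId, attempt);
    }

    public boolean hasCollector(String appId) {
        return owners.containsKey(appId);
    }

    public static void main(String[] args) {
        String app = "application_1516990344374_0007";

        CollectorRegistrySketch buggy = new CollectorRegistrySketch();
        buggy.initBuggy(app, 1);      // attempt 1 AM starts on this node
        buggy.initBuggy(app, 2);      // attempt 2 AM on same node: "already exists!"
        buggy.completeBuggy(app, 1);  // late completion of attempt 1 removes the collector
        System.out.println("buggy: collector present = " + buggy.hasCollector(app));

        CollectorRegistrySketch guarded = new CollectorRegistrySketch();
        guarded.init(app, 1);
        guarded.init(app, 2);         // ownership moves to attempt 2
        guarded.complete(app, 1);     // ignored: attempt 2 owns the collector
        System.out.println("guarded: collector present = " + guarded.hasCollector(app));
    }
}
{code}

Running `main` prints `buggy: collector present = false` and then `guarded: collector present = true`. Only the unguarded model ends up in the "Application ... is not found" state from the log. A reference count of live AM containers per application would be an alternative guard under the same assumptions.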