[ https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736704#comment-16736704 ]
Rohith Sharma K S commented on YARN-6695:
-----------------------------------------

[~eyang] Publishing container events from the RM is disabled by default, i.e. *yarn.rm.system-metrics-publisher.emit-container-events* is set to *false*. Have you enabled this configuration? We don't recommend enabling it, since it overloads the RM with a lot of events. If you can attach the stack trace, that would be helpful.

Regarding the patch, I am not a fan of catching NPE! Instead, let's do an explicit null check and log the right message, similar to NMTimelinePublisher#putEntity.

> Race condition in RM for publishing container events vs appFinished events causes NPE
> --------------------------------------------------------------------------------------
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rohith Sharma K S
> Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When the RM publishes container events, i.e. with
> *yarn.rm.system-metrics-publisher.emit-container-events* enabled, there is a
> race condition between processing those events and the appFinished event,
> which removes the appId from the collector list; this causes an NPE.
> Look at the trace below, where the appId is removed from the collectors first
> and the corresponding events are processed afterwards.
> {noformat}
> 2017-06-06 19:28:48,896 INFO capacity.ParentQueue (ParentQueue.java:removeApplication(472)) - Application removed - appId: application_1496758895643_0005 user: root leaf-queue of parent: root #applications: 0
> 2017-06-06 19:28:48,921 INFO collector.TimelineCollectorManager (TimelineCollectorManager.java:remove(190)) - The collector service for application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing entity TimelineEntity[type='YARN_CONTAINER', id='container_e01_1496758895643_0005_01_000002']
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>     at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>     at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>     at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}
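
For illustration only, the explicit null check suggested in the comment (instead of catching the NPE) might look roughly like the sketch below. The class `PutEntitySketch`, the `Collector` type and the `COLLECTORS` map are hypothetical stand-ins invented for this example; they are not the actual TimelineServiceV2Publisher or TimelineCollectorManager code, and the real fix may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: hypothetical stand-in types, not the real Hadoop classes. */
class PutEntitySketch {

  /** Hypothetical stand-in for a per-application timeline collector. */
  static class Collector {
    int published = 0;

    void putEntity(String entityId) {
      published++;
    }
  }

  /** Hypothetical stand-in for the appId -> collector registry. */
  static final Map<String, Collector> COLLECTORS = new ConcurrentHashMap<>();

  /**
   * Publishes an entity only if the application's collector is still
   * registered. Returns false, after logging, when the collector has
   * already been removed (for example by an appFinished event).
   */
  static boolean putEntity(String appId, String entityId) {
    // Look the collector up once and keep it in a local variable, so the
    // null check and the call below operate on the same reference.
    Collector collector = COLLECTORS.get(appId);
    if (collector == null) {
      System.err.println("Cannot publish " + entityId
          + ": collector for " + appId + " has already been removed");
      return false;
    }
    collector.putEntity(entityId);
    return true;
  }

  public static void main(String[] args) {
    COLLECTORS.put("application_1", new Collector());
    putEntity("application_1", "container_1"); // collector present: published
    COLLECTORS.remove("application_1");        // simulates appFinished cleanup
    putEntity("application_1", "container_2"); // collector gone: logged, skipped
  }
}
```

The point of the sketch is that a late container event for an application whose collector is already gone is logged with a specific message and dropped, rather than surfacing as a NullPointerException in the dispatcher thread.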