[ https://issues.apache.org/jira/browse/YARN-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483553#comment-15483553 ]
Junping Du commented on YARN-5190:
----------------------------------

Hi [~leftnoteasy], HADOOP-13362 was proposed to fix this issue for branch-2.7 and has already been checked in. Is there anything more to fix here?

> Registering/unregistering container metrics triggered by ContainerEvent and
> ContainersMonitorEvent are conflict which cause uncaught exception in
> ContainerMonitorImpl
> ----------------------------------------------------------------------------
>
>                 Key: YARN-5190
>                 URL: https://issues.apache.org/jira/browse/YARN-5190
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Blocker
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: YARN-5190-branch-2.7.001.patch, YARN-5190-v2.patch,
> YARN-5190.patch
>
>
> The exception stack is as follows:
> {noformat}
> 2016-05-22 01:50:04,554 [Container Monitor] ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Exception.
> org.apache.hadoop.metrics2.MetricsException: Metrics source ContainerResource_container_1463840817638_14484_01_000010 already exists!
>         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:135)
>         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:112)
>         at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:212)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.forContainer(ContainerMetrics.java:198)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:385)
> {noformat}
> After YARN-4906, there are multiple places that get the ContainerMetrics for a
> particular container, which can cause a race condition when different threads
> register the same container metrics with DefaultMetricsSystem. Because the
> MetricsException that can be thrown is not handled properly, it could bring
> down the ContainerMonitorImpl daemon or even the whole NM.
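
For readers unfamiliar with this kind of race, here is a minimal, self-contained sketch of the general pattern the description points at: two code paths asking for the metrics of the same container at the same time. It is not the YARN-5190 or HADOOP-13362 patch and does not use any Hadoop classes. The names ContainerSourceRegistry, Source, forContainer, unregister and created are hypothetical and exist only for this illustration. The sketch assumes that an atomic get-or-register step (here ConcurrentHashMap.computeIfAbsent) is an acceptable way to make sure one source is created per container ID, and that unregistering twice should be harmless.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a per-container metrics registry; not Hadoop code.
final class ContainerSourceRegistry {

    // Hypothetical holder for one container's metrics source.
    static final class Source {
        final String name;

        Source(String name) {
            this.name = name;
        }
    }

    private final ConcurrentMap<String, Source> sources = new ConcurrentHashMap<>();

    // Counts how many Source objects were actually constructed.
    final AtomicInteger created = new AtomicInteger();

    // Atomic get-or-register: computeIfAbsent applies the factory at most once
    // per key, so concurrent callers for the same container share one Source
    // instead of racing to register the same name twice.
    Source forContainer(String containerId) {
        return sources.computeIfAbsent(containerId, id -> {
            created.incrementAndGet();
            return new Source("ContainerResource_" + id);
        });
    }

    // Idempotent unregister: returns false (and does nothing) if the source
    // was already removed by another event.
    boolean unregister(String containerId) {
        return sources.remove(containerId) != null;
    }

    public static void main(String[] args) throws InterruptedException {
        ContainerSourceRegistry registry = new ContainerSourceRegistry();
        String id = "container_1463840817638_14484_01_000010";
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[8];

        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release all threads at once to provoke contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                registry.forContainer(id);
            });
            threads[i].start();
        }

        start.countDown();
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("sources created: " + registry.created.get());
    }
}
```

The design choice in this sketch is to make "look up, and create if missing" a single atomic operation rather than a check followed by a separate register call, since the gap between those two steps is where a second thread can slip in. Whether the real fix takes this shape is not stated in this message.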