[ https://issues.apache.org/jira/browse/YARN-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tao Yang updated YARN-11785:
----------------------------
Description:
Below is the error stack trace:
{code:java}
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-12, that exited unexpectedly:
org.apache.hadoop.metrics2.MetricsException: Metrics source PartitionQueueMetrics,partition=,q0=root,q1=xxx already exists!
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.getPartitionQueueMetrics(QueueMetrics.java:286)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics.setAvailableResourcesToUser(QueueMetrics.java:529)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1490)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1146)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:803)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:803)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1697)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1632)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1787)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1536)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:606)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:653)
{code}
Steps to reproduce:
* 1. An RPC handler thread is executing the refreshQueue command and adding a
new entry to the QUEUE_METRICS map.
* 2. Meanwhile, the async-scheduling thread fails to retrieve an existing
PartitionQueueMetrics from QueueMetrics#QUEUE_METRICS (the lookup returns
null) and then attempts to re-register the same metrics source name. This
triggers a MetricsException ("Metrics source ... already exists!") and causes
the async-scheduling thread to exit unexpectedly.

The root cause is that the QUEUE_METRICS field is a HashMap, which is not
thread-safe. Concurrent put and get operations can lead to visibility issues:
a reader can miss an entry that is already in the map. This can be fixed by
using a ConcurrentHashMap for the QUEUE_METRICS field.
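
A minimal sketch of the proposed direction, not the actual patch: the class QueueMetricsRegistrySketch, its methods, and the plain Object values are hypothetical stand-ins for QueueMetrics, its static QUEUE_METRICS map, and the MetricsSystem registration call. It assumes the lookup-and-register step can be expressed with ConcurrentHashMap#computeIfAbsent, so that a concurrent reader either sees the existing entry or registers the source exactly once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueMetricsRegistrySketch {
  // Hypothetical stand-in for the static QUEUE_METRICS map in QueueMetrics.
  private static final Map<String, Object> QUEUE_METRICS =
      new ConcurrentHashMap<>();

  // Counts how often the (simulated) registration step actually runs.
  private static final AtomicInteger REGISTRATIONS = new AtomicInteger();

  // Returns the metrics object for sourceName, registering it at most once.
  // computeIfAbsent performs the lookup and the insert atomically per key,
  // so two threads cannot both observe "absent" and both register.
  public static Object getOrRegister(String sourceName) {
    return QUEUE_METRICS.computeIfAbsent(sourceName, name -> {
      // Placeholder for the real registration with the metrics system.
      REGISTRATIONS.incrementAndGet();
      return new Object();
    });
  }

  public static int registrationCount() {
    return REGISTRATIONS.get();
  }

  public static void main(String[] args) throws InterruptedException {
    String name = "PartitionQueueMetrics,partition=,q0=root,q1=xxx";
    // One thread plays the RPC handler, the other the async scheduler.
    Thread refresher = new Thread(() -> getOrRegister(name));
    Thread scheduler = new Thread(() -> getOrRegister(name));
    refresher.start();
    scheduler.start();
    refresher.join();
    scheduler.join();
    System.out.println("registrations = " + registrationCount());
  }
}
```

Switching the map type alone addresses the visibility problem described above. Whether the surrounding check-then-register logic also needs to become atomic depends on how callers are synchronized, which this sketch does not model.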
> Race condition in QueueMetrics due to non-thread-safe HashMap causes
> MetricsException
> -------------------------------------------------------------------------------------
>
> Key: YARN-11785
> URL: https://issues.apache.org/jira/browse/YARN-11785
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.2.4, 3.3.6, 3.4.1
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major