[ https://issues.apache.org/jira/browse/YARN-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated YARN-9320:
-----------------------------------
    Description: 
We are running a snapshot of the 2.9 branch; unfortunately, I'm not sure off
the top of my head which version it corresponds to. I can look it up if that's
important, but I haven't found an existing bug like this, so I suspect it also
affects current versions unless it was fixed by accident.

If it helps, the cluster is very large (1000s of NMs), so we expect frequent
node failures and restarts. I have seen this happen a couple of times (so it's
not really "fatal"), in the middle of a burst of audit log entries for
"OPERATION=replaceLabelsOnNode" calls:
{noformat}
2019-02-20 13:12:52,785 FATAL [SchedulerEventDispatcher:Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils: queueCapacities.getNodePartitionsSet() changed
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.updateQueueStatistics(CSQueueUtils.java:303)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.updateClusterResource(LeafQueue.java:1879)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.updateClusterResource(ParentQueue.java:897)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeLabelsAndQueueResource(CapacityScheduler.java:1775)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1633)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:67)

{noformat}

  was:
We are running a snapshot of the 2.9 branch; unfortunately, I'm not sure off
the top of my head which version it corresponds to. I can look it up if that's
important, but I haven't found an existing bug like this, so I suspect it also
affects current versions unless it was fixed by accident.

If it helps, the cluster is very large (1000s of NMs), so we expect frequent
node failures and restarts. Also, some apps may specify misconfigured node
labels, so the node-label handling may hit corner cases. Still, a
user-supplied parameter should not be able to cause this.

{noformat}
2019-02-20 13:12:52,785 FATAL [SchedulerEventDispatcher:Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils: queueCapacities.getNodePartitionsSet() changed
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.updateQueueStatistics(CSQueueUtils.java:303)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.updateClusterResource(LeafQueue.java:1879)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.updateClusterResource(ParentQueue.java:897)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeLabelsAndQueueResource(CapacityScheduler.java:1775)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1633)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:67)

{noformat}


> ConcurrentModificationException in capacity scheduler (updateQueueStatistics)
> -----------------------------------------------------------------------------
>
>                 Key: YARN-9320
>                 URL: https://issues.apache.org/jira/browse/YARN-9320
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.3
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> We are running a snapshot of the 2.9 branch; unfortunately, I'm not sure off
> the top of my head which version it corresponds to. I can look it up if
> that's important, but I haven't found an existing bug like this, so I suspect
> it also affects current versions unless it was fixed by accident.
>
> If it helps, the cluster is very large (1000s of NMs), so we expect frequent
> node failures and restarts. I have seen this happen a couple of times (so
> it's not really "fatal"), in the middle of a burst of audit log entries for
> "OPERATION=replaceLabelsOnNode" calls:
> {noformat}
> 2019-02-20 13:12:52,785 FATAL [SchedulerEventDispatcher:Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils: queueCapacities.getNodePartitionsSet() changed
> java.util.ConcurrentModificationException
>       at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>       at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.updateQueueStatistics(CSQueueUtils.java:303)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.updateClusterResource(LeafQueue.java:1879)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.updateClusterResource(ParentQueue.java:897)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeLabelsAndQueueResource(CapacityScheduler.java:1775)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1633)
>       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
>       at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:67)
> {noformat}
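
For readers unfamiliar with this failure mode: the top frames of the trace
(HashMap$KeyIterator.next) are the fail-fast iterator of a HashMap-backed key
set, which throws when the set is structurally modified after the iterator was
created. The sketch below reproduces that behaviour in isolation and shows two
common ways to avoid it. It is not the YARN code and not a proposed patch; the
class name, method names, and partition labels are all hypothetical, and the
modification is done from the iterating thread only to make the demo
deterministic (in the report it would presumably come from another thread).

```java
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a set of node partitions; not a YARN class.
class PartitionSetDemo {

    // Iterates over `partitions` and adds a label part-way through, standing
    // in for a label update that lands during a statistics pass.
    // Returns true if the iterator threw ConcurrentModificationException.
    static boolean iterateWhileMutating(Set<String> partitions) {
        try {
            for (String p : partitions) {
                partitions.add("added-during-iteration");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterates over a private copy, so changes to the original set cannot
    // invalidate the iterator. With a real concurrent writer, the copy itself
    // would have to be taken under the same lock the writer uses.
    static int iterateOverSnapshot(Set<String> partitions) {
        int visited = 0;
        for (String p : new HashSet<>(partitions)) {
            partitions.add("added-during-iteration");
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        // HashSet iterators are fail-fast.
        Set<String> failFast = new HashSet<>(Arrays.asList("", "gpu", "ssd"));
        System.out.println("HashSet threw CME: " + iterateWhileMutating(failFast));

        // ConcurrentHashMap key-set iterators are weakly consistent and do
        // not throw ConcurrentModificationException.
        Set<String> concurrent = ConcurrentHashMap.newKeySet();
        concurrent.addAll(Arrays.asList("", "gpu", "ssd"));
        System.out.println("Concurrent set threw CME: " + iterateWhileMutating(concurrent));

        Set<String> copied = new HashSet<>(Arrays.asList("", "gpu", "ssd"));
        System.out.println("Visited via snapshot: " + iterateOverSnapshot(copied));
    }
}
```

Which option fits (a concurrent set, a snapshot, or consistent locking around
both the reader and the writer) depends on how the real set is shared, which
this message does not show.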


