[ https://issues.apache.org/jira/browse/YARN-9320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated YARN-9320:
-----------------------------------
    Description:
We are running a snapshot of the 2.9 branch; unfortunately, I'm not sure off the top of my head which version it corresponds to. I can look it up if that's important, but I haven't found an existing bug like this, so I suspect it also affects current versions unless it was fixed by accident.
If it helps, the cluster is very large (1000s of NMs), so we expect frequent node failures/restarts; I see this happen a couple of times (so it's not really "fatal") among a bunch of audit log entries for "OPERATION=replaceLabelsOnNode" calls.

{noformat}
2019-02-20 13:12:52,785 FATAL [SchedulerEventDispatcher:Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils: queueCapacities.getNodePartitionsSet() changed
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.updateQueueStatistics(CSQueueUtils.java:303)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.updateClusterResource(LeafQueue.java:1879)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.updateClusterResource(ParentQueue.java:897)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeLabelsAndQueueResource(CapacityScheduler.java:1775)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1633)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:67)
{noformat}

  was:
We are running a snapshot of the 2.9 branch; unfortunately, I'm not sure off the top of my head which version it corresponds to. I can look it up if that's important, but I haven't found an existing bug like this, so I suspect it also affects current versions unless it was fixed by accident.
If it helps, the cluster is very large (1000s of NMs), so we expect frequent node failures/restarts; also, some apps may specify misconfigured node labels, so node-label-related code may hit corner cases. Still, this shouldn't happen based on a user-supplied parameter.

{noformat}
2019-02-20 13:12:52,785 FATAL [SchedulerEventDispatcher:Event Processor] org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils: queueCapacities.getNodePartitionsSet() changed
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.updateQueueStatistics(CSQueueUtils.java:303)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.updateClusterResource(LeafQueue.java:1879)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.updateClusterResource(ParentQueue.java:897)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeLabelsAndQueueResource(CapacityScheduler.java:1775)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1633)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:154)
        at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:67)
{noformat}


> ConcurrentModificationException in capacity scheduler (updateQueueStatistics)
> -----------------------------------------------------------------------------
>
>                 Key: YARN-9320
>                 URL: https://issues.apache.org/jira/browse/YARN-9320
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.3
>            Reporter: Sergey Shelukhin
>            Priority: Critical
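
For readers unfamiliar with the failure mode in the top two stack frames, here is a minimal standalone sketch of how a fail-fast HashMap key iterator throws when the map is structurally modified mid-iteration. It is not the CapacityScheduler code and does not claim to identify the root cause of this issue: the class name, method names, and partition labels are all hypothetical, and the modification happens on the same thread so that the behavior is reproducible, whereas the report most likely involves a second thread changing the map.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in YARN.
class PartitionIterationSketch {

    // Stand-in for a per-partition capacity map backed by a plain HashMap.
    static Map<String, Float> newCapacities() {
        Map<String, Float> caps = new HashMap<>();
        caps.put("", 1.0f);      // default (unlabeled) partition
        caps.put("gpu", 0.5f);   // made-up label
        caps.put("ssd", 0.25f);  // made-up label
        return caps;
    }

    // Iterates the live key set while a new key is added inside the loop.
    // Returns true if the iterator's fail-fast check threw.
    public static boolean liveIterationFails() {
        Map<String, Float> caps = newCapacities();
        try {
            for (String partition : caps.keySet()) {
                // Adding a new key is a structural modification.
                caps.put("added-" + partition, 0f);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterates a copy of the key set instead, so later changes to the map
    // do not affect the loop. Returns the keys that were visited.
    // Caveat: with a real second writer thread, the copy itself would have
    // to be taken under the same lock the writer holds (or the map would
    // need to be a concurrent collection); copying alone is not enough.
    public static List<String> snapshotIteration() {
        Map<String, Float> caps = newCapacities();
        List<String> visited = new ArrayList<>();
        for (String partition : new ArrayList<>(caps.keySet())) {
            caps.put("added-" + partition, 0f);
            visited.add(partition);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("live iteration threw: " + liveIterationFails());
        System.out.println("snapshot visited: " + snapshotIteration().size() + " keys");
    }
}
```

The sketch only shows why iterating a live HashMap key set is fragile when label updates can change the set; whether a snapshot, a lock, or a concurrent collection is the appropriate fix for CSQueueUtils is a separate question for the actual patch.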