[ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040920#comment-18040920
 ] 

ASF GitHub Bot commented on YARN-10892:
---------------------------------------

github-actions[bot] closed pull request #3324: YARN-10892. YARN Preemption 
Monitor got java.util.ConcurrentModificationException when three or more 
partitions exists
URL: https://github.com/apache/hadoop/pull/3324




> YARN Preemption Monitor got java.util.ConcurrentModificationException when 
> three or more partitions exists
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10892
>                 URL: https://issues.apache.org/jira/browse/YARN-10892
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Jeongin Ju
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-10892.001.patch, YARN-10892.002.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> On our cluster with a large number of NMs, the preemption monitor thread 
> consistently hit java.util.ConcurrentModificationException when specific 
> conditions were met. (And preemption doesn't work, of course.)
> The conditions we found are as follows. (All 4 conditions must be met.)
>  # There are at least two non-exclusive partitions besides the default 
> partition (call them partitions X and Y).
>  # app1, in a queue belonging to the default partition (call it the 'dev' 
> queue), has borrowed resources from both partitions X and Y.
>  # app2 and app3, submitted to queues belonging to partitions X and Y 
> respectively, are 'PENDING' because the resources are consumed by app1.
>  # The preemption monitor can clear borrowed resources from X or Y when a 
> container of app1 is preempted.
> The main problem is that FifoCandidatesSelector.selectCandidates tries to 
> remove a HashMap key (the partition name) while iterating over the HashMap.
> Logically this is correct, because we never traverse the same partition 
> again within this 'selectCandidates' call. However, HashMap does not allow 
> modification while it is being iterated.
> I wrote a test case to reproduce the error 
> (testResourceTypesInterQueuePreemptionWithThreePartitions).
> We found and patched this on our 3.1.2 cluster, but trunk still seems to 
> have the same problem.
> I attached a patch based on trunk.
>  
> Thanks!
>  
> {quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
>  java.util.ConcurrentModificationException
>  at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>  at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
>  at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
>  at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
>  at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
>  at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
>  at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)}}
>  
> {quote}
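
The failure mode described in the issue can be reproduced outside YARN. The
following is a minimal sketch, not the FifoCandidatesSelector code and not the
attached patch: the class name PartitionRemovalSketch, the sample map, and its
values are all hypothetical. It shows a HashMap key being removed inside a
for-each loop over keySet(), which makes the iterator's next call fail, and two
standard java.util alternatives (Iterator.remove() and iterating over a
snapshot of the keys).

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class PartitionRemovalSketch {

  // Hypothetical stand-in for a per-partition map: partition name -> amount
  // still to be cleared. "" plays the role of the default partition.
  static Map<String, Integer> samplePartitions() {
    Map<String, Integer> partitions = new HashMap<>();
    partitions.put("", 0);
    partitions.put("X", 4);
    partitions.put("Y", 8);
    return partitions;
  }

  // Failing pattern: the map is modified directly while a for-each loop
  // holds an iterator over its key set.
  static boolean unsafeRemovalThrows() {
    Map<String, Integer> partitions = samplePartitions();
    try {
      for (String partition : partitions.keySet()) {
        if (partitions.get(partition) > 0) {
          partitions.remove(partition); // structural change behind the iterator
        }
      }
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  // Alternative 1: remove through the iterator, which keeps its own
  // bookkeeping consistent with the map.
  static Map<String, Integer> removeViaIterator() {
    Map<String, Integer> partitions = samplePartitions();
    Iterator<Map.Entry<String, Integer>> it = partitions.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() > 0) {
        it.remove();
      }
    }
    return partitions;
  }

  // Alternative 2: iterate over a copy of the keys and modify the original.
  static Map<String, Integer> removeViaSnapshot() {
    Map<String, Integer> partitions = samplePartitions();
    for (String partition : new ArrayList<>(partitions.keySet())) {
      if (partitions.get(partition) > 0) {
        partitions.remove(partition);
      }
    }
    return partitions;
  }

  public static void main(String[] args) {
    System.out.println("unsafe removal throws: " + unsafeRemovalThrows());
    System.out.println("left after Iterator.remove(): " + removeViaIterator().size());
    System.out.println("left after snapshot loop: " + removeViaSnapshot().size());
  }
}
```

Which alternative would suit selectCandidates depends on the surrounding code;
the sketch only illustrates the general HashMap behaviour behind the stack
trace above.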



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

