[
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040736#comment-18040736
]
ASF GitHub Bot commented on YARN-10892:
---------------------------------------
github-actions[bot] commented on PR #3324:
URL: https://github.com/apache/hadoop/pull/3324#issuecomment-3578184351
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> YARN Preemption Monitor got java.util.ConcurrentModificationException when
> three or more partitions exists
> ----------------------------------------------------------------------------------------------------------
>
> Key: YARN-10892
> URL: https://issues.apache.org/jira/browse/YARN-10892
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.2
> Reporter: Jeongin Ju
> Priority: Major
> Labels: pull-request-available
> Attachments: YARN-10892.001.patch, YARN-10892.002.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> On our cluster with a large number of NMs, the preemption monitor thread
> consistently throws java.util.ConcurrentModificationException when specific
> conditions are met, and as a result preemption does not work.
> The conditions we found are as follows (all four must be met):
> # There are at least two non-exclusive partitions in addition to the default
> partition (call them partitions X and Y).
> # app1, in a queue belonging to the default partition (call it the 'dev'
> queue), has borrowed resources from both the X and Y partitions.
> # app2 and app3, submitted to queues belonging to the X and Y partitions
> respectively, are 'PENDING' because the resources are consumed by app1.
> # The preemption monitor can clear the borrowed resources from X or Y when
> a container of app1 is preempted.
> The main problem is that FifoCandidatesSelector.selectCandidates tries to
> remove a HashMap key (the partition name) while iterating over that HashMap.
> Logically this is correct, because the same partition is not traversed again
> within this 'selectCandidates' call. However, HashMap does not allow
> structural modification during iteration, other than through the iterator
> itself.
> I wrote a test case that reproduces the error
> (testResourceTypesInterQueuePreemptionWithThreePartitions).
> We found the problem and patched our cluster on 3.1.2, but trunk appears to
> have the same problem.
> I have attached a patch based on trunk.
>
> Thanks!
>
> {quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor
> (SchedulingMonitor.java:run(116)) - Exception raised while executing
> preemption checker, skip this run..., exception=
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)}}
>
> {quote}
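The failure mode described in the quoted issue (removing a HashMap key while a key iterator over the same map is still live) can be reproduced in isolation. The following is a minimal sketch only. It is not taken from FifoCandidatesSelector or from the attached patches; the class name, the method names, the "default"/"X"/"Y" labels, and the integer values are all hypothetical stand-ins. It shows the unsafe pattern next to two common ways of avoiding it: removing through the iterator, and iterating over a snapshot of the keys.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PartitionRemovalSketch {

  // Hypothetical stand-in for a per-partition map: label -> some amount.
  static Map<String, Integer> samplePartitions() {
    Map<String, Integer> partitions = new HashMap<>();
    partitions.put("default", 4);
    partitions.put("X", 2);
    partitions.put("Y", 2);
    return partitions;
  }

  // Unsafe: calls Map.remove while a keySet iterator is in use.
  // With three entries, the second call to next() detects the modification.
  static boolean unsafeRemovalThrows() {
    Map<String, Integer> partitions = samplePartitions();
    try {
      for (String partition : partitions.keySet()) {
        partitions.remove(partition);
      }
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  // Safe option 1: remove through the iterator itself.
  static Map<String, Integer> removeViaIterator(Map<String, Integer> partitions) {
    Iterator<Map.Entry<String, Integer>> it = partitions.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Integer> entry = it.next();
      if (entry.getValue() <= 2) {
        it.remove();
      }
    }
    return partitions;
  }

  // Safe option 2: iterate over a copy of the keys, mutate the original map.
  static Map<String, Integer> removeViaSnapshot(Map<String, Integer> partitions) {
    for (String partition : new ArrayList<>(partitions.keySet())) {
      if (partitions.get(partition) <= 2) {
        partitions.remove(partition);
      }
    }
    return partitions;
  }

  public static void main(String[] args) {
    System.out.println("unsafe removal throws: " + unsafeRemovalThrows());
    System.out.println("after iterator removal: "
        + removeViaIterator(samplePartitions()).keySet());
    System.out.println("after snapshot removal: "
        + removeViaSnapshot(samplePartitions()).keySet());
  }
}
```

Which of the two safe options fits depends on where the removal happens: Iterator.remove only works when the removal is performed in the same loop that owns the iterator, while the snapshot approach also works when the removal happens in a method called from inside the loop.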