[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841592#comment-17841592 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081364250 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 01s | | No case conflicting files found. | | +0 :ok: | spotbugs | 0m 01s | | spotbugs executables are not available. | | +0 :ok: | codespell | 0m 01s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 01s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 00s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 00s | | The patch appears to include 1 new or modified test files. | _ branch-3.4 Compile Tests _ | | +1 :green_heart: | mvninstall | 88m 19s | | branch-3.4 passed | | +1 :green_heart: | compile | 5m 43s | | branch-3.4 passed | | +1 :green_heart: | checkstyle | 4m 35s | | branch-3.4 passed | | +1 :green_heart: | mvnsite | 5m 36s | | branch-3.4 passed | | +1 :green_heart: | javadoc | 5m 31s | | branch-3.4 passed | | +1 :green_heart: | shadedclient | 148m 13s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 3m 26s | | the patch passed | | +1 :green_heart: | compile | 3m 06s | | the patch passed | | +1 :green_heart: | javac | 3m 06s | | the patch passed | | +1 :green_heart: | blanks | 0m 00s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 2m 11s | | the patch passed | | +1 :green_heart: | mvnsite | 3m 18s | | the patch passed | | +1 :green_heart: | javadoc | 2m 55s | | the patch passed | | +1 :green_heart: | shadedclient | 155m 05s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 323m 16s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6768/3/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. | | -1 :x: | asflicense | 5m 28s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6768/3/artifact/out/results-asflicense.txt) | The patch generated 2 ASF License warnings. | | | | 747m 18s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.yarn.server.resourcemanager.metrics.TestCombinedSystemMetricsPublisher | | | hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore | | | hadoop.yarn.server.resourcemanager.resource.TestResourceProfiles | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestFSSchedulerConfigurationStore | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestLeveldbConfigurationStore | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCSAllocateCustomResource | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueConfigurationAutoRefreshPolicy | | | hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigArgumentHandler | | | hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverter | | | hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain | | | hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSQueueConverter | | | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService | | | hadoop.yarn.server.resourcemanager.TestClientRMService | | | hadoop.yarn.server.resourcemanager.TestSignalContainer | | | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart | | | hadoop.yarn.server.resourcemanager.volume.csi.TestVolumeProcessor | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppCustomResourceTypes | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsCustomResourceTypes | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesContainers | | | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication | | |
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841549#comment-17841549 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081257034 @tomicooler Thanks for the contribution! > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841548#comment-17841548 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 merged PR #6768: URL: https://github.com/apache/hadoop/pull/6768 > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841541#comment-17841541 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081187638 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 6m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ branch-3.4 Compile Tests _ | | +1 :green_heart: | mvninstall | 32m 44s | | branch-3.4 passed | | +1 :green_heart: | compile | 0m 36s | | branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 0m 30s | | branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 0m 31s | | branch-3.4 passed | | +1 :green_heart: | mvnsite | 0m 34s | | branch-3.4 passed | | +1 :green_heart: | javadoc | 0m 37s | | branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 29s | | branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 16s | | branch-3.4 passed | | +1 :green_heart: | shadedclient | 20m 32s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 27s | | the patch passed | | +1 :green_heart: | compile | 0m 28s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 0m 28s | | the patch passed | | +1 :green_heart: | compile | 0m 28s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 0m 28s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 24s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 28s | | the patch passed | | +1 :green_heart: | javadoc | 0m 25s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 28s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 9s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 24s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 87m 33s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 26s | | The patch does not generate ASF License warnings. | | | | 179m 4s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6768 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 10200f517789 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | branch-3.4 / 31f1424fcd7011fdd78d507e9482cffee9bcd09b | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/3/testReport/ | | Max. process+thread count | 950 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/3/console | | versions | git=2.25.1 maven=3.6.3
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841525#comment-17841525 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081123115 @slfan1989 Yes, sure. Re-triggered. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841465#comment-17841465 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2080551000 @tomicooler Can we retrigger compilation? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841464#comment-17841464 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 commented on PR #6769: URL: https://github.com/apache/hadoop/pull/6769#issuecomment-2080548182 @tomicooler Thanks for the contribution! > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841463#comment-17841463 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 merged PR #6769: URL: https://github.com/apache/hadoop/pull/6769 > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840745#comment-17840745 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2076847309 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 21s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ branch-3.4 Compile Tests _ | | +1 :green_heart: | mvninstall | 32m 50s | | branch-3.4 passed | | +1 :green_heart: | compile | 0m 32s | | branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 0m 31s | | branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 0m 31s | | branch-3.4 passed | | +1 :green_heart: | mvnsite | 0m 36s | | branch-3.4 passed | | +1 :green_heart: | javadoc | 0m 37s | | branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 29s | | branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 9s | | branch-3.4 passed | | +1 :green_heart: | shadedclient | 20m 36s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 26s | | the patch passed | | +1 :green_heart: | compile | 0m 30s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 0m 30s | | the patch passed | | +1 :green_heart: | compile | 0m 25s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 0m 25s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 24s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 27s | | the patch passed | | +1 :green_heart: | javadoc | 0m 27s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 25s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 12s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 31s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 87m 5s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 26s | | The patch does not generate ASF License warnings. | | | | 172m 52s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6768 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 111d4e1feebc 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | branch-3.4 / 985da47eda7acc4d92a0a5c4095c164f33520725 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/2/testReport/ | | Max. process+thread count | 955 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/2/console | | versions | git=2.25.1 maven=3.6.3
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840543#comment-17840543 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6769: URL: https://github.com/apache/hadoop/pull/6769#issuecomment-2075632733 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 11m 0s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ branch-3.3 Compile Tests _ | | +1 :green_heart: | mvninstall | 50m 58s | | branch-3.3 passed | | +1 :green_heart: | compile | 0m 57s | | branch-3.3 passed | | +1 :green_heart: | checkstyle | 0m 47s | | branch-3.3 passed | | +1 :green_heart: | mvnsite | 1m 1s | | branch-3.3 passed | | +1 :green_heart: | javadoc | 0m 48s | | branch-3.3 passed | | +1 :green_heart: | spotbugs | 1m 58s | | branch-3.3 passed | | +1 :green_heart: | shadedclient | 41m 29s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 56s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 35s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 51s | | the patch passed | | +1 :green_heart: | javadoc | 0m 33s | | the patch passed | | +1 :green_heart: | spotbugs | 1m 58s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 38s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 95m 54s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. | | | | 254m 50s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6769/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6769 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux a214af58c8fa 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | branch-3.3 / 7c732b60916ebadef8ec7b06a634aaef28a76047 | | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6769/1/testReport/ | | Max. process+thread count | 857 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6769/1/console | | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840493#comment-17840493 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6768: URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2075284019 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 21s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ branch-3.4 Compile Tests _ | | +1 :green_heart: | mvninstall | 33m 3s | | branch-3.4 passed | | +1 :green_heart: | compile | 0m 34s | | branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 0m 33s | | branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 0m 34s | | branch-3.4 passed | | +1 :green_heart: | mvnsite | 0m 38s | | branch-3.4 passed | | +1 :green_heart: | javadoc | 0m 37s | | branch-3.4 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 30s | | branch-3.4 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 10s | | branch-3.4 passed | | +1 :green_heart: | shadedclient | 20m 29s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | -1 :x: | mvninstall | 0m 23s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch failed. | | -1 :x: | compile | 0m 27s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt) | hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1. | | -1 :x: | javac | 0m 27s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt) | hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1. | | -1 :x: | compile | 0m 22s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt) | hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06. | | -1 :x: | javac | 0m 22s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt) | hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06. | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 24s | | the patch passed | | -1 :x: | mvnsite | 0m 24s |
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840457#comment-17840457 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler opened a new pull request, #6769: URL: https://github.com/apache/hadoop/pull/6769 … (#6732) (cherry picked from commit ecf665c6facf89d3b87b6e3cc684274b8155ca60) Change-Id: I561bcad51af7810328c8b91cd9290d5198be0c6e ### Description of PR Backport, there were conflicts (abstractparent/leaf queue, and queuepath doesn't exist here yet). Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191) Original PR: #6732 ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840432#comment-17840432 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler opened a new pull request, #6768: URL: https://github.com/apache/hadoop/pull/6768 … (#6732) (cherry picked from commit ecf665c6facf89d3b87b6e3cc684274b8155ca60) ### Description of PR backport. Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191) Original PR: #6732 ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840428#comment-17840428 ] ASF GitHub Bot commented on YARN-11191: --- brumi1024 merged PR #6732: URL: https://github.com/apache/hadoop/pull/6732 > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840427#comment-17840427 ] ASF GitHub Bot commented on YARN-11191: --- brumi1024 commented on PR #6732: URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2074890741 Thanks for the patch @tomicooler and @slfan1989 @p-szucs for the review, merging to trunk. @tomicooler can you please check if a backport is possible to the branches 3.3 and 3.4? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840239#comment-17840239 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6732: URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2073537387 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 01s | | No case conflicting files found. | | +0 :ok: | spotbugs | 0m 01s | | spotbugs executables are not available. | | +0 :ok: | codespell | 0m 01s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 01s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 00s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 00s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 106m 35s | | trunk passed | | +1 :green_heart: | compile | 6m 48s | | trunk passed | | +1 :green_heart: | checkstyle | 5m 43s | | trunk passed | | +1 :green_heart: | mvnsite | 6m 51s | | trunk passed | | +1 :green_heart: | javadoc | 6m 25s | | trunk passed | | +1 :green_heart: | shadedclient | 174m 58s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 4m 23s | | the patch passed | | +1 :green_heart: | compile | 4m 06s | | the patch passed | | +1 :green_heart: | javac | 4m 06s | | the patch passed | | +1 :green_heart: | blanks | 0m 00s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 2m 38s | | the patch passed | | +1 :green_heart: | mvnsite | 3m 57s | | the patch passed | | +1 :green_heart: | javadoc | 3m 27s | | the patch passed | | +1 :green_heart: | shadedclient | 190m 26s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | asflicense | 6m 36s | | The patch does not generate ASF License warnings. | | | | 502m 33s | | | | Subsystem | Report/Notes | |--:|:-| | GITHUB PR | https://github.com/apache/hadoop/pull/6732 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | MINGW64_NT-10.0-17763 0cbe62357619 3.4.10-87d57229.x86_64 2024-02-14 20:17 UTC x86_64 Msys | | Build tool | maven | | Personality | /c/hadoop/dev-support/bin/hadoop.sh | | git revision | trunk / 73c01ccb77ca85cd83785726b55841fe8b94e973 | | Default Java | Azul Systems, Inc.-1.8.0_332-b09 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6732/1/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6732/1/console | | versions | git=2.44.0.windows.1 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold:
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840042#comment-17840042 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6732: URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2071872714 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 33m 18s | | trunk passed | | +1 :green_heart: | compile | 0m 36s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 0m 32s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 0m 32s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 35s | | trunk passed | | +1 :green_heart: | javadoc | 0m 38s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 32s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 10s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 37s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 25s | | the patch passed | | +1 :green_heart: | compile | 0m 32s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 0m 32s | | the patch passed | | +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 0m 29s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 22s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 29s | | the patch passed | | +1 :green_heart: | javadoc | 0m 25s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 26s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 11s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 45s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 90m 8s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 25s | | The patch does not generate ASF License warnings. | | | | 176m 11s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6732 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux e28442f7b76e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 73c01ccb77ca85cd83785726b55841fe8b94e973 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/2/testReport/ | | Max. process+thread count | 941 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839966#comment-17839966 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler commented on code in PR #6732: URL: https://github.com/apache/hadoop/pull/6732#discussion_r1575731004 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java: ## @@ -1347,6 +1348,18 @@ public List getChildQueues() { } + @Override + public List getChildQueuesByTryLock() { +try { + while (!readLock.tryLock()){ +LockSupport.parkNanos(1); Review Comment: Ok, thanks. Fixed the checkstyle issues. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839888#comment-17839888 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 commented on code in PR #6732: URL: https://github.com/apache/hadoop/pull/6732#discussion_r1575454919 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java: ## @@ -1347,6 +1348,18 @@ public List getChildQueues() { } + @Override + public List getChildQueuesByTryLock() { +try { + while (!readLock.tryLock()){ +LockSupport.parkNanos(1); Review Comment: I have no other comments. we need to fix checkstyle, then we can merge this pr. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837794#comment-17837794 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler commented on code in PR #6732: URL: https://github.com/apache/hadoop/pull/6732#discussion_r1567660577 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java: ## @@ -1347,6 +1348,18 @@ public List getChildQueues() { } + @Override + public List getChildQueuesByTryLock() { +try { + while (!readLock.tryLock()){ +LockSupport.parkNanos(1); Review Comment: I don't know how useful is to configure this value, the only reason to increase it if the loop is too busy with the 10 thousand nanos (=10 microseconds). https://hazelcast.com/blog/locksupport-parknanos-under-the-hood-and-the-curious-case-of-parking/ Based on this, the granularity should be around 50 microseconds anyway. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837737#comment-17837737 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 commented on code in PR #6732: URL: https://github.com/apache/hadoop/pull/6732#discussion_r1567460352 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java: ## @@ -1347,6 +1348,18 @@ public List getChildQueues() { } + @Override + public List getChildQueuesByTryLock() { +try { + while (!readLock.tryLock()){ +LockSupport.parkNanos(1); Review Comment: Is it better for us to add a configuration? This configuration is used to configure `1` > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837341#comment-17837341 ] ASF GitHub Bot commented on YARN-11191: --- p-szucs commented on PR #6732: URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2057246374 Thanks @tomicooler, LGTM > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837326#comment-17837326 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #6732: URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2057144840 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 7m 33s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 32m 59s | | trunk passed | | +1 :green_heart: | compile | 0m 34s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 0m 32s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 0m 31s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 38s | | trunk passed | | +1 :green_heart: | javadoc | 0m 36s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 29s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 7s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 27s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 28s | | the patch passed | | +1 :green_heart: | compile | 0m 30s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 0m 30s | | the patch passed | | +1 :green_heart: | compile | 0m 26s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 0m 26s | | the patch passed | | +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 25s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 139 unchanged - 0 fixed = 141 total (was 139) | | +1 :green_heart: | mvnsite | 0m 26s | | the patch passed | | +1 :green_heart: | javadoc | 0m 26s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 0m 27s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 1m 11s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 39s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 89m 55s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 24s | | The patch does not generate ASF License warnings. | | | | 182m 42s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6732 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 8af00fba70d5 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / aeb1b430e30affd6549348feda637ba93b45a298 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/1/testReport/ | | Max.
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837295#comment-17837295 ] ASF GitHub Bot commented on YARN-11191: --- slfan1989 commented on PR #6732: URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2056988311 LGTM > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837222#comment-17837222 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler opened a new pull request, #6732: URL: https://github.com/apache/hadoop/pull/6732 Change-Id: I4ca3998748494c1148b1ef79b33500eecae02a97 ### Description of PR Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191) Original PR: https://github.com/apache/hadoop/pull/4726 The investigation and the fix was done by @yb12138, see the Original PR for further details. I fixed the leftover review comments and I slightly modified the new test code to actually test the `refreshQueues`. The new test fails without the fix. ### How was this patch tested? I created a sample application for demonstration: ```java import java.util.concurrent.locks.LockSupport; import java.util.concurrent.locks.ReentrantReadWriteLock; public class Main { static final long start = System.currentTimeMillis(); public static void main(String[] args) { ReentrantReadWriteLock queue = new ReentrantReadWriteLock(); ReentrantReadWriteLock preemption = new ReentrantReadWriteLock(); Thread schedulerThread = new Thread(() -> { log("queue.readLock().lock()"); queue.readLock().lock(); try { Thread.sleep(5 * 1000); } catch (InterruptedException e) { e.printStackTrace(); } log("preemption.readLock().lock()"); preemption.readLock().lock(); log("preemption.readLock().unlock()"); preemption.readLock().unlock(); log("queue.readLock().unlock()"); queue.readLock().unlock(); }, "SCHEDULE"); Thread completeThread = new Thread(() -> { try { Thread.sleep(1000); } catch (InterruptedException e) { e.printStackTrace(); } log("queue.writeLock().lock()"); queue.writeLock().lock(); log("queue.writeLock().unlock()"); queue.writeLock().unlock(); }, "COMPLETE"); Thread refreshThread = new Thread(() -> { try { Thread.sleep(2 * 1000); } catch (InterruptedException e) { e.printStackTrace(); } log("preemption.writeLock().lock()"); preemption.writeLock().lock(); log("queue.readLock().lock()"); // OUTPUT_A queue.readLock().lock(); // OUTPUT_B //while (!queue.readLock().tryLock()) { // LockSupport.parkNanos(1); //} log("queue.readLock().unlock()"); queue.readLock().unlock(); log("preemption.writeLock().unlock()"); preemption.writeLock().unlock(); }, "REFRESH "); schedulerThread.start(); completeThread.start(); refreshThread.start(); try { schedulerThread.join(); completeThread.join(); refreshThread.join(); } catch (InterruptedException e) { throw new RuntimeException(e); } } static void log(String what) { System.out.printf("%04d %s %s%n", System.currentTimeMillis() - start, Thread.currentThread().getName(), what); } } ``` `OUTPUT_A` deadlock happens. ``` 0005 SCHEDULE queue.readLock().lock() 1007 COMPLETE queue.writeLock().lock() 2006 REFRESH preemption.writeLock().lock() 2006 REFRESH queue.readLock().lock() 5042 SCHEDULE preemption.readLock().lock() ``` ![threads](https://github.com/apache/hadoop/assets/7456325/d35d8f2a-ab62-4fed-ba5e-d7c352cb504d) `OUTPUT_B` deadlock does not happen. (tryLock) ``` 0003 SCHEDULE queue.readLock().lock() 1003 COMPLETE queue.writeLock().lock() 2003 REFRESH preemption.writeLock().lock() 2004 REFRESH queue.readLock().lock() 2005 REFRESH queue.readLock().unlock() 2005 REFRESH preemption.writeLock().unlock() 5026 SCHEDULE preemption.readLock().lock() 5027 SCHEDULE preemption.readLock().unlock() 5027 SCHEDULE queue.readLock().unlock() 5027 COMPLETE queue.writeLock().unlock() ``` ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'YARN-11191. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Global Scheduler refreshQueue cause deadLock > - >
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836047#comment-17836047 ] ASF GitHub Bot commented on YARN-11191: --- tomicooler commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-2049153875 Hi @yb12138 @goiri, is there any plan to fix the remaining review comments and get this merged? If there is no time, may I take over? (I'll create a new PR) > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605924#comment-17605924 ] ASF GitHub Bot commented on YARN-11191: --- goiri commented on code in PR #4726: URL: https://github.com/apache/hadoop/pull/4726#discussion_r973210207 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java: ## @@ -1384,7 +1385,19 @@ public List getChildQueues() { } } - + + @Override + public List getChildQueuesByTryLock() { +try { + while (!readLock.tryLock()){ Review Comment: Why not just a regular lock()? ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java: ## @@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception { Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize()); rm1.close(); } + @Test + public void testRefreshQueueWithOpenPreemption() throws Exception { +CapacitySchedulerConfiguration csConf Review Comment: The line limit is 100 chars so this should fit. ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/preemption/PreemptionManager.java: ## @@ -25,10 +25,7 @@ import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue; import org.apache.hadoop.yarn.util.resource.Resources; -import java.util.Collections; Review Comment: Avoid. ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java: ## @@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception { Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize()); rm1.close(); } + @Test + public void testRefreshQueueWithOpenPreemption() throws Exception { +CapacitySchedulerConfiguration csConf += new CapacitySchedulerConfiguration(); +csConf.setQueues(CapacitySchedulerConfiguration.ROOT, +new String[] {"a"}); +csConf.setCapacity("root.a", 100); +csConf.setMaximumCapacity("root.a", 100); +csConf.setUserLimitFactor("root.a", 100); + +YarnConfiguration conf=new YarnConfiguration(csConf); +conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class, +ResourceScheduler.class); +RMNodeLabelsManager mgr=new NullRMNodeLabelsManager(); +mgr.init(conf); +MockRM rm1 = new MockRM(csConf); +CapacityScheduler scheduler=(CapacityScheduler) rm1.getResourceScheduler(); +PreemptionManager preemptionManager = scheduler.getPreemptionManager();; Review Comment: ;; ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java: ## @@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception { Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize()); rm1.close(); } + @Test + public void testRefreshQueueWithOpenPreemption() throws Exception { +CapacitySchedulerConfiguration csConf += new CapacitySchedulerConfiguration(); +csConf.setQueues(CapacitySchedulerConfiguration.ROOT, +new String[] {"a"}); +csConf.setCapacity("root.a", 100); +csConf.setMaximumCapacity("root.a", 100); +csConf.setUserLimitFactor("root.a", 100); + +YarnConfiguration conf=new YarnConfiguration(csConf); +conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class, +ResourceScheduler.class); +RMNodeLabelsManager mgr=new NullRMNodeLabelsManager(); Review Comment: Spaces ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java: ## @@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception { Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize()); rm1.close(); } + @Test + public void testRefreshQueueWithOpenPreemption() throws Exception { +CapacitySchedulerConfiguration csConf += new CapacitySchedulerConfiguration(); +csConf.setQueues(CapacitySchedulerConfiguration.ROOT, +new String[] {"a"}); +csConf.setCapacity("root.a", 100); +csConf.setMaximumCapacity("root.a", 100); +
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605882#comment-17605882 ] ben yang commented on YARN-11191: - Could you kindly take a look for the pr? Ths. [~elgoiri] > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605762#comment-17605762 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1249184324 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 40m 31s | | trunk passed | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 16s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 5s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 16s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 2m 28s | | trunk passed | | +1 :green_heart: | shadedclient | 23m 21s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 57s | | the patch passed | | +1 :green_heart: | compile | 1m 1s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 1s | | the patch passed | | +1 :green_heart: | compile | 0m 54s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 0m 54s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 49s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/2/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 8 new + 138 unchanged - 0 fixed = 146 total (was 138) | | +1 :green_heart: | mvnsite | 0m 58s | | the patch passed | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 2m 4s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 103m 9s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 49s | | The patch does not generate ASF License warnings. | | | | 207m 54s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4726 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 11f01de99262 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 0547b173b9e01e11633f045eca0e75dc8c417c8f | | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/2/testReport/ | | Max.
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605761#comment-17605761 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1249182002 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 42s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 37m 54s | | trunk passed | | +1 :green_heart: | compile | 1m 10s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 6s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 2m 4s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 18s | | trunk passed | | +1 :green_heart: | javadoc | 1m 13s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 0m 59s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 2m 22s | | trunk passed | | +1 :green_heart: | shadedclient | 23m 46s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 56s | | the patch passed | | +1 :green_heart: | compile | 1m 7s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 7s | | the patch passed | | +1 :green_heart: | compile | 0m 57s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 0m 57s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 44s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/3/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 8 new + 138 unchanged - 0 fixed = 146 total (was 138) | | +1 :green_heart: | mvnsite | 1m 4s | | the patch passed | | +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 0m 44s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 2m 2s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 36s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 99m 12s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 43s | | The patch does not generate ASF License warnings. | | | | 201m 55s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4726 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 4e2b917b27d3 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 281c169aca12bd5e932295c6eb78e69ab9d28e86 | | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/3/testReport/ | | Max.
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579571#comment-17579571 ] ASF GitHub Bot commented on YARN-11191: --- shellwjl commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1214712685 > Global Scheduler has muti-threads to assign containers normally. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578851#comment-17578851 ] ASF GitHub Bot commented on YARN-11191: --- luoyuan3471 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212840341 > @luoyuan3471 1.The key to deadlock is that refresh thread can‘t acquire csqueue read lock. The read lock request is blocked by a write lock (as: https://bugs.openjdk.org/browse/JDK-6893626).so i use tryLock to break the condition.The PremmptionManager lock will be released soon after refresh thread gets csqueue read lock. 2.just preemption, but global scheduler increases the chance For 2, Why does Global Scheduler increase the chance of this dead lock case? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578850#comment-17578850 ] ASF GitHub Bot commented on YARN-11191: --- luoyuan3471 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212838719 > @luoyuan3471 yes! ReadLock.lock() will call function(tryAcquireShared), this method will check whether the waiting queue first node is shared. if false, lock method will be blocked even if sharedCount(c) >0.But readLock.tryLock() do not need the check, it can get the csqueue readLock directly. you can diff tryReadLock() and Lock(). Thanks for your explanation. I checked the code. You're right! > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578539#comment-17578539 ] ASF GitHub Bot commented on YARN-11191: --- yb12138 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212163886 @luoyuan3471 yes! ReadLock.lock() will call function(tryAcquireShared), this method will check whether the waiting queue first node is shared. if false, lock method will be blocked even if sharedCount(c) >0.But readLock.tryLock() do not need the check, it can get the csqueue readLock directly. you can diff tryReadLock() and Lock(), it's an interesting design. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578521#comment-17578521 ] ben yang commented on YARN-11191: - it's so cool! I will try to add a test like this! Thanks! [~luoyuan] > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578409#comment-17578409 ] Yuan Luo commented on YARN-11191: - {code:java} public class ThreadLockTest { private static DateFormat sdf = new SimpleDateFormat("-MM-dd HH:mm:ss"); public static void BeforeFixDeadLock() throws InterruptedException { System.out.println( "BeforeFixDeadLock test start, this will cause deadlock.."); ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock(); ReentrantReadWriteLock.ReadLock queueReadLock = queueLock.readLock(); ReentrantReadWriteLock.WriteLock queueWriteLock = queueLock.writeLock(); ReentrantReadWriteLock premmptionLock = new ReentrantReadWriteLock(); ReentrantReadWriteLock.ReadLock premmptionReadLock = premmptionLock.readLock(); ReentrantReadWriteLock.WriteLock premmptionWriteLock = premmptionLock.writeLock(); Thread schedulerThread = new Thread(() -> { System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread start!"); //hold: csqueue.readLock queueReadLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread get queueReadLock!"); try { Thread.sleep(1000 * 15); } catch (InterruptedException e) { e.printStackTrace(); } //require: PremmptionManager.readLock premmptionReadLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread get premmptionReadLock!"); premmptionReadLock.unlock(); queueReadLock.unlock(); System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread finish!"); }); Thread refreshQueueThread = new Thread(() -> { System.out.println("current time: " + sdf.format(new Date()) + ", refreshQueueThread start!"); //hold: PremmptionManager.writeLock premmptionWriteLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", refreshQueueThread get premmptionWriteLock!"); try { Thread.sleep(1000 * 10); } catch (InterruptedException e) { e.printStackTrace(); } //require: csqueue.readLock queueReadLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", refreshQueueThread get queueReadLock!"); queueReadLock.unlock(); premmptionWriteLock.unlock(); System.out.println("current time: " + sdf.format(new Date()) + ", refreshQueueThread finish!"); }); Thread otherThread = new Thread(() -> { //make otherThread request queue write lock after schedule thread hold // queue write lock, and before refres thread to get queue read lock try { Thread.sleep(1000 * 5); } catch (InterruptedException e) { e.printStackTrace(); } System.out.println( "current time: " + sdf.format(new Date()) + ", otherThread start!"); queueWriteLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", otherThread get queueWriteLock!"); queueWriteLock.unlock(); System.out.println( "current time: " + sdf.format(new Date()) + ", otherThread finish!"); }); schedulerThread.start(); refreshQueueThread.start(); otherThread.start(); refreshQueueThread.join(); schedulerThread.join(); otherThread.join(); } public static void AfterFixDeadLock() throws InterruptedException { System.out.println("AfterFixDeadLock test start.."); ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock(); ReentrantReadWriteLock.ReadLock queueReadLock = queueLock.readLock(); ReentrantReadWriteLock.WriteLock queueWriteLock = queueLock.writeLock(); ReentrantReadWriteLock premmptionLock = new ReentrantReadWriteLock(); ReentrantReadWriteLock.ReadLock premmptionReadLock = premmptionLock.readLock(); ReentrantReadWriteLock.WriteLock premmptionWriteLock = premmptionLock.writeLock(); Thread schedulerThread = new Thread(() -> { System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread start!"); //hold: csqueue.readLock queueReadLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread get queueReadLock!"); try { Thread.sleep(1000 * 15); } catch (InterruptedException e) { e.printStackTrace(); } //require: PremmptionManager.readLock premmptionReadLock.lock(); System.out.println("current time: " + sdf.format(new Date()) + ", schedulerThread get premmptionReadLock!"); premmptionReadLock.unlock(); queueReadLock.unlock(); System.out.println("current time: " +
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578355#comment-17578355 ] ASF GitHub Bot commented on YARN-11191: --- luoyuan3471 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1211760018 > @luoyuan3471 ![未命名](https://user-images.githubusercontent.com/29743168/183837904-187ebe71-d5a6-474c-948d-1160f0d3407e.png) you can see this image. this problem will happen when refresh thread is calling PreemptionManager.refreshQueue and schedule thread is calling AbstractCSQueue.getTotalKillableResource.At this time, refresh thread will require csqueue.readLock,but csqueue.readLock will blocked by schedule thread and "other thread"( https://bugs.openjdk.org/browse/JDK-6893626 ).And schedule thread will require PremmptionManager.readLock,but this readLock will blocked by refresh thread held writeLock. so i use tryLock to make refresh thread get csqueue.readLock. Wait for the refresh thread complete PreemptionManager.refreshQueue,the schedule thread will get premmptionManager.readLock, then can allocate new container. --- Do you mean readLock.tryLock() will make readLock place first ,though a write lock request is already in the head of waiting queue? @yb12138 > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577886#comment-17577886 ] ASF GitHub Bot commented on YARN-11191: --- hadoop-yetus commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210402542 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 41m 33s | | trunk passed | | +1 :green_heart: | compile | 1m 16s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | compile | 1m 8s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | checkstyle | 1m 5s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 13s | | trunk passed | | +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 2m 19s | | trunk passed | | +1 :green_heart: | shadedclient | 24m 43s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 56s | | the patch passed | | +1 :green_heart: | compile | 1m 6s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javac | 1m 6s | | the patch passed | | +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | javac | 0m 55s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 46s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 95 unchanged - 0 fixed = 97 total (was 95) | | +1 :green_heart: | mvnsite | 1m 0s | | the patch passed | | +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 | | +1 :green_heart: | javadoc | 0m 41s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | +1 :green_heart: | spotbugs | 2m 11s | | the patch passed | | +1 :green_heart: | shadedclient | 24m 45s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 102m 29s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 42s | | The patch does not generate ASF License warnings. | | | | 211m 50s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/4726 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux f7c0894bff81 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 65650a20ef43e8eb3311c63e3a6b0585e9e2d95a | | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577834#comment-17577834 ] ASF GitHub Bot commented on YARN-11191: --- yb12138 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210271008 @luoyuan3471 ![未命名](https://user-images.githubusercontent.com/29743168/183837904-187ebe71-d5a6-474c-948d-1160f0d3407e.png) you can see this image. this problem will occupy when refresh thread is calling PreemptionManager.refreshQueue and schedule thread is calling AbstractCSQueue.getTotalKillableResource.At this time, refresh thread will require csqueue.readLock,but csqueue.readLock will blocked by schedule thread and "other thread"( https://bugs.openjdk.org/browse/JDK-6893626 ).And schedule thread will require PremmptionManager.readLock,but this readLock will blocked by refresh thread held writeLock. so i use tryLock to make refresh thread get csqueue.readLock. Wait for the refresh thread complete PreemptionManager.refreshQueue,the schedule thread will get premmptionManager.readLock, then can allocate new container. > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577807#comment-17577807 ] ASF GitHub Bot commented on YARN-11191: --- luoyuan3471 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210237613 CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock require: csqueue.readLock CapacityScheduler.schedule: hold: csqueue.readLock require: PremmptionManager.readLock other thread(completeContainer,release Resource,etc.): require: csqueue.writeLock @yb12138 schedule thread hold csqueue.readLock and it is blocked by PremmptionManager.readLock , and PremmptionManager.writeLock is hold by refreshQueue thread, seems refreshQueue have no chance to get csqueue.readLock. Very sorry, I'm still a little confused on this point. Can you explain more about it? Thank you! > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577800#comment-17577800 ] ASF GitHub Bot commented on YARN-11191: --- yb12138 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210219429 @luoyuan3471 1.The key to deadlock is that the refresh thread can‘t acquire the csqueue read lock. beacuse it’s read lock request is blocked by a write lock (as: https://bugs.openjdk.org/browse/JDK-6893626).so i use tryLock to break the conditions for deadlock.The PremmptionManager lock will be released soon after it gets csqueue read lock. 2.just preemption, but global scheduler increases the chance of deadlock > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577784#comment-17577784 ] Yuan Luo commented on YARN-11191: - I want to ask two question: + @Override + public List getChildQueuesByTryLock() { + try { + while (!readLock.tryLock()){ + LockSupport.parkNanos(1); + } + return new ArrayList(childQueues); + } finally { + readLock.unlock(); + } + } 1.Though you use tryLock and park, so refresh queue thread switch to block state, but this thread still hold PremmptionManager lock ,so scheduler thread still can't allocate new container. Is it right? 2.Does this issue related to global Scheduler or just the preemption function? Looking forward to your reply, thanks! > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577780#comment-17577780 ] ASF GitHub Bot commented on YARN-11191: --- luoyuan3471 commented on PR #4726: URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210199622 @yb12138 I want to ask two question: + @Override + public List getChildQueuesByTryLock() { +try { + while (!readLock.tryLock()){ +LockSupport.parkNanos(1); + } + return new ArrayList(childQueues); +} finally { + readLock.unlock(); +} + } 1.Though you use tryLock and park, so refresh queue thread switch to block state, but this thread still hold PremmptionManager lock ,so scheduler thread still can allocate new container. Is it right? 2.Does this issue related to global Scheduler or just the preemption function? Looking forward to your reply, thanks! > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1757#comment-1757 ] ben yang commented on YARN-11191: - it is difficult to add a test, because this problem under concurrency not always happen > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Labels: pull-request-available > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1754#comment-1754 ] ASF GitHub Bot commented on YARN-11191: --- yb12138 opened a new pull request, #4726: URL: https://github.com/apache/hadoop/pull/4726 ### Description of PR ### How was this patch tested? ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577643#comment-17577643 ] Íñigo Goiri commented on YARN-11191: Do you mind preparing a PR? Is it possible to add a test? > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577260#comment-17577260 ] Ryan Wu commented on YARN-11191: We do refreshQueue periodically ,not always reproduce this problem. But once happened, it leads to cluster can not assign resource. [~aajisaka] > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock
[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577178#comment-17577178 ] Yuan Luo commented on YARN-11191: - We meet a same issue when refresh yarn queue, the refresh thread stuck in below function: {code:java} 'preemptionManager.refreshQueues(null, this.getRootQueue()) {code} Can you help have a look at this issue. [~elgoiri] [~aajisaka] > Global Scheduler refreshQueue cause deadLock > - > > Key: YARN-11191 > URL: https://issues.apache.org/jira/browse/YARN-11191 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0 >Reporter: ben yang >Priority: Major > Attachments: 1.jstack, YARN-11191.001.patch > > > This is a potential bug may impact all open premmption cluster.In our > current version with preemption enabled, the capacityScheduler will call the > refreshQueue method of the PreemptionManager when it refreshQueue. This > process hold the preemptionManager write lock and require csqueue read > lock.Meanwhile,ParentQueue.canAssignToThisQueue will hold csqueue readLock > and require PreemptionManager ReadLock. > There is a possibility of deadlock at this time.Because readlock has one rule > on unfair policy, when a lock is already occupied by a read lock and the > first request in the lock competition queue is a write lock request,other > read lock requests cann‘t acquire the lock. > So the potential deadlock is: > {code:java} > CapacityScheduler.refreshQueue: hold: PremmptionManager.writeLock > require: csqueue.readLock > CapacityScheduler.schedule: hold: csqueue.readLock > require: PremmptionManager.readLock > other thread(completeContainer,release Resource,etc.): require: > csqueue.writeLock > {code} > The jstack logs at the time were as follows -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org