[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841592#comment-17841592
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081364250

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m 01s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  spotbugs  |   0m 01s |  |  spotbugs executables are not 
available.  |
   | +0 :ok: |  codespell  |   0m 01s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m 01s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m 00s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m 00s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.4 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  88m 19s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  compile  |   5m 43s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  checkstyle  |   4m 35s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  mvnsite  |   5m 36s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  javadoc  |   5m 31s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  shadedclient  | 148m 13s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   3m 06s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   3m 06s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m 00s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   2m 11s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 55s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  | 155m 05s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 323m 16s | 
[/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6768/3/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | -1 :x: |  asflicense  |   5m 28s | 
[/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6768/3/artifact/out/results-asflicense.txt)
 |  The patch generated 2 ASF License warnings.  |
   |  |   | 747m 18s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.yarn.server.resourcemanager.metrics.TestCombinedSystemMetricsPublisher |
   |   | hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore |
   |   | hadoop.yarn.server.resourcemanager.resource.TestResourceProfiles |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestFSSchedulerConfigurationStore
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestLeveldbConfigurationStore
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCSAllocateCustomResource
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestQueueConfigurationAutoRefreshPolicy
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigArgumentHandler
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverter
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSQueueConverter
 |
   |   | 
hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
 |
   |   | hadoop.yarn.server.resourcemanager.TestClientRMService |
   |   | hadoop.yarn.server.resourcemanager.TestSignalContainer |
   |   | hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart |
   |   | hadoop.yarn.server.resourcemanager.volume.csi.TestVolumeProcessor |
   |   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication |
   |   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppCustomResourceTypes
 |
   |   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsCustomResourceTypes
 |
   |   | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesContainers |
   |   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication
 |
   |   | 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841549#comment-17841549
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081257034

   @tomicooler Thanks for the contribution!




> Global Scheduler refreshQueue cause deadLock 
> -
>
> Key: YARN-11191
> URL: https://issues.apache.org/jira/browse/YARN-11191
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0
>Reporter: ben yang
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch
>
>
> This is a potential bug that may impact any cluster with preemption enabled.
> In our current version, with preemption enabled, the CapacityScheduler calls
> the refreshQueue method of the PreemptionManager when it refreshes queues.
> That path holds the PreemptionManager write lock and requires the csqueue
> read lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue
> read lock and requires the PreemptionManager read lock.
> There is a possibility of deadlock here, because the read lock has one rule
> even under the non-fair policy: when the lock is already held by a reader
> and the first request in the lock's wait queue is a write lock request,
> other read lock requests cannot acquire the lock.
> So the potential deadlock is:
> {code:java}
> CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock
>                                 require: csqueue.readLock
> CapacityScheduler.schedule:     hold: csqueue.readLock
>                                 require: PreemptionManager.readLock
> other thread (completeContainer, release resource, etc.): require: csqueue.writeLock
> {code}
> The jstack logs at the time were as follows
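
For illustration (not part of the original Jira text): a minimal, self-contained
Java sketch of the read-lock rule described above, using a plain
java.util.concurrent ReentrantReadWriteLock. Once a writer is queued, a later
reader parks behind it even though the lock is only read-held, while tryLock()
can still barge in; the former is the ingredient that closes the deadlock cycle
above, the latter is the escape hatch the eventual fix relies on. The class name
and the sleep-based sequencing are invented for the demo.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReaderBlocksBehindQueuedWriter {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair

    lock.readLock().lock();                 // main thread read-holds the lock

    Thread writer = new Thread(() -> {      // writer queues behind the reader
      lock.writeLock().lock();
      lock.writeLock().unlock();
    });
    writer.start();
    Thread.sleep(200);                      // crude: let the writer enqueue

    Thread reader = new Thread(() -> {      // second reader parks behind the
      lock.readLock().lock();               // queued writer, not behind the lock
      lock.readLock().unlock();
    });
    reader.start();
    Thread.sleep(200);

    System.out.println("second reader: " + reader.getState());          // WAITING
    System.out.println("tryLock barges: " + lock.readLock().tryLock()); // true

    lock.readLock().unlock();               // release both read holds
    lock.readLock().unlock();
    writer.join();
    reader.join();
  }
}
{code}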



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841548#comment-17841548
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 merged PR #6768:
URL: https://github.com/apache/hadoop/pull/6768







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841541#comment-17841541
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081187638

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   6m 57s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.4 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 44s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 16s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  shadedclient  |  20m 32s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 27s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 28s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 24s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  87m 33s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 26s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 179m  4s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6768 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 10200f517789 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.4 / 31f1424fcd7011fdd78d507e9482cffee9bcd09b |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/3/testReport/ |
   | Max. process+thread count | 950 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/3/console |
   | versions | git=2.25.1 maven=3.6.3 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841525#comment-17841525
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2081123115

   @slfan1989 Yes, sure. Re-triggered.







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841465#comment-17841465
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2080551000

   @tomicooler Can we retrigger compilation?







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841464#comment-17841464
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 commented on PR #6769:
URL: https://github.com/apache/hadoop/pull/6769#issuecomment-2080548182

   @tomicooler Thanks for the contribution!







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841463#comment-17841463
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 merged PR #6769:
URL: https://github.com/apache/hadoop/pull/6769







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840745#comment-17840745
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2076847309

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.4 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 50s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 36s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m  9s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  shadedclient  |  20m 36s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   0m 25s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 27s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 27s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 31s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  87m  5s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 26s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 172m 52s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6768 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 111d4e1feebc 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.4 / 985da47eda7acc4d92a0a5c4095c164f33520725 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/2/testReport/ |
   | Max. process+thread count | 955 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/2/console |
   | versions | git=2.25.1 maven=3.6.3 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840543#comment-17840543
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6769:
URL: https://github.com/apache/hadoop/pull/6769#issuecomment-2075632733

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  11m  0s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  50m 58s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 57s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 47s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m  1s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 48s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 58s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  41m 29s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 56s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 51s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   1m 58s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m 38s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  95m 54s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 38s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 254m 50s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6769/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6769 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux a214af58c8fa 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 7c732b60916ebadef8ec7b06a634aaef28a76047 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6769/1/testReport/ |
   | Max. process+thread count | 857 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6769/1/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840493#comment-17840493
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6768:
URL: https://github.com/apache/hadoop/pull/6768#issuecomment-2075284019

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.4 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m  3s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 34s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 38s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  branch-3.4 passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  branch-3.4 passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 10s |  |  branch-3.4 passed  |
   | +1 :green_heart: |  shadedclient  |  20m 29s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | -1 :x: |  mvninstall  |   0m 23s | 
[/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | -1 :x: |  compile  |   0m 27s | 
[/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-yarn-server-resourcemanager in the patch failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  javac  |   0m 27s | 
[/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.txt)
 |  hadoop-yarn-server-resourcemanager in the patch failed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1.  |
   | -1 :x: |  compile  |   0m 22s | 
[/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt)
 |  hadoop-yarn-server-resourcemanager in the patch failed with JDK Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.  |
   | -1 :x: |  javac  |   0m 22s | 
[/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6768/1/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.txt)
 |  hadoop-yarn-server-resourcemanager in the patch failed with JDK Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06.  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 24s |  |  the patch passed  |
   | -1 :x: |  mvnsite  |   0m 24s | 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840457#comment-17840457
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler opened a new pull request, #6769:
URL: https://github.com/apache/hadoop/pull/6769

   … (#6732)
   
   (cherry picked from commit ecf665c6facf89d3b87b6e3cc684274b8155ca60)
   Change-Id: I561bcad51af7810328c8b91cd9290d5198be0c6e
   
   
   
   ### Description of PR
   
   Backport; there were conflicts (AbstractParentQueue/AbstractLeafQueue, and 
QueuePath doesn't exist here yet).
   
   Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191)
   Original PR: #6732
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840432#comment-17840432
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler opened a new pull request, #6768:
URL: https://github.com/apache/hadoop/pull/6768

   … (#6732)
   
   (cherry picked from commit ecf665c6facf89d3b87b6e3cc684274b8155ca60)
   
   
   
   ### Description of PR
   
   backport.
   
   Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191)
   Original PR: #6732
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840428#comment-17840428
 ] 

ASF GitHub Bot commented on YARN-11191:
---

brumi1024 merged PR #6732:
URL: https://github.com/apache/hadoop/pull/6732







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840427#comment-17840427
 ] 

ASF GitHub Bot commented on YARN-11191:
---

brumi1024 commented on PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2074890741

   Thanks for the patch @tomicooler and @slfan1989 @p-szucs for the review, 
merging to trunk. @tomicooler can you please check if a backport is possible to 
the branches 3.3 and 3.4?







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840239#comment-17840239
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2073537387

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m 01s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  spotbugs  |   0m 01s |  |  spotbugs executables are not 
available.  |
   | +0 :ok: |  codespell  |   0m 01s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m 01s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m 00s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m 00s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  | 106m 35s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 48s |  |  trunk passed  |
   | +1 :green_heart: |  checkstyle  |   5m 43s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   6m 51s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   6m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  | 174m 58s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   4m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   4m 06s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   4m 06s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m 00s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   2m 38s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 57s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   3m 27s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  | 190m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   6m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 502m 33s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | GITHUB PR | https://github.com/apache/hadoop/pull/6732 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | MINGW64_NT-10.0-17763 0cbe62357619 3.4.10-87d57229.x86_64 
2024-02-14 20:17 UTC x86_64 Msys |
   | Build tool | maven |
   | Personality | /c/hadoop/dev-support/bin/hadoop.sh |
   | git revision | trunk / 73c01ccb77ca85cd83785726b55841fe8b94e973 |
   | Default Java | Azul Systems, Inc.-1.8.0_332-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6732/1/testReport/
 |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch-windows-10/job/PR-6732/1/console
 |
   | versions | git=2.44.0.windows.1 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840042#comment-17840042
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2071872714

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 18s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 32s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 35s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 38s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 10s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 37s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 22s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 45s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  90m  8s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 25s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 176m 11s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6732 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e28442f7b76e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 73c01ccb77ca85cd83785726b55841fe8b94e973 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/2/testReport/ |
   | Max. process+thread count | 941 (vs. ulimit of 5500) |
   | modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839966#comment-17839966
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler commented on code in PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#discussion_r1575731004


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java:
##
@@ -1347,6 +1348,18 @@ public List<CSQueue> getChildQueues() {
 
   }
 
+  @Override
+  public List<CSQueue> getChildQueuesByTryLock() {
+    try {
+      while (!readLock.tryLock()) {
+        LockSupport.parkNanos(1);

Review Comment:
   Ok, thanks. Fixed the checkstyle issues.
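   
   For readers of the archive: the hunk above is truncated. A plausible shape of the complete method, as a sketch only (it assumes the class's `readLock` field and the usual `List`/`ArrayList`/`LockSupport` imports; the copy of `childQueues` is a guess, and the spin loop is kept outside the `try` so that `unlock()` only runs once the lock is actually held):
   ```java
   @Override
   public List<CSQueue> getChildQueuesByTryLock() {
     // Barge-acquire the read lock: tryLock() succeeds even while a writer
     // is queued, which breaks the refreshQueue/schedule deadlock cycle.
     while (!readLock.tryLock()) {
       LockSupport.parkNanos(1);
     }
     try {
       return new ArrayList<>(childQueues);  // copy under the lock (assumption)
     } finally {
       readLock.unlock();
     }
   }
   ```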





> Global Scheduler refreshQueue cause deadLock 
> -
>
> Key: YARN-11191
> URL: https://issues.apache.org/jira/browse/YARN-11191
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0
>Reporter: ben yang
>Priority: Major
>  Labels: pull-request-available
> Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch
>
>
> This is a potential bug that may impact every cluster with preemption enabled.
> In our current version with preemption enabled, the CapacityScheduler calls the
> refreshQueue method of the PreemptionManager whenever it refreshes queues. This
> process holds the PreemptionManager write lock and requires the csqueue read
> lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock
> and requires the PreemptionManager read lock.
> There is a possibility of deadlock here, because the read lock has one rule
> under the non-fair policy: when the lock is already held by a read lock and the
> first request in the lock's wait queue is a write lock request, other read
> lock requests cannot acquire the lock.
> So the potential deadlock is:
> {code:java}
> CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock
> require: csqueue.readLock
> CapacityScheduler.schedule: hold: csqueue.readLock
> require: PreemptionManager.readLock
> other thread (completeContainer, releaseResource, etc.): require:
> csqueue.writeLock
> {code}
> The jstack logs at the time were as follows:






[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839888#comment-17839888
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 commented on code in PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#discussion_r1575454919


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java:
##
@@ -1347,6 +1348,18 @@ public List<CSQueue> getChildQueues() {
 
   }
 
+  @Override
+  public List<CSQueue> getChildQueuesByTryLock() {
+    try {
+      while (!readLock.tryLock()) {
+        LockSupport.parkNanos(1);

Review Comment:
   I have no other comments. We need to fix checkstyle, then we can merge this PR.








[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837794#comment-17837794
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler commented on code in PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#discussion_r1567660577


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java:
##
@@ -1347,6 +1348,18 @@ public List<CSQueue> getChildQueues() {
 
   }
 
+  @Override
+  public List<CSQueue> getChildQueuesByTryLock() {
+    try {
+      while (!readLock.tryLock()) {
+        LockSupport.parkNanos(1);

Review Comment:
   I don't know how useful it is to make this value configurable; the only reason to increase it would be if the loop were still too busy at the actual ~10 thousand nanoseconds (= 10 microseconds) per park.
   
   https://hazelcast.com/blog/locksupport-parknanos-under-the-hood-and-the-curious-case-of-parking/
   
   Based on this, the granularity should be around 50 microseconds anyway.
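   
   A quick way to sanity-check that on a given box (my own standalone sketch, not part of the patch; the numbers are OS- and JVM-dependent):
   ```java
   import java.util.concurrent.locks.LockSupport;
   
   public class ParkGranularity {
     public static void main(String[] args) {
       // Warm up the JIT and the OS timer path first.
       for (int i = 0; i < 1_000; i++) {
         LockSupport.parkNanos(1);
       }
       int iterations = 10_000;
       long start = System.nanoTime();
       for (int i = 0; i < iterations; i++) {
         LockSupport.parkNanos(1);  // requests 1 ns; the OS rounds this up
       }
       long avg = (System.nanoTime() - start) / iterations;
       System.out.println("avg parkNanos(1): " + avg + " ns");
     }
   }
   ```
   On a typical Linux host this tends to print tens of microseconds, consistent with the granularity figure above.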








[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837737#comment-17837737
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 commented on code in PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#discussion_r1567460352


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractParentQueue.java:
##
@@ -1347,6 +1348,18 @@ public List<CSQueue> getChildQueues() {
 
   }
 
+  @Override
+  public List<CSQueue> getChildQueuesByTryLock() {
+    try {
+      while (!readLock.tryLock()) {
+        LockSupport.parkNanos(1);

Review Comment:
   Is it better for us to add a configuration? It would be used to configure the `1` (the park interval).
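   
   For illustration only (see tomicooler's reply above questioning how useful this would be), a sketch of such a knob. The property name is invented here; no such key exists in YarnConfiguration:
   ```java
   import org.apache.hadoop.conf.Configuration;
   import java.util.concurrent.locks.LockSupport;
   import java.util.concurrent.locks.ReentrantReadWriteLock;
   
   class ConfigurablePark {
     // Hypothetical property name, made up for this sketch.
     static final String PARK_NANOS_KEY =
         "yarn.scheduler.capacity.refresh-queue.park-nanos";
   
     static void bargeReadLock(ReentrantReadWriteLock.ReadLock readLock,
         Configuration conf) {
       long parkNanos = conf.getLong(PARK_NANOS_KEY, 1L);
       while (!readLock.tryLock()) {
         LockSupport.parkNanos(parkNanos);  // back off for the configured interval
       }
     }
   }
   ```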








[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837341#comment-17837341
 ] 

ASF GitHub Bot commented on YARN-11191:
---

p-szucs commented on PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2057246374

   Thanks @tomicooler, LGTM 







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837326#comment-17837326
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2057144840

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   7m 33s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 59s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 38s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  trunk passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m  7s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 27s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   0m 26s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 25s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 139 unchanged - 0 fixed = 141 total (was 139)  |
   | +1 :green_heart: |  mvnsite  |   0m 26s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 26s |  |  the patch passed with JDK 
Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 27s |  |  the patch passed with JDK 
Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  89m 55s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 182m 42s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6732 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 8af00fba70d5 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 
15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / aeb1b430e30affd6549348feda637ba93b45a298 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6732/1/testReport/ |
   | Max. 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837295#comment-17837295
 ] 

ASF GitHub Bot commented on YARN-11191:
---

slfan1989 commented on PR #6732:
URL: https://github.com/apache/hadoop/pull/6732#issuecomment-2056988311

   LGTM







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837222#comment-17837222
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler opened a new pull request, #6732:
URL: https://github.com/apache/hadoop/pull/6732

   Change-Id: I4ca3998748494c1148b1ef79b33500eecae02a97
   
   
   
   ### Description of PR
   
   Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191)
   Original PR: https://github.com/apache/hadoop/pull/4726
   
   The investigation and the fix were done by @yb12138; see the original PR for further details.
   
   I addressed the leftover review comments and slightly modified the new test code so that it actually tests `refreshQueues`. The new test fails without the fix.
   
   ### How was this patch tested?
   
   I created a sample application for demonstration:
   ```java
   import java.util.concurrent.locks.LockSupport;
   import java.util.concurrent.locks.ReentrantReadWriteLock;
   
   public class Main {
   
     static final long start = System.currentTimeMillis();
   
     public static void main(String[] args) {
       ReentrantReadWriteLock queue = new ReentrantReadWriteLock();
       ReentrantReadWriteLock preemption = new ReentrantReadWriteLock();
   
       Thread schedulerThread = new Thread(() -> {
         log("queue.readLock().lock()");
         queue.readLock().lock();
         try {
           Thread.sleep(5 * 1000);
         } catch (InterruptedException e) {
           e.printStackTrace();
         }
         log("preemption.readLock().lock()");
         preemption.readLock().lock();
         log("preemption.readLock().unlock()");
         preemption.readLock().unlock();
         log("queue.readLock().unlock()");
         queue.readLock().unlock();
       }, "SCHEDULE");
   
       Thread completeThread = new Thread(() -> {
         try {
           Thread.sleep(1000);
         } catch (InterruptedException e) {
           e.printStackTrace();
         }
         log("queue.writeLock().lock()");
         queue.writeLock().lock();
         log("queue.writeLock().unlock()");
         queue.writeLock().unlock();
       }, "COMPLETE");
   
       Thread refreshThread = new Thread(() -> {
         try {
           Thread.sleep(2 * 1000);
         } catch (InterruptedException e) {
           e.printStackTrace();
         }
         log("preemption.writeLock().lock()");
         preemption.writeLock().lock();
         log("queue.readLock().lock()");
         // OUTPUT_A
         queue.readLock().lock();
         // OUTPUT_B
         //while (!queue.readLock().tryLock()) {
         //  LockSupport.parkNanos(1);
         //}
         log("queue.readLock().unlock()");
         queue.readLock().unlock();
         log("preemption.writeLock().unlock()");
         preemption.writeLock().unlock();
       }, "REFRESH ");
   
       schedulerThread.start();
       completeThread.start();
       refreshThread.start();
   
       try {
         schedulerThread.join();
         completeThread.join();
         refreshThread.join();
       } catch (InterruptedException e) {
         throw new RuntimeException(e);
       }
     }
   
     static void log(String what) {
       System.out.printf("%04d %s %s%n", System.currentTimeMillis() - start,
           Thread.currentThread().getName(), what);
     }
   }
   ```
   
   `OUTPUT_A` deadlock happens.
   ```
   0005 SCHEDULE queue.readLock().lock()
   1007 COMPLETE queue.writeLock().lock()
   2006 REFRESH  preemption.writeLock().lock()
   2006 REFRESH  queue.readLock().lock()
   5042 SCHEDULE preemption.readLock().lock()
   ```
   
![threads](https://github.com/apache/hadoop/assets/7456325/d35d8f2a-ab62-4fed-ba5e-d7c352cb504d)
   
   `OUTPUT_B` deadlock does not happen. (tryLock)
   ```
   0003 SCHEDULE queue.readLock().lock()
   1003 COMPLETE queue.writeLock().lock()
   2003 REFRESH  preemption.writeLock().lock()
   2004 REFRESH  queue.readLock().lock()
   2005 REFRESH  queue.readLock().unlock()
   2005 REFRESH  preemption.writeLock().unlock()
   5026 SCHEDULE preemption.readLock().lock()
   5027 SCHEDULE preemption.readLock().unlock()
   5027 SCHEDULE queue.readLock().unlock()
   5027 COMPLETE queue.writeLock().unlock()
   ```
   
   
   ### For code changes:
   
   - [x] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'YARN-11191. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2024-04-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836047#comment-17836047
 ] 

ASF GitHub Bot commented on YARN-11191:
---

tomicooler commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-2049153875

   Hi @yb12138 @goiri,
   
   is there any plan to fix the remaining review comments and get this merged?
   If there is no time, may I take over? (I'll create a new PR)







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-09-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605924#comment-17605924
 ] 

ASF GitHub Bot commented on YARN-11191:
---

goiri commented on code in PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#discussion_r973210207


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java:
##
@@ -1384,7 +1385,19 @@ public List<CSQueue> getChildQueues() {
     }
 
   }
-  
+
+  @Override
+  public List<CSQueue> getChildQueuesByTryLock() {
+    try {
+      while (!readLock.tryLock()) {

Review Comment:
   Why not just a regular lock()?



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java:
##
@@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception {
     Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize());
     rm1.close();
   }
+  @Test
+  public void testRefreshQueueWithOpenPreemption() throws Exception {
+    CapacitySchedulerConfiguration csConf

Review Comment:
   The line limit is 100 chars so this should fit.



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/preemption/PreemptionManager.java:
##
@@ -25,10 +25,7 @@
 import 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue;
 import org.apache.hadoop.yarn.util.resource.Resources;
 
-import java.util.Collections;

Review Comment:
   Avoid.



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java:
##
@@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception {
     Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize());
     rm1.close();
   }
+  @Test
+  public void testRefreshQueueWithOpenPreemption() throws Exception {
+    CapacitySchedulerConfiguration csConf
+        = new CapacitySchedulerConfiguration();
+    csConf.setQueues(CapacitySchedulerConfiguration.ROOT,
+        new String[] {"a"});
+    csConf.setCapacity("root.a", 100);
+    csConf.setMaximumCapacity("root.a", 100);
+    csConf.setUserLimitFactor("root.a", 100);
+
+    YarnConfiguration conf=new YarnConfiguration(csConf);
+    conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class,
+        ResourceScheduler.class);
+    RMNodeLabelsManager mgr=new NullRMNodeLabelsManager();
+    mgr.init(conf);
+    MockRM rm1 = new MockRM(csConf);
+    CapacityScheduler scheduler=(CapacityScheduler) rm1.getResourceScheduler();
+    PreemptionManager preemptionManager = scheduler.getPreemptionManager();;

Review Comment:
   ;;



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java:
##
@@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception {
     Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize());
     rm1.close();
   }
+  @Test
+  public void testRefreshQueueWithOpenPreemption() throws Exception {
+    CapacitySchedulerConfiguration csConf
+        = new CapacitySchedulerConfiguration();
+    csConf.setQueues(CapacitySchedulerConfiguration.ROOT,
+        new String[] {"a"});
+    csConf.setCapacity("root.a", 100);
+    csConf.setMaximumCapacity("root.a", 100);
+    csConf.setUserLimitFactor("root.a", 100);
+
+    YarnConfiguration conf=new YarnConfiguration(csConf);
+    conf.setClass(YarnConfiguration.RM_SCHEDULER, CapacityScheduler.class,
+        ResourceScheduler.class);
+    RMNodeLabelsManager mgr=new NullRMNodeLabelsManager();

Review Comment:
   Spaces



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java:
##
@@ -3026,4 +3030,69 @@ public void testReservedContainerLeakWhenMoveApplication() throws Exception {
     Assert.assertEquals(0, desQueue.getUsedResources().getMemorySize());
     rm1.close();
   }
+  @Test
+  public void testRefreshQueueWithOpenPreemption() throws Exception {
+    CapacitySchedulerConfiguration csConf
+        = new CapacitySchedulerConfiguration();
+    csConf.setQueues(CapacitySchedulerConfiguration.ROOT,
+        new String[] {"a"});
+    csConf.setCapacity("root.a", 100);
+    csConf.setMaximumCapacity("root.a", 100);
+

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-09-16 Thread ben yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605882#comment-17605882
 ] 

ben yang commented on YARN-11191:
-

Could you kindly take a look at the PR? Thanks. [~elgoiri]




[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-09-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605762#comment-17605762
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1249184324

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 57s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  40m 31s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m  5s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 16s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   2m 28s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 21s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 57s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  1s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m  1s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 54s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 49s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/2/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 8 new + 138 unchanged - 0 fixed = 146 total (was 138)  |
   | +1 :green_heart: |  mvnsite  |   0m 58s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   2m  4s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 44s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 103m  9s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 49s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 207m 54s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4726 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 11f01de99262 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 0547b173b9e01e11633f045eca0e75dc8c417c8f |
   | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/2/testReport/ |
   | Max. 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-09-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605761#comment-17605761
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1249182002

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 42s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 54s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m  6s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   2m  4s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 13s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 59s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   2m 22s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 46s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 56s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  7s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 57s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   0m 57s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 44s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/3/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 8 new + 138 unchanged - 0 fixed = 146 total (was 138)  |
   | +1 :green_heart: |  mvnsite  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 44s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 36s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  99m 12s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 201m 55s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4726 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 4e2b917b27d3 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 281c169aca12bd5e932295c6eb78e69ab9d28e86 |
   | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/3/testReport/ |
   | Max. 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17579571#comment-17579571
 ] 

ASF GitHub Bot commented on YARN-11191:
---

shellwjl commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1214712685

   Global Scheduler has multiple threads assigning containers normally.
   







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578851#comment-17578851
 ] 

ASF GitHub Bot commented on YARN-11191:
---

luoyuan3471 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212840341

   > @luoyuan3471 1. The key to the deadlock is that the refresh thread can't acquire the csqueue read lock. The read lock request is blocked by a write lock (see https://bugs.openjdk.org/browse/JDK-6893626), so I use tryLock to break that condition. The PreemptionManager lock will be released soon after the refresh thread gets the csqueue read lock. 2. Just preemption, but the global scheduler increases the chance.
   
   For 2, why does the Global Scheduler increase the chance of this deadlock case?







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578850#comment-17578850
 ] 

ASF GitHub Bot commented on YARN-11191:
---

luoyuan3471 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212838719

   > @luoyuan3471 yes! ReadLock.lock() will call tryAcquireShared(), which checks whether the first node in the waiting queue is shared. If it is not, lock() will block even if sharedCount(c) > 0. But readLock.tryLock() does not need that check; it can take the csqueue readLock directly. You can diff tryReadLock() and lock().
   
   Thanks for your explanation. I checked the code. You're right!







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578539#comment-17578539
 ] 

ASF GitHub Bot commented on YARN-11191:
---

yb12138 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212163886

   @luoyuan3471
   yes! ReadLock.lock() will call tryAcquireShared(), which checks whether the first node in the waiting queue is shared.
   If it is not, lock() will block even if sharedCount(c) > 0. But readLock.tryLock() does not need that check; it can take the csqueue readLock directly.
   You can diff tryReadLock() and lock(); it's an interesting design.
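   
   A minimal, self-contained sketch of that difference (my own illustration, not part of the patch): a reader holds the lock, a writer parks first in the queue, and a second reader's lock() would wait behind the writer while tryLock() barges in and succeeds.
   ```java
   import java.util.concurrent.locks.ReentrantReadWriteLock;
   
   public class TryLockVsLock {
     public static void main(String[] args) throws InterruptedException {
       ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
   
       lock.readLock().lock();                 // reader holds the lock
       Thread writer = new Thread(() -> {
         lock.writeLock().lock();              // parks: the lock is read-held
         lock.writeLock().unlock();
       });
       writer.start();
       Thread.sleep(500);                      // let the writer enqueue first
   
       // readLock().lock() here would park behind the queued writer
       // (tryAcquireShared sees an exclusive node at the head of the queue);
       // tryLock() ignores the queue and succeeds because the lock is read-held.
       System.out.println("barging tryLock: " + lock.readLock().tryLock()); // true
   
       lock.readLock().unlock();               // release both read holds
       lock.readLock().unlock();
       writer.join();
     }
   }
   ```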







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-11 Thread ben yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578521#comment-17578521
 ] 

ben yang commented on YARN-11191:
-

It's so cool! I will try to add a test like this! Thanks! [~luoyuan]




[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-11 Thread Yuan Luo (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578409#comment-17578409
 ] 

Yuan Luo commented on YARN-11191:
-

 
{code:java}
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ThreadLockTest {

  private static final DateFormat sdf =
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

  public static void beforeFixDeadLock() throws InterruptedException {
    System.out.println(
        "BeforeFixDeadLock test start, this will cause deadlock..");
    ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();
    ReentrantReadWriteLock.ReadLock queueReadLock = queueLock.readLock();
    ReentrantReadWriteLock.WriteLock queueWriteLock = queueLock.writeLock();

    ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();
    ReentrantReadWriteLock.ReadLock preemptionReadLock =
        preemptionLock.readLock();
    ReentrantReadWriteLock.WriteLock preemptionWriteLock =
        preemptionLock.writeLock();

    Thread schedulerThread = new Thread(() -> {
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread start!");
      // hold: csqueue.readLock
      queueReadLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread get queueReadLock!");
      try {
        Thread.sleep(1000 * 15);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }

      // require: PreemptionManager.readLock
      preemptionReadLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread get preemptionReadLock!");

      preemptionReadLock.unlock();
      queueReadLock.unlock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread finish!");
    });

    Thread refreshQueueThread = new Thread(() -> {
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread start!");
      // hold: PreemptionManager.writeLock
      preemptionWriteLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread get preemptionWriteLock!");

      try {
        Thread.sleep(1000 * 10);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }

      // require: csqueue.readLock
      queueReadLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread get queueReadLock!");

      queueReadLock.unlock();
      preemptionWriteLock.unlock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread finish!");
    });

    Thread otherThread = new Thread(() -> {
      // Make otherThread request the queue write lock after the scheduler
      // thread holds the queue read lock, and before the refresh thread
      // tries to get the queue read lock.
      try {
        Thread.sleep(1000 * 5);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
      System.out.println(
          "current time: " + sdf.format(new Date()) + ", otherThread start!");
      queueWriteLock.lock();

      System.out.println("current time: " + sdf.format(new Date()) +
          ", otherThread get queueWriteLock!");
      queueWriteLock.unlock();
      System.out.println(
          "current time: " + sdf.format(new Date()) + ", otherThread finish!");
    });

    schedulerThread.start();
    refreshQueueThread.start();
    otherThread.start();

    refreshQueueThread.join();
    schedulerThread.join();
    otherThread.join();
  }

  public static void afterFixDeadLock() throws InterruptedException {
    System.out.println("AfterFixDeadLock test start..");
    ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();
    ReentrantReadWriteLock.ReadLock queueReadLock = queueLock.readLock();
    ReentrantReadWriteLock.WriteLock queueWriteLock = queueLock.writeLock();

    ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();
    ReentrantReadWriteLock.ReadLock preemptionReadLock =
        preemptionLock.readLock();
    ReentrantReadWriteLock.WriteLock preemptionWriteLock =
        preemptionLock.writeLock();

    Thread schedulerThread = new Thread(() -> {
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread start!");
      // hold: csqueue.readLock
      queueReadLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread get queueReadLock!");
      try {
        Thread.sleep(1000 * 15);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }

      // require: PreemptionManager.readLock
      preemptionReadLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread get preemptionReadLock!");

      preemptionReadLock.unlock();
      queueReadLock.unlock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", schedulerThread finish!");
    });

    // The archived message is truncated from this point on; the remainder
    // below is reconstructed to mirror beforeFixDeadLock, with the refresh
    // thread acquiring the queue read lock via tryLock as in the proposed patch.
    Thread refreshQueueThread = new Thread(() -> {
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread start!");
      // hold: PreemptionManager.writeLock
      preemptionWriteLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread get preemptionWriteLock!");

      try {
        Thread.sleep(1000 * 10);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }

      // require: csqueue.readLock, acquired with tryLock so the request
      // does not queue behind the pending write lock
      while (!queueReadLock.tryLock()) {
        LockSupport.parkNanos(1);
      }
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread get queueReadLock!");

      queueReadLock.unlock();
      preemptionWriteLock.unlock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", refreshQueueThread finish!");
    });

    Thread otherThread = new Thread(() -> {
      try {
        Thread.sleep(1000 * 5);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
      System.out.println(
          "current time: " + sdf.format(new Date()) + ", otherThread start!");
      queueWriteLock.lock();
      System.out.println("current time: " + sdf.format(new Date()) +
          ", otherThread get queueWriteLock!");
      queueWriteLock.unlock();
      System.out.println(
          "current time: " + sdf.format(new Date()) + ", otherThread finish!");
    });

    schedulerThread.start();
    refreshQueueThread.start();
    otherThread.start();

    refreshQueueThread.join();
    schedulerThread.join();
    otherThread.join();
  }

  public static void main(String[] args) throws InterruptedException {
    // beforeFixDeadLock() never returns: the three threads deadlock.
    // afterFixDeadLock() completes because tryLock barges past the queued writer.
    afterFixDeadLock();
  }
}
{code}

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578355#comment-17578355
 ] 

ASF GitHub Bot commented on YARN-11191:
---

luoyuan3471 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1211760018

   > @luoyuan3471 
   > ![Untitled](https://user-images.githubusercontent.com/29743168/183837904-187ebe71-d5a6-474c-948d-1160f0d3407e.png)
   > You can see this image. This problem happens when the refresh thread is
   > calling PreemptionManager.refreshQueue while the schedule thread is calling
   > AbstractCSQueue.getTotalKillableResource. At this point the refresh thread
   > requires csqueue.readLock, but that read lock is blocked by the schedule
   > thread and the "other thread" (https://bugs.openjdk.org/browse/JDK-6893626).
   > And the schedule thread requires PreemptionManager.readLock, which is
   > blocked by the write lock the refresh thread holds. So I use tryLock to let
   > the refresh thread get csqueue.readLock. Once the refresh thread completes
   > PreemptionManager.refreshQueue, the schedule thread can get
   > PreemptionManager.readLock and then allocate new containers.
   
   ---
   Do you mean that readLock.tryLock() will let the read lock go first, even
   though a write lock request is already at the head of the waiting queue? @yb12138 
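That appears to be exactly the documented behavior: the javadoc of ReentrantReadWriteLock.ReadLock.tryLock() calls it "barging", and the untimed tryLock() acquires the read lock immediately whenever the lock is only read-held, regardless of queued waiters. A minimal sketch (illustrative plain-Java code, not Hadoop's):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TryLockBarges {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.readLock().lock();                  // another reader holds the lock

    Thread writer = new Thread(() -> lock.writeLock().lock());
    writer.start();
    TimeUnit.MILLISECONDS.sleep(200);        // the writer is now parked in the queue

    Thread reader = new Thread(() -> {
      // tryLock() ignores the wait queue: it succeeds because only readers
      // currently hold the lock, even with a write request queued ahead.
      boolean acquired = lock.readLock().tryLock();
      System.out.println("tryLock acquired: " + acquired);   // prints true
      if (acquired) {
        lock.readLock().unlock();
      }
    });
    reader.start();
    reader.join();

    lock.readLock().unlock();                // let the queued writer finish
    writer.join();
  }
}
{code}
This barging is what lets the refresh thread slip past the write request the "other thread" has queued, breaking one edge of the deadlock cycle.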










[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577886#comment-17577886
 ] 

ASF GitHub Bot commented on YARN-11191:
---

hadoop-yetus commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210402542

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 33s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m  8s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m  5s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 13s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  7s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   2m 19s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  24m 43s |  |  branch has no errors when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 56s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  6s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 55s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   0m 55s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 46s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 95 unchanged - 0 fixed = 97 total (was 95)  |
   | +1 :green_heart: |  mvnsite  |   1m  0s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   2m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  24m 45s |  |  patch has no errors when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 102m 29s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 211m 50s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4726/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4726 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f7c0894bff81 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 65650a20ef43e8eb3311c63e3a6b0585e9e2d95a |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private 

[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577834#comment-17577834
 ] 

ASF GitHub Bot commented on YARN-11191:
---

yb12138 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210271008

   @luoyuan3471 
   ![Untitled](https://user-images.githubusercontent.com/29743168/183837904-187ebe71-d5a6-474c-948d-1160f0d3407e.png)
   You can see this image.
   This problem occurs when the refresh thread is calling
   PreemptionManager.refreshQueue while the schedule thread is calling
   AbstractCSQueue.getTotalKillableResource. At this point the refresh thread
   requires csqueue.readLock, but that read lock is blocked by the schedule
   thread and the "other thread" (https://bugs.openjdk.org/browse/JDK-6893626).
   And the schedule thread requires PreemptionManager.readLock, which is
   blocked by the write lock the refresh thread holds. So I use tryLock to let
   the refresh thread get csqueue.readLock. Once the refresh thread completes
   PreemptionManager.refreshQueue, the schedule thread can get
   PreemptionManager.readLock and then allocate new containers.
   










[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577807#comment-17577807
 ] 

ASF GitHub Bot commented on YARN-11191:
---

luoyuan3471 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210237613

   CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock
   require: csqueue.readLock
   
   CapacityScheduler.schedule: hold: csqueue.readLock
   require: PreemptionManager.readLock
   
   other thread (completeContainer, releaseResource, etc.): require: csqueue.writeLock 
   
   @yb12138 
   The schedule thread holds csqueue.readLock and is blocked on
   PreemptionManager.readLock, and PreemptionManager.writeLock is held by the
   refreshQueue thread, so it seems the refreshQueue thread has no chance to
   get csqueue.readLock.
   
   Very sorry, I'm still a little confused on this point. Can you explain more
   about it? Thank you!
   
   
   










[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577800#comment-17577800
 ] 

ASF GitHub Bot commented on YARN-11191:
---

yb12138 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210219429

   @luoyuan3471 
   1. The key to the deadlock is that the refresh thread can't acquire the
   csqueue read lock, because its read lock request is blocked by a queued
   write lock request (see https://bugs.openjdk.org/browse/JDK-6893626), so I
   use tryLock to break the conditions for the deadlock. The PreemptionManager
   lock will be released soon after the refresh thread gets the csqueue read lock.
   2. Just preemption, but the global scheduler increases the chance of deadlock.










[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread Yuan Luo (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577784#comment-17577784
 ] 

Yuan Luo commented on YARN-11191:
-

I want to ask two questions:
{code:java}
+  @Override
+  public List<CSQueue> getChildQueuesByTryLock() {
+    try {
+      while (!readLock.tryLock()) {
+        LockSupport.parkNanos(1);
+      }
+      return new ArrayList<>(childQueues);
+    } finally {
+      readLock.unlock();
+    }
+  }
{code}
1. Though you use tryLock and park, so the refresh queue thread switches to a
blocked state, this thread still holds the PreemptionManager lock, so the
scheduler thread still can't allocate new containers. Is that right?

2. Is this issue related to the global scheduler, or just the preemption
function?

Looking forward to your reply, thanks!







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577780#comment-17577780
 ] 

ASF GitHub Bot commented on YARN-11191:
---

luoyuan3471 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1210199622

   @yb12138  I want to ask two questions:
   +  @Override
   +  public List<CSQueue> getChildQueuesByTryLock() {
   +    try {
   +      while (!readLock.tryLock()) {
   +        LockSupport.parkNanos(1);
   +      }
   +      return new ArrayList<>(childQueues);
   +    } finally {
   +      readLock.unlock();
   +    }
   +  }
   
   1. Though you use tryLock and park, so the refresh queue thread switches to
   a blocked state, this thread still holds the PreemptionManager lock, so the
   scheduler thread still can't allocate new containers. Is that right?
   
   2. Is this issue related to the global scheduler, or just the preemption
   function?
   
   Looking forward to your reply, thanks!
   
   










[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-10 Thread ben yang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1757#comment-1757
 ] 

ben yang commented on YARN-11191:
-

It is difficult to add a test, because this problem happens under concurrency
and does not always reproduce.
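
One way to cover it deterministically despite the nondeterminism is to replay the three acquisition orders on plain ReentrantReadWriteLocks under a JUnit timeout, so a regression shows up as a hung test instead of a hung cluster. A hedged sketch (illustrative names only, not the test eventually committed; it mirrors the ThreadLockTest posted above, with the refresh side using the patch's tryLock loop):

{code:java}
import static org.junit.Assert.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.junit.Test;

public class TestRefreshQueueDeadLock {

  private final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();

  @Test(timeout = 30000)
  public void testRefreshDoesNotDeadlockWithScheduler() throws Exception {
    CountDownLatch done = new CountDownLatch(3);

    Thread scheduler = new Thread(() -> {
      queueLock.readLock().lock();               // schedule: hold csqueue.readLock
      sleepQuietly(3000);                        // let the writer queue up first
      preemptionLock.readLock().lock();          // require PreemptionManager.readLock
      preemptionLock.readLock().unlock();
      queueLock.readLock().unlock();
      done.countDown();
    });

    Thread refresher = new Thread(() -> {
      preemptionLock.writeLock().lock();         // refresh: hold PreemptionManager.writeLock
      sleepQuietly(2000);
      while (!queueLock.readLock().tryLock()) {  // the fix: barge past the queued writer
        LockSupport.parkNanos(1);
      }
      queueLock.readLock().unlock();
      preemptionLock.writeLock().unlock();
      done.countDown();
    });

    Thread other = new Thread(() -> {
      sleepQuietly(1000);                        // queue a writer behind the scheduler,
      queueLock.writeLock().lock();              // e.g. completeContainer
      queueLock.writeLock().unlock();
      done.countDown();
    });

    scheduler.start();
    refresher.start();
    other.start();
    assertTrue("threads deadlocked", done.await(20, TimeUnit.SECONDS));
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}
With the fix reverted (a plain queueLock.readLock().lock() in the refresher), the latch never reaches zero and the timeout fails the test.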







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1754#comment-1754
 ] 

ASF GitHub Bot commented on YARN-11191:
---

yb12138 opened a new pull request, #4726:
URL: https://github.com/apache/hadoop/pull/4726

   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   










[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-09 Thread Íñigo Goiri (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577643#comment-17577643
 ] 

Íñigo Goiri commented on YARN-11191:


Do you mind preparing a PR?
Is it possible to add a test?







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-09 Thread Ryan Wu (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577260#comment-17577260
 ] 

Ryan Wu commented on YARN-11191:


We run refreshQueue periodically, and the problem does not always reproduce.
But once it happens, the cluster can no longer assign resources. [~aajisaka] 







[jira] [Commented] (YARN-11191) Global Scheduler refreshQueue cause deadLock

2022-08-09 Thread Yuan Luo (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577178#comment-17577178
 ] 

Yuan Luo commented on YARN-11191:
-

We met the same issue when refreshing YARN queues; the refresh thread got
stuck in the call below:

{code:java}
preemptionManager.refreshQueues(null, this.getRootQueue()){code}
Can you help take a look at this issue? [~elgoiri] [~aajisaka] 



