[jira] [Commented] (YARN-7953) [GQ] Data structures for federation global queues calculations
[ https://issues.apache.org/jira/browse/YARN-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808107#comment-17808107 ]

ASF GitHub Bot commented on YARN-7953:
--------------------------------------

hadoop-yetus commented on PR #6361:
URL: https://github.com/apache/hadoop/pull/6361#issuecomment-1898116112

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 49s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +0 :ok: | jsonlint | 0m 0s | | jsonlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 5 new or modified test files. |
|||| _ trunk Compile Tests _ |
| -1 :x: | mvninstall | 46m 29s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6361/6/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. |
| +1 :green_heart: | compile | 0m 25s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 22s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 25s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 28s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 22s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 44s | | trunk passed |
| +1 :green_heart: | shadedclient | 37m 34s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 19s | | the patch passed |
| +1 :green_heart: | compile | 0m 18s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 18s | | the patch passed |
| +1 :green_heart: | compile | 0m 17s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 17s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 13s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 19s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 18s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 44s | | the patch passed |
| +1 :green_heart: | shadedclient | 37m 44s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 1m 2s | | hadoop-yarn-server-globalpolicygenerator in the patch passed. |
| +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
| | | | 134m 43s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6361/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6361 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient codespell detsecrets xmllint spotbugs checkstyle jsonlint |
| uname | Linux 0e9655687bf5 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 3898b5ed0452452aed9f4fa0d7db9d9aaf8d390e |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6361/6/testReport/ |
| Max. process+thread count | 534 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-globalpolicygenerator U: hadoop-yarn-
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808255#comment-17808255 ]

ASF GitHub Bot commented on YARN-11639:
---------------------------------------

hadoop-yetus commented on PR #6455:
URL: https://github.com/apache/hadoop/pull/6455#issuecomment-1898626198

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 13m 0s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 42m 10s | | trunk passed |
| +1 :green_heart: | compile | 1m 0s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 53s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 53s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 57s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 46s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 33m 13s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 46s | | the patch passed |
| +1 :green_heart: | compile | 0m 52s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 52s | | the patch passed |
| +1 :green_heart: | compile | 0m 44s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 44s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 40s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 48s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 43s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 40s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 56s | | the patch passed |
| +1 :green_heart: | shadedclient | 33m 14s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 101m 15s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
| | | | 239m 50s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6455/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6455 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 950df887c33c 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / f38bf0fafb2a83efaf4be83e5461e146c43d0201 |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6455/4/testReport/ |
| Max. process+thread count | 994 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6455/4/console |
| ve
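[Editor's note] YARN-11639 concerns a ConcurrentModificationException in PriorityUtilizationQueueOrderingPolicy. The following is a minimal, hypothetical Java sketch of the failure mode and a common mitigation — `CmeDemo` and its methods are illustrative names, not the actual scheduler code: a fail-fast iterator throws when the underlying list is structurally modified mid-iteration, and iterating a defensive snapshot avoids the problem.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Structural modification during a for-each loop trips the fail-fast
    // iterator's modCount check on the next call to next().
    public static boolean triggersCme() {
        List<Integer> queues = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer q : queues) {
                if (q == 1) {
                    queues.remove(q); // mutates the list mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Common mitigation: iterate over an immutable snapshot, so concurrent
    // (or reentrant) mutation of the live list cannot break the traversal.
    public static int sumOverSnapshot(List<Integer> live) {
        List<Integer> snapshot = new ArrayList<>(live); // copy once
        int sum = 0;
        for (int v : snapshot) {
            sum += v;
        }
        return sum;
    }
}
```

The snapshot copy trades a small allocation per traversal for safety; copy-on-write collections or explicit locking are alternative fixes, depending on the write frequency.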
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274 ]

Adam Binford commented on YARN-4771:
------------------------------------

{quote}However it may have issues with very long-running apps that churn a lot of containers, since the container state won't be released until the application completes.
{quote}
{quote}This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not NM memory usage. We have run-forever Spark Structured Streaming applications that use dynamic allocation to grab resources when they need them. After restarting our Node Managers, the recovery process can end up DoS'ing our Resource Manager, especially if we restart a large number at once, as there can be thousands of tracked "completed" containers. We're also seeing the servers running the Node Managers sometimes die during the recovery process. It seems like there are multiple issues here, but they mostly stem from keeping all containers for active applications in the state store for all time:
* As part of the recovery process, the NM seems to send a "container released" message to the RM, which the RM just logs as "Thanks, I don't know what this container is though". This is what can DoS the RM.
* On the NM itself, part of the recovery process seems to actually try to allocate resources for completed containers, causing the server to run out of memory. We've only seen this a couple of times, so we're still trying to track down exactly what's happening. Our metrics show spikes of up to 100x the resources the NM actually has (i.e. the NM reports terabytes of memory allocated, but the node only has ~300 GiB of memory).

The metrics might be a harmless side effect of the recovery process, but the nodes dying is what's concerning. I'm still trying to track down all the moving pieces here, as traversing the event-passing system isn't easy to follow. So far I've only tracked down why containers are never removed from the state store until an application finishes. We use rolling log aggregation, so I'm currently trying to see if we can use that mechanism to release containers from the state store once their logs have been aggregated. But this would also be a non-issue if I could figure out why the other issues are happening and how to prevent them.

> Some containers can be skipped during log aggregation after NM restart
> ----------------------------------------------------------------------
>
>                 Key: YARN-4771
>                 URL: https://issues.apache.org/jira/browse/YARN-4771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.10.0, 3.2.1, 3.1.3
>            Reporter: Jason Darrell Lowe
>            Assignee: Jim Brennan
>            Priority: Major
>             Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
>         Attachments: YARN-4771.001.patch, YARN-4771.002.patch, YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving nodemanager restart if the following events occur:
> # Container completes more than yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the restart
> # At least one other container completes after the above container and before the restart

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
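[Editor's note] The issue description above hinges on the yarn.nodemanager.duration-to-track-stopped-containers property. A minimal yarn-site.xml sketch of where that knob lives; the 10-minute value is purely illustrative, not a recommendation from this thread.

```xml
<!-- Sketch of a yarn-site.xml fragment. The value is in milliseconds;
     600000 (10 minutes) here is only an example. -->
<property>
  <name>yarn.nodemanager.duration-to-track-stopped-containers</name>
  <value>600000</value>
</property>
```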
[jira] [Comment Edited] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274 ]

Adam Binford edited comment on YARN-4771 at 1/18/24 3:54 PM:
-------------------------------------------------------------

{quote}However it may have issues with very long-running apps that churn a lot of containers, since the container state won't be released until the application completes.
{quote}
{quote}This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not NM memory usage. We have run-forever Spark Structured Streaming applications that use dynamic allocation to grab resources when they need them. After restarting our Node Managers, the recovery process can end up DoS'ing our Resource Manager, especially if we restart a large number at once, as there can be thousands of tracked "completed" containers. We're also seeing the servers running the Node Managers sometimes die during the recovery process. It seems like there are multiple issues here, but they mostly stem from keeping all containers for active applications in the state store for all time:
* As part of the recovery process, the NM seems to send a "container released" message to the RM, which the RM just logs as "Thanks, I don't know what this container is though". This is what can DoS the RM.
* On the NM itself, part of the recovery process seems to actually try to allocate resources for completed containers, causing the server to run out of memory. We've only seen this a couple of times, so we're still trying to track down exactly what's happening. Our metrics show spikes of up to 100x the resources the NM actually has (i.e. the NM reports terabytes of memory allocated, but the node only has ~300 GiB of memory).

The metrics might be a harmless side effect of the recovery process, but the nodes dying is what's concerning. I'm still trying to track down all the moving pieces here, as traversing the event-passing system isn't easy to follow. So far I've only tracked down why containers are never removed from the state store until an application finishes. We use rolling log aggregation, so I'm currently trying to see if we can use that mechanism to release containers from the state store once their logs have been aggregated. But this would also be a non-issue if I could figure out why the other issues are happening and how to prevent them.
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808332#comment-17808332 ]

ASF GitHub Bot commented on YARN-11622:
---------------------------------------

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1898963967

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 4m 20s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ branch-3.3 Compile Tests _ |
| +1 :green_heart: | mvninstall | 33m 44s | | branch-3.3 passed |
| +1 :green_heart: | compile | 0m 35s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 0m 28s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 0m 40s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 0m 30s | | branch-3.3 passed |
| +1 :green_heart: | spotbugs | 1m 14s | | branch-3.3 passed |
| +1 :green_heart: | shadedclient | 21m 17s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 29s | | the patch passed |
| +1 :green_heart: | javac | 0m 29s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 19s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67) |
| +1 :green_heart: | mvnsite | 0m 29s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 22s | | the patch passed |
| -1 :x: | spotbugs | 1m 15s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | shadedclient | 21m 33s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 78m 12s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 :green_heart: | asflicense | 0m 23s | | The patch does not generate ASF License warnings. |
| | | | 167m 20s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| | Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:[line 1131] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMHA |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux f16f271e28e6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / 5ae791898e1e8d053e7aebefd0532ff533b09087 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job
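[Editor's note] The SpotBugs finding above flags a discarded return value from ExecutorService.submit(Callable). A minimal, hypothetical sketch of the pattern and its fix (SubmitResultDemo is an illustrative name, not the ResourceManager code): when the Future from submit() is thrown away, any exception raised inside the Callable is silently lost; keeping the Future lets the caller observe failures via get().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class SubmitResultDemo {
    // The anti-pattern SpotBugs flags: the Future returned by submit() is
    // discarded, so an exception thrown by the task vanishes unobserved.
    static void fireAndForget(ExecutorService pool, Callable<Void> task) {
        pool.submit(task); // RV_RETURN_VALUE_IGNORED_BAD_PRACTICE
    }

    // One way to satisfy the checker: return the Future so the caller can
    // call get(), which rethrows task failures wrapped in ExecutionException.
    static Future<Void> fireAndTrack(ExecutorService pool, Callable<Void> task) {
        return pool.submit(task);
    }
}
```

An alternative that also silences the warning is submitting a Runnable that handles its own exceptions internally, so there is genuinely nothing to observe in the Future.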
[jira] [Created] (YARN-11648) CapacityScheduler does not activate applications when resources are released from another Leaf Queue
Brian Goerlitz created YARN-11648:
-------------------------------------

             Summary: CapacityScheduler does not activate applications when resources are released from another Leaf Queue
                 Key: YARN-11648
                 URL: https://issues.apache.org/jira/browse/YARN-11648
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Brian Goerlitz

Create a queue with low minimum capacity and high maximum capacity. If multiple apps are submitted to the queue such that the queue's Max AM resource limit is exceeded while other cluster resources are consumed by different queues, these apps will not be considered for activation when cluster resources from the other queues are freed. As the AM limit is calculated based on available resources for the queue, these apps should be activated.
[jira] [Commented] (YARN-11648) CapacityScheduler does not activate applications when resources are released from another Leaf Queue
[ https://issues.apache.org/jira/browse/YARN-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808391#comment-17808391 ]

Brian Goerlitz commented on YARN-11648:
---------------------------------------

This scenario is relatively rare, as app activation in a queue is currently triggered by 3 conditions:
* reconfiguration of the queue or cluster resources
* application submission in the queue
* application completion in the queue

Certain use cases that are sensitive to deadlocks can consistently encounter this problem. For example, if the low-capacity queue is used for scheduled Oozie jobs while the rest of the cluster is periodically consumed by other higher-priority workloads, the Oozie AM will wait for its child job to complete, while that child job may be stuck in a pending state until the next scheduled job is submitted or a user notices the issue. A similar situation can be reproduced with DistributedShell by using all but 2 containers' worth of resources in a different queue, then submitting the following to the low-capacity queue ("queueA").
{noformat}
yarn jar hadoop-yarn-applications-distributedshell.jar -jar hadoop-yarn-applications-distributedshell.jar -shell_command "yarn jar /path/to/hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.queueA 10 100" -num_containers 1 -queue root.queueA
{noformat}
The DistributedShell job will run and submit the MR job. The MR job will be pending as the queue AM limit is reached. The situation remains the same even after the workloads in the other queues have completed.

> CapacityScheduler does not activate applications when resources are released from another Leaf Queue
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11648
>                 URL: https://issues.apache.org/jira/browse/YARN-11648
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Brian Goerlitz
>            Priority: Major
>
> Create a queue with low minimum capacity and high maximum capacity. If multiple apps are submitted to the queue such that the Queue's Max AM resource limit is exceeded while other cluster resources are consumed by different queues, these apps will not be considered for activation when cluster resources from the other queues are freed. As the AM limit is calculated based on available resources for the queue, these apps should be activated.
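[Editor's note] The scenario above needs a queue whose guaranteed capacity is far below its maximum. A hedged capacity-scheduler.xml sketch of such a configuration; the queue name "queueA" follows the reproduction command, and all percentages are illustrative examples, not values taken from the report.

```xml
<!-- Illustrative capacity-scheduler.xml fragment; values are examples only. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,queueA</value>
</property>
<property>
  <!-- low guaranteed (minimum) capacity -->
  <name>yarn.scheduler.capacity.root.queueA.capacity</name>
  <value>5</value>
</property>
<property>
  <!-- high maximum capacity, so the queue can expand into idle resources -->
  <name>yarn.scheduler.capacity.root.queueA.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <!-- the Max AM resource limit that gates application activation is
       derived from this fraction of the queue's effective capacity -->
  <name>yarn.scheduler.capacity.root.queueA.maximum-am-resource-percent</name>
  <value>0.1</value>
</property>
```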
[jira] [Commented] (YARN-11607) TestTimelineAuthFilterForV2 fails intermittently
[ https://issues.apache.org/jira/browse/YARN-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808473#comment-17808473 ]

ASF GitHub Bot commented on YARN-11607:
---------------------------------------

slfan1989 commented on PR #6459:
URL: https://github.com/apache/hadoop/pull/6459#issuecomment-1899819100

> Tried running the testcase 100 times in a row but the testcase failure cannot be reproduced. The sleep interval between retries is relatively short; I believe the execution time of the asynchronous call might be insufficient.

@susheelgupta7 Thank you for your contribution! Although this may not be a complete solution, I couldn't think of a better way either. Hopefully, there will be a better solution in the future.

> TestTimelineAuthFilterForV2 fails intermittently
> ------------------------------------------------
>
>                 Key: YARN-11607
>                 URL: https://issues.apache.org/jira/browse/YARN-11607
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Assignee: Susheel Gupta
>            Priority: Major
>              Labels: pull-request-available
>
> Ref: https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1398/testReport/junit/org.apache.hadoop.yarn.server.timelineservice.security/TestTimelineAuthFilterForV2/testPutTimelineEntities_boolean__boolean__3_/
> {noformat}
> org.opentest4j.AssertionFailedError: expected: <2> but was: <1>
>     at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
>     at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>     at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527)
>     at org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishAndVerifyEntity(TestTimelineAuthFilterForV2.java:324)
>     at org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishWithRetries(TestTimelineAuthFilterForV2.java:337)
>     at org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:383)
> {noformat}
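[Editor's note] The discussion above attributes the flakiness to a retry loop whose sleep is too short for a slow asynchronous publish. A minimal, hypothetical retry helper (not the actual TestTimelineAuthFilterForV2 code; `RetryUtil.waitFor` and its parameters are illustrative) showing a bounded exponential backoff, which grants slow async work more total time without one long fixed sleep:

```java
import java.util.function.BooleanSupplier;

public class RetryUtil {
    // Polls `condition` up to `maxAttempts` times, doubling the sleep after
    // each miss (capped at 5s). Returns true as soon as the condition holds,
    // with one final check after the last sleep.
    public static boolean waitFor(BooleanSupplier condition,
                                  long initialSleepMs,
                                  int maxAttempts) throws InterruptedException {
        long sleepMs = initialSleepMs;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(sleepMs);
            sleepMs = Math.min(sleepMs * 2, 5000);
        }
        return condition.getAsBoolean();
    }
}
```

A condition-polling helper like this keeps the happy path fast (it returns on the first successful check) while giving the unlucky runs the extra time a fixed short sleep denies them.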