[jira] [Commented] (YARN-7953) [GQ] Data structures for federation global queues calculations
[ https://issues.apache.org/jira/browse/YARN-7953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808107#comment-17808107 ]

ASF GitHub Bot commented on YARN-7953:
--------------------------------------

hadoop-yetus commented on PR #6361:
URL: https://github.com/apache/hadoop/pull/6361#issuecomment-1898116112

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 49s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +0 :ok: | jsonlint | 0m 0s | | jsonlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 5 new or modified test files. |
|||| _ trunk Compile Tests _ |
| -1 :x: | mvninstall | 46m 29s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6361/6/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. |
| +1 :green_heart: | compile | 0m 25s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 22s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 25s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 28s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 22s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 44s | | trunk passed |
| +1 :green_heart: | shadedclient | 37m 34s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 19s | | the patch passed |
| +1 :green_heart: | compile | 0m 18s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 18s | | the patch passed |
| +1 :green_heart: | compile | 0m 17s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 17s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 13s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 19s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 19s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 18s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 44s | | the patch passed |
| +1 :green_heart: | shadedclient | 37m 44s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 1m 2s | | hadoop-yarn-server-globalpolicygenerator in the patch passed. |
| +1 :green_heart: | asflicense | 0m 33s | | The patch does not generate ASF License warnings. |
| | | | 134m 43s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6361/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6361 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient codespell detsecrets xmllint spotbugs checkstyle jsonlint |
| uname | Linux 0e9655687bf5 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 3898b5ed0452452aed9f4fa0d7db9d9aaf8d390e |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6361/6/testReport/ |
| Max. process+thread count | 534 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-globalpolicygenerator U: hadoop-yarn-
[jira] [Commented] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
[ https://issues.apache.org/jira/browse/YARN-11639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808255#comment-17808255 ]

ASF GitHub Bot commented on YARN-11639:
---------------------------------------

hadoop-yetus commented on PR #6455:
URL: https://github.com/apache/hadoop/pull/6455#issuecomment-1898626198

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 13m 0s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 42m 10s | | trunk passed |
| +1 :green_heart: | compile | 1m 0s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 53s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 53s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 57s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 46s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 33m 13s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 46s | | the patch passed |
| +1 :green_heart: | compile | 0m 52s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 52s | | the patch passed |
| +1 :green_heart: | compile | 0m 44s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 44s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 40s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 48s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 43s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 40s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 56s | | the patch passed |
| +1 :green_heart: | shadedclient | 33m 14s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 101m 15s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
| | | | 239m 50s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6455/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6455 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 950df887c33c 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / f38bf0fafb2a83efaf4be83e5461e146c43d0201 |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6455/4/testReport/ |
| Max. process+thread count | 994 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6455/4/console |
| ve
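[Editor's note] YARN-11639 concerns a ConcurrentModificationException in PriorityUtilizationQueueOrderingPolicy. The following is a minimal, hypothetical Java sketch of the failure mode and a common mitigation — `CmeDemo` and its methods are illustrative names, not the actual scheduler code: a fail-fast iterator throws when the underlying list is structurally modified mid-iteration, and iterating a defensive snapshot avoids the problem.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Structural modification during a for-each loop trips the fail-fast
    // iterator's modCount check on the next call to next().
    public static boolean triggersCme() {
        List<Integer> queues = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer q : queues) {
                if (q == 1) {
                    queues.remove(q); // mutates the list mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Common mitigation: iterate over an immutable snapshot, so concurrent
    // (or reentrant) mutation of the live list cannot break the traversal.
    public static int sumOverSnapshot(List<Integer> live) {
        List<Integer> snapshot = new ArrayList<>(live); // copy once
        int sum = 0;
        for (int v : snapshot) {
            sum += v;
        }
        return sum;
    }
}
```

The snapshot copy trades a small allocation per traversal for safety; copy-on-write collections or explicit locking are alternative fixes, depending on the write frequency.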
[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274 ]

Adam Binford commented on YARN-4771:
------------------------------------

{quote}However it may have issues with very long-running apps that churn a lot of containers, since the container state won't be released until the application completes.
{quote}
{quote}This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not NM memory usage. We have run-forever Spark Structured Streaming applications that use dynamic allocation to grab resources when they need them. After restarting our Node Managers, the recovery process can end up DoS'ing our Resource Manager, especially if we restart a large number at once, as there can be thousands of tracked "completed" containers. We're also seeing the servers running the Node Managers sometimes die during the recovery process. It seems like there are multiple issues here, but they mostly stem from keeping all containers for active applications in the state store for all time:
* As part of the recovery process, the NM seems to send a "container released" message to the RM, which the RM just logs as "Thanks, I don't know what this container is though". This is what can DoS the RM.
* On the NM itself, part of the recovery process seems to actually try to allocate resources for completed containers, causing the server to run out of memory. We've only seen this a couple of times, so we're still trying to track down exactly what's happening. Our metrics show spikes of up to 100x the resources the NM actually has (i.e. the NM reports terabytes of memory allocated, but the node only has ~300 GiB of memory).

The metrics might be a harmless side effect of the recovery process, but the nodes dying is what's concerning. I'm still trying to track down all the moving pieces here, as traversing the event-passing system isn't easy to follow. So far I've only tracked down why containers are never removed from the state store until an application finishes. We use rolling log aggregation, so I'm currently trying to see if we can use that mechanism to release containers from the state store once their logs have been aggregated. But this would also be a non-issue if I could figure out why the other issues are happening and how to prevent them.

> Some containers can be skipped during log aggregation after NM restart
> ----------------------------------------------------------------------
>
>                 Key: YARN-4771
>                 URL: https://issues.apache.org/jira/browse/YARN-4771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.10.0, 3.2.1, 3.1.3
>            Reporter: Jason Darrell Lowe
>            Assignee: Jim Brennan
>            Priority: Major
>             Fix For: 3.2.2, 3.1.4, 2.10.1, 3.4.0, 3.3.1
>
>         Attachments: YARN-4771.001.patch, YARN-4771.002.patch, YARN-4771.003.patch
>
>
> A container can be skipped during log aggregation after a work-preserving nodemanager restart if the following events occur:
> # Container completes more than yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the restart
> # At least one other container completes after the above container and before the restart

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
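[Editor's note] The issue description above hinges on the yarn.nodemanager.duration-to-track-stopped-containers property. A minimal yarn-site.xml sketch of where that knob lives; the 10-minute value is purely illustrative, not a recommendation from this thread.

```xml
<!-- Sketch of a yarn-site.xml fragment. The value is in milliseconds;
     600000 (10 minutes) here is only an example. -->
<property>
  <name>yarn.nodemanager.duration-to-track-stopped-containers</name>
  <value>600000</value>
</property>
```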
[jira] [Comment Edited] (YARN-4771) Some containers can be skipped during log aggregation after NM restart
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808274#comment-17808274 ]

Adam Binford edited comment on YARN-4771 at 1/18/24 3:54 PM:
-------------------------------------------------------------

{quote}However it may have issues with very long-running apps that churn a lot of containers, since the container state won't be released until the application completes.
{quote}
{quote}This is going to be problematic, impacting NM memory usage.
{quote}
We just started encountering this issue, though not NM memory usage. We have run-forever Spark Structured Streaming applications that use dynamic allocation to grab resources when they need them. After restarting our Node Managers, the recovery process can end up DoS'ing our Resource Manager, especially if we restart a large number at once, as there can be thousands of tracked "completed" containers. We're also seeing the servers running the Node Managers sometimes die during the recovery process. It seems like there are multiple issues here, but they mostly stem from keeping all containers for active applications in the state store for all time:
* As part of the recovery process, the NM seems to send a "container released" message to the RM, which the RM just logs as "Thanks, I don't know what this container is though". This is what can DoS the RM.
* On the NM itself, part of the recovery process seems to actually try to allocate resources for completed containers, causing the server to run out of memory. We've only seen this a couple of times, so we're still trying to track down exactly what's happening. Our metrics show spikes of up to 100x the resources the NM actually has (i.e. the NM reports terabytes of memory allocated, but the node only has ~300 GiB of memory).

The metrics might be a harmless side effect of the recovery process, but the nodes dying is what's concerning. I'm still trying to track down all the moving pieces here, as traversing the event-passing system isn't easy to follow. So far I've only tracked down why containers are never removed from the state store until an application finishes. We use rolling log aggregation, so I'm currently trying to see if we can use that mechanism to release containers from the state store once their logs have been aggregated. But this would also be a non-issue if I could figure out why the other issues are happening and how to prevent them.
[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception
[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808332#comment-17808332 ]

ASF GitHub Bot commented on YARN-11622:
---------------------------------------

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1898963967

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 4m 20s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ branch-3.3 Compile Tests _ |
| +1 :green_heart: | mvninstall | 33m 44s | | branch-3.3 passed |
| +1 :green_heart: | compile | 0m 35s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 0m 28s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 0m 40s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 0m 30s | | branch-3.3 passed |
| +1 :green_heart: | spotbugs | 1m 14s | | branch-3.3 passed |
| +1 :green_heart: | shadedclient | 21m 17s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 29s | | the patch passed |
| +1 :green_heart: | javac | 0m 29s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 19s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67) |
| +1 :green_heart: | mvnsite | 0m 29s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 22s | | the patch passed |
| -1 :x: | spotbugs | 1m 15s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | shadedclient | 21m 33s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 78m 12s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 :green_heart: | asflicense | 0m 23s | | The patch does not generate ASF License warnings. |
| | | | 167m 20s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| | Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:[line 1131] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMHA |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux f16f271e28e6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / 5ae791898e1e8d053e7aebefd0532ff533b09087 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job
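[Editor's note] The SpotBugs finding above flags a discarded return value from ExecutorService.submit(Callable). A minimal, hypothetical sketch of the pattern and its fix (SubmitResultDemo is an illustrative name, not the ResourceManager code): when the Future from submit() is thrown away, any exception raised inside the Callable is silently lost; keeping the Future lets the caller observe failures via get().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class SubmitResultDemo {
    // The anti-pattern SpotBugs flags: the Future returned by submit() is
    // discarded, so an exception thrown by the task vanishes unobserved.
    static void fireAndForget(ExecutorService pool, Callable<Void> task) {
        pool.submit(task); // RV_RETURN_VALUE_IGNORED_BAD_PRACTICE
    }

    // One way to satisfy the checker: return the Future so the caller can
    // call get(), which rethrows task failures wrapped in ExecutionException.
    static Future<Void> fireAndTrack(ExecutorService pool, Callable<Void> task) {
        return pool.submit(task);
    }
}
```

An alternative that also silences the warning is submitting a Runnable that handles its own exceptions internally, so there is genuinely nothing to observe in the Future.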
[jira] [Created] (YARN-11648) CapacityScheduler does not activate applications when resources are released from another Leaf Queue
Brian Goerlitz created YARN-11648:
-------------------------------------

             Summary: CapacityScheduler does not activate applications when resources are released from another Leaf Queue
                 Key: YARN-11648
                 URL: https://issues.apache.org/jira/browse/YARN-11648
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler
            Reporter: Brian Goerlitz

Create a queue with low minimum capacity and high maximum capacity. If multiple apps are submitted to the queue such that the queue's Max AM resource limit is exceeded while other cluster resources are consumed by different queues, these apps will not be considered for activation when cluster resources from the other queues are freed. As the AM limit is calculated based on available resources for the queue, these apps should be activated.
[jira] [Commented] (YARN-11648) CapacityScheduler does not activate applications when resources are released from another Leaf Queue
[ https://issues.apache.org/jira/browse/YARN-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808391#comment-17808391 ]

Brian Goerlitz commented on YARN-11648:
---------------------------------------

This scenario is relatively rare, as app activation in a queue is currently triggered by 3 conditions:
* reconfiguration of the queue or cluster resources
* application submission in the queue
* application completion in the queue

Certain use cases that are sensitive to deadlocks can consistently encounter this problem. For example, if the low-capacity queue is used for scheduled Oozie jobs while the rest of the cluster is periodically consumed by other higher-priority workloads, the Oozie AM will wait for its child job to complete, while that child job may be stuck in a pending state until the next scheduled job is submitted or a user notices the issue. A similar situation can be reproduced with DistributedShell by using all but 2 containers' worth of resources in a different queue, then submitting the following to the low-capacity queue ("queueA").
{noformat}
yarn jar hadoop-yarn-applications-distributedshell.jar -jar hadoop-yarn-applications-distributedshell.jar -shell_command "yarn jar /path/to/hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.queueA 10 100" -num_containers 1 -queue root.queueA
{noformat}
The DistributedShell job will run and submit the MR job. The MR job will be pending as the queue AM limit is reached. The situation remains the same even after the workloads in the other queues have completed.

> CapacityScheduler does not activate applications when resources are released from another Leaf Queue
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-11648
>                 URL: https://issues.apache.org/jira/browse/YARN-11648
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Brian Goerlitz
>            Priority: Major
>
> Create a queue with low minimum capacity and high maximum capacity. If multiple apps are submitted to the queue such that the Queue's Max AM resource limit is exceeded while other cluster resources are consumed by different queues, these apps will not be considered for activation when cluster resources from the other queues are freed. As the AM limit is calculated based on available resources for the queue, these apps should be activated.
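[Editor's note] The scenario above needs a queue whose guaranteed capacity is far below its maximum. A hedged capacity-scheduler.xml sketch of such a configuration; the queue name "queueA" follows the reproduction command, and all percentages are illustrative examples, not values taken from the report.

```xml
<!-- Illustrative capacity-scheduler.xml fragment; values are examples only. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,queueA</value>
</property>
<property>
  <!-- low guaranteed (minimum) capacity -->
  <name>yarn.scheduler.capacity.root.queueA.capacity</name>
  <value>5</value>
</property>
<property>
  <!-- high maximum capacity, so the queue can expand into idle resources -->
  <name>yarn.scheduler.capacity.root.queueA.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <!-- the Max AM resource limit that gates application activation is
       derived from this fraction of the queue's effective capacity -->
  <name>yarn.scheduler.capacity.root.queueA.maximum-am-resource-percent</name>
  <value>0.1</value>
</property>
```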
[jira] [Commented] (YARN-11607) TestTimelineAuthFilterForV2 fails intermittently
[ https://issues.apache.org/jira/browse/YARN-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808473#comment-17808473 ]

ASF GitHub Bot commented on YARN-11607:
---------------------------------------

slfan1989 commented on PR #6459:
URL: https://github.com/apache/hadoop/pull/6459#issuecomment-1899819100

> Tried running the testcase 100 times in a row but the testcase failure cannot be reproduced. The sleep interval between retries is relatively short; I believe the execution time of the asynchronous call might be insufficient.

@susheelgupta7 Thank you for your contribution! Although this may not be a complete solution, I couldn't think of a better way either. Hopefully, there will be a better solution in the future.

> TestTimelineAuthFilterForV2 fails intermittently
> ------------------------------------------------
>
>                 Key: YARN-11607
>                 URL: https://issues.apache.org/jira/browse/YARN-11607
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Assignee: Susheel Gupta
>            Priority: Major
>              Labels: pull-request-available
>
> Ref: https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1398/testReport/junit/org.apache.hadoop.yarn.server.timelineservice.security/TestTimelineAuthFilterForV2/testPutTimelineEntities_boolean__boolean__3_/
> {noformat}
> org.opentest4j.AssertionFailedError: expected: <2> but was: <1>
>     at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
>     at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>     at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527)
>     at org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishAndVerifyEntity(TestTimelineAuthFilterForV2.java:324)
>     at org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishWithRetries(TestTimelineAuthFilterForV2.java:337)
>     at org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:383)
> {noformat}
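[Editor's note] The discussion above attributes the flakiness to a retry loop whose sleep is too short for a slow asynchronous publish. A minimal, hypothetical retry helper (not the actual TestTimelineAuthFilterForV2 code; `RetryUtil.waitFor` and its parameters are illustrative) showing a bounded exponential backoff, which grants slow async work more total time without one long fixed sleep:

```java
import java.util.function.BooleanSupplier;

public class RetryUtil {
    // Polls `condition` up to `maxAttempts` times, doubling the sleep after
    // each miss (capped at 5s). Returns true as soon as the condition holds,
    // with one final check after the last sleep.
    public static boolean waitFor(BooleanSupplier condition,
                                  long initialSleepMs,
                                  int maxAttempts) throws InterruptedException {
        long sleepMs = initialSleepMs;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(sleepMs);
            sleepMs = Math.min(sleepMs * 2, 5000);
        }
        return condition.getAsBoolean();
    }
}
```

A condition-polling helper like this keeps the happy path fast (it returns on the first successful check) while giving the unlucky runs the extra time a fixed short sleep denies them.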