[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885650#comment-17885650 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on code in PR #7065: URL: https://github.com/apache/hadoop/pull/7065#discussion_r1779890901

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java:
## @@ -757,10 +757,9 @@ private void completeOustandingUpdatesWhichAreReserved(
       RMContainer rmContainer, ContainerStatus containerStatus,
       RMContainerEventType event) {
     N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
-    if (schedulerNode != null &&
-        schedulerNode.getReservedContainer() != null) {
+    if (schedulerNode != null) {
       RMContainer resContainer = schedulerNode.getReservedContainer();
-      if (resContainer.getReservedSchedulerKey() != null) {
+      if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {

Review Comment: Thanks @zeekling for the review. I'm not sure why your environment still reported an NPE after a change like that: `resContainer.getReservedSchedulerKey()` can no longer throw an NPE, because the preceding not-null check short-circuits when `resContainer` is null. Could you please attach some details of the NPE and your changes? For this change, I think NPEs should be fixed rather than caught, in general.
> Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
> --
>
> Key: YARN-11732
> URL: https://issues.apache.org/jira/browse/YARN-11732
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.4.0, 3.2.4, 3.3.6, 3.5.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Labels: pull-request-available
>
> I found some places that call *SchedulerNode#getReservedContainer* to get the reservedContainer (returned value) but do not perform a sanity (not-null) check before calling its internal methods, which risks raising a NullPointerException if it is null.
> Most of these places rely on the premise that the node had a reserved container a few moments ago, but the next call to *SchedulerNode#getReservedContainer* may return null, since the reservedContainer can be set to null concurrently by the scheduling and monitoring (preemption) threads. So a not-null check should be done before calling internal methods of the reservedContainer.
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
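The race described above can be sketched with toy stand-ins (`MiniSchedulerNode` and `MiniRMContainer` are hypothetical names for illustration, not the real scheduler classes): reading the reservation twice is racy, because a preemption thread may clear it between the two reads, while snapshotting it once into a local variable makes the not-null check reliable.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for RMContainer, reduced to an id.
class MiniRMContainer {
    private final String containerId;
    MiniRMContainer(String id) { this.containerId = id; }
    String getContainerId() { return containerId; }
}

// Hypothetical stand-in for SchedulerNode: the reserved container can be
// cleared concurrently by a monitoring (preemption) thread.
class MiniSchedulerNode {
    private final AtomicReference<MiniRMContainer> reserved = new AtomicReference<>();

    MiniRMContainer getReservedContainer() { return reserved.get(); }
    void setReservedContainer(MiniRMContainer c) { reserved.set(c); }

    // UNSAFE: two reads of the shared reference; the second read may observe
    // null even though the first one did not, raising an NPE under contention.
    String describeUnsafe() {
        if (getReservedContainer() != null) {
            return "reserved=" + getReservedContainer().getContainerId();
        }
        return "none";
    }

    // SAFE: snapshot the reference once into a local, then null-check the
    // snapshot; the local cannot change underneath us.
    String describeSafe() {
        MiniRMContainer snapshot = getReservedContainer();
        if (snapshot != null) {
            return "reserved=" + snapshot.getContainerId();
        }
        return "none";
    }
}
```

The single-threaded behavior of both methods is identical; the difference only shows up under concurrent clearing, which is why the snapshot variant is the safer default.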
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885649#comment-17885649 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on code in PR #7065: URL: https://github.com/apache/hadoop/pull/7065#discussion_r1779890901

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java:
## @@ -757,10 +757,9 @@ private void completeOustandingUpdatesWhichAreReserved(
       RMContainer rmContainer, ContainerStatus containerStatus,
       RMContainerEventType event) {
     N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
-    if (schedulerNode != null &&
-        schedulerNode.getReservedContainer() != null) {
+    if (schedulerNode != null) {
       RMContainer resContainer = schedulerNode.getReservedContainer();
-      if (resContainer.getReservedSchedulerKey() != null) {
+      if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {

Review Comment: Thanks @zeekling for the review. I'm not sure why your environment still reported an NPE after a change like that: `resContainer.getReservedSchedulerKey()` can no longer throw an NPE when `resContainer` is null. Could you please attach some details of the NPE and your changes? For this change, I think NPEs should be fixed rather than caught, in general.
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885644#comment-17885644 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on PR #7065: URL: https://github.com/apache/hadoop/pull/7065#issuecomment-2381073760

@Hexiaoqiao Thanks for the review. I considered adding test cases, but all of these changes address race conditions confined to a single method (private visibility, or entirely local to the method), such as:

`if (node.getReservedContainer() != null) { LOG.info("... container=" + node.getReservedContainer().getContainerId()); }`

It's hard to reproduce the NPE deterministically in test cases.
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885593#comment-17885593 ] ASF GitHub Bot commented on YARN-11732: --- zeekling commented on code in PR #7065: URL: https://github.com/apache/hadoop/pull/7065#discussion_r1779503372

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java:
## @@ -757,10 +757,9 @@ private void completeOustandingUpdatesWhichAreReserved(
       RMContainer rmContainer, ContainerStatus containerStatus,
       RMContainerEventType event) {
     N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
-    if (schedulerNode != null &&
-        schedulerNode.getReservedContainer() != null) {
+    if (schedulerNode != null) {
       RMContainer resContainer = schedulerNode.getReservedContainer();
-      if (resContainer.getReservedSchedulerKey() != null) {
+      if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {

Review Comment: I recommend adding a try/catch block instead. I made this change in a production environment before, but it still reported a NullPointerException.
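To illustrate the trade-off discussed in this review thread with toy code (`ReservationReader` is a hypothetical class, not the actual scheduler API): catching the NullPointerException silences the symptom but also swallows NPEs from any unrelated bug in the guarded block, whereas reading the shared reference once into a local removes the race at its source.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy holder for a reservation id that another thread may clear at any time.
class ReservationReader {
    private final AtomicReference<String> reservedId = new AtomicReference<>();

    void reserve(String id) { reservedId.set(id); }
    void unreserve() { reservedId.set(null); }

    // Anti-pattern: relies on catching the NPE; this also hides NPEs caused
    // by genuinely broken code elsewhere in the try block.
    String readWithCatch() {
        try {
            return "reserved=" + reservedId.get().toString();
        } catch (NullPointerException e) {
            return "none";
        }
    }

    // Preferred: read the shared reference exactly once, then null-check the
    // local snapshot; no window remains for a concurrent clear to cause an NPE.
    String readWithSnapshot() {
        String snapshot = reservedId.get();
        return snapshot != null ? "reserved=" + snapshot : "none";
    }
}
```

This is why the fix in the PR hoists `getReservedContainer()` into a local `resContainer` and null-checks that, rather than wrapping the call in a try/catch.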
[jira] [Commented] (YARN-11719) The job is stuck in the new state.
[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885577#comment-17885577 ] ASF GitHub Bot commented on YARN-11719: --- hadoop-yetus commented on PR #7077: URL: https://github.com/apache/hadoop/pull/7077#issuecomment-2380640345

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 16m 57s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 44m 48s | | trunk passed |
| +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 56s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 2s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 2m 0s | | trunk passed |
| +1 :green_heart: | shadedclient | 36m 26s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 50s | | the patch passed |
| +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 0m 55s | | the patch passed |
| +1 :green_heart: | compile | 0m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 0m 48s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 41s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7077/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 32 unchanged - 0 fixed = 33 total (was 32) |
| +1 :green_heart: | mvnsite | 0m 50s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 2m 0s | | the patch passed |
| +1 :green_heart: | shadedclient | 35m 39s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 109m 46s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
| | | 259m 19s | | |

| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7077/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7077 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c52bb8302dea 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 79eb1bd99287b784ec6b7cc44cf9fa22c1cea2bb |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Result
[jira] [Updated] (YARN-11719) The job is stuck in the new state.
[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11719: -- Labels: pull-request-available (was: )

> The job is stuck in the new state.
> --
>
> Key: YARN-11719
> URL: https://issues.apache.org/jira/browse/YARN-11719
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: zeekling
> Priority: Major
> Labels: pull-request-available
>
> After I restarted the Router in the production environment, several jobs remained in the NEW state, and I found the related logs:
>
> {code:java}
> 2024-08-30 00:12:41,380 | WARN | DelegationTokenRenewer #667 | Unable to add the application to the delegation token renewer. | DelegationTokenRenewer.java:1215
> java.io.IOException: Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nsfed, Ident: (token for admintest: HDFS_DELEGATION_TOKEN owner=admintest@9FCE074E_691F_480F_98F5_58C1CA310829.COM, renewer=mapred, realUser=, issueDate=1724947875776, maxDate=1725552675776, sequenceNumber=156, masterKeyId=116)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:641)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$2200(DelegationTokenRenewer.java:86)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:1211)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:1188)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.InterruptedIOException: Retry interrupted
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:141)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:112)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
> at com.sun.proxy.$Proxy96.renewDelegationToken(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient$Renewer.renew(DFSClient.java:849)
> at org.apache.hadoop.security.token.Token.renew(Token.java:498)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:771)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:768)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1890)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:767)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:627)
> ... 8 more
> Caused by: java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:135)
> ... 20 more
> 2024-08-30 00:12:41,380 | WARN | DelegationTokenRenewer #667 | AsyncDispatcher thread interrupted | AsyncDispatcher.java:437
> java.lang.InterruptedException
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1233)
> at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:434)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:1221)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:1188)
>
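The stack trace above shows the renewal retry loop dying on an `InterruptedIOException` ("Retry interrupted"), after which the app's submit event is never re-driven and the job stays in NEW. A minimal sketch of the underlying pattern (class and method names here are illustrative, not the actual `DelegationTokenRenewer` API): a retrying renew attempt that distinguishes an interrupt from a genuine renewal failure, restoring the interrupt flag so the caller can decide to re-process the submission rather than silently dropping it.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.Callable;

// Illustrative retry helper (hypothetical, not Hadoop's RetryInvocationHandler):
// transient IOExceptions are retried with a fixed backoff, while an interrupt
// is surfaced as InterruptedException with the thread's interrupt flag restored.
class RenewAttempt {
    static <T> T renewWithRetries(Callable<T> renew, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return renew.call();
            } catch (InterruptedIOException e) {
                // Do not swallow the interrupt: restore the flag and let the
                // caller decide whether to re-drive the whole submission.
                Thread.currentThread().interrupt();
                throw new InterruptedException("token renewal interrupted");
            } catch (IOException e) {
                last = e;               // transient failure: back off and retry
                Thread.sleep(backoffMs);
            } catch (Exception e) {
                throw new IOException("token renewal failed", e);
            }
        }
        if (last == null) {
            throw new IOException("no renewal attempts were made");
        }
        throw last;
    }
}
```

The key point for this bug report is the interrupt branch: if the interrupt is converted into a generic failure (or caught and ignored), the submit event that should move the app out of NEW is lost.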
[jira] [Commented] (YARN-11719) The job is stuck in the new state.
[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885553#comment-17885553 ] ASF GitHub Bot commented on YARN-11719: --- zeekling opened a new pull request, #7077: URL: https://github.com/apache/hadoop/pull/7077

### Description of PR

PR for https://issues.apache.org/jira/browse/YARN-11719

### How was this patch tested?

### For code changes:

- [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884818#comment-17884818 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on PR #7065: URL: https://github.com/apache/hadoop/pull/7065#issuecomment-2375553315 @szilard-nemeth @brumi1024 Could you please help to review this PR?
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884495#comment-17884495 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2372879525 @shameersss1 Thanks for the contribution! @aajisaka @zeekling Thanks for the review!

> Fix Yarn over allocating containers
> ---
>
> Key: YARN-11702
> URL: https://issues.apache.org/jira/browse/YARN-11702
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, fairscheduler, scheduler, yarn
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> *Replication Steps:*
> Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)
>
> {code:java}
> spark.executor.memory 1024M
> spark.driver.memory 2048M
> spark.executor.cores 1
> spark.executor.instances 20
> spark.dynamicAllocation.enabled false{code}
>
> Based on this setup, there should be 20 Spark executors, but from the ResourceManager (RM) UI I could see that 32 executors were allocated and 12 of them were released within seconds. On analyzing the Spark ApplicationMaster (AM) logs, the following lines were observed.
>
> {code:java}
> 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom resources: vCores:2147483647>
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, launching executors on 4 of them.
> 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, launching executors on 0 of them.
> {code}
> It was clear from the logs that the 12 extra allocated containers were being ignored on the Spark side. In order to debug this further, additional log lines were added to the [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] class in the increment and decrement of container requests, to expose additional information about each request.
>
> {code:java}
> 2024-06-24 14:10:14,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,111 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingConta
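The log lines above show the pending count being incremented by the ask size (20) and decremented once per allocation. A toy counter (`PendingAsk` here is a hypothetical illustration, not YARN's actual `AppSchedulingInfo` bookkeeping) shows the guard that the scheduler needs: once the pending count reaches zero, further allocations must be refused, otherwise extra containers are handed out and immediately released by the AM, as observed in this issue.

```java
// Toy pending-ask counter mirroring the logged behavior: incremented by the
// requested count, decremented once per successful allocation, and guarded
// against allocating past the outstanding demand.
class PendingAsk {
    private int pending;

    void increment(int n) { pending += n; }

    // Returns true only while there is still outstanding demand;
    // this check is what prevents over-allocation.
    boolean tryAllocate() {
        if (pending <= 0) {
            return false;
        }
        pending--;
        return true;
    }

    int getPending() { return pending; }
}
```

In the real scheduler this accounting is per SchedulerRequestKey and must be consistent across the IPC handler thread (incrementing the ask) and the scheduler event thread (decrementing on allocation), which is where the over-allocation window opened.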
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884494#comment-17884494 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 merged PR #6990: URL: https://github.com/apache/hadoop/pull/6990
In order to debug this further, additional log lines > were added to the > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 
SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000
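The log pattern above (a request for 20 executors followed by 32 allocations) is consistent with grants being committed against a pending-container count whose decrements are applied asynchronously by the scheduler event processor. The following is a hypothetical sketch of that window, not the actual Hadoop code; the method name and the lag model are illustrative assumptions only.

```java
// Hypothetical sketch (not Hadoop code): if the decrement of the visible
// pending-container count lags behind by `lag` grants, the scheduler keeps
// granting against a stale count and extra containers slip through.
public class OverAllocationSketch {
    // Total containers granted when decrements trail the grants by `lag`.
    static int allocatedWithLag(int requested, int lag) {
        int pending = requested;
        int allocated = 0;
        while (pending > 0) {
            allocated++;
            if (lag > 0) {
                lag--;      // decrement delayed: pending count stays stale
            } else {
                pending--;  // decrement finally lands
            }
        }
        return allocated;
    }

    public static void main(String[] args) {
        // 20 requested plus 12 grants whose decrement had not landed yet.
        System.out.println(allocatedWithLag(20, 12)); // 32, matching the RM UI
    }
}
```

With a lag of 12 this reproduces the observed 32 allocations for a 20-container request; with no lag it grants exactly 20.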
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884322#comment-17884322 ] ASF GitHub Bot commented on YARN-11733: --- brumi1024 merged PR #7069: URL: https://github.com/apache/hadoop/pull/7069 > Fix the order of updating CPU controls with cgroup v1 > - > > Key: YARN-11733 > URL: https://issues.apache.org/jira/browse/YARN-11733 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > After YARN-11674 (Update CpuResourceHandler implementation for cgroup v2 > support) the order of updating the cpu.cfs_period_us and cpu.cfs_quota_us > controls has changed, which can cause the errors below when launching > containers with CPU limits on cgroup v1: > {code:java} > PrintWriter unable to write to > /var/cgroupv1/cpu/hadoop-yarn/container_e02_1727079571170_0040_02_01/cpu.cfs_quota_us > with value: 112500{code} > > *Reproduction:* > I set CPU limits in yarn-site.xml for cgroup: > {code:java} > yarn.nodemanager.resource.percentage-physical-cpu-limit: 90 > yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage: > true{code} > After that the limits were applied on the hadoop-yarn root hierarchy: > {code:java} > root@pszucs-test-2 hadoop-yarn]# cat cpu.cfs_period_us 100 > root@pszucs-test-2 hadoop-yarn]# cat cpu.cfs_quota_us 90 > {code} > When I tried to launch a container it gave me the following error: > {code:java} > PrintWriter unable to write to > /var/cgroupv1/cpu/hadoop-yarn/container_e02_1727079571170_0040_02_01/cpu.cfs_quota_us > with value: 112500{code} > It is because the container tries to exceed the limit defined at a higher level > with the 112 500 value for cfs_quota_us. 
If I try to create a test cgroup > manually and update this control, it only lets me do that up to the value > of 90 000 as well: > {code:java} > [root@pszucs-test-2 hadoop-yarn]# cat test/cpu.cfs_period_us > 10 > [root@pszucs-test-2 hadoop-yarn]# echo "90001" > test/cpu.cfs_quota_us > -bash: echo: write error: Invalid argument > [root@pszucs-test-2 hadoop-yarn]# echo "9" > test/cpu.cfs_quota_us{code} > > *Solution:* > The cause of this issue is that the cfs_period_us control gets the default > value of 100 000 when a new cgroup is created, but when YARN calculates the > limit, it uses 1 000 000 for that. Because of this we need to update > cpu.cfs_period_us before cpu.cfs_quota_us, to keep the ratio between the two > values and not exceed the limit defined at the parent level. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
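The ordering constraint described above can be sketched as follows. This is an illustration only, not the actual Hadoop `CpuResourceHandler` code: the class, method names, and the temp-directory stand-in for a cgroup directory are all hypothetical. The point it shows is that when YARN computes a quota against a 1 000 000 µs period, cpu.cfs_period_us must be written before cpu.cfs_quota_us, because a freshly created cgroup defaults to a 100 000 µs period and the kernel rejects a quota that would exceed the parent's quota/period ratio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch (hypothetical names, not Hadoop code): write the
// period first, then the quota computed against that same period, so the
// kernel's ratio check against the parent limit is applied consistently.
public class CfsQuotaWriter {
    static final int YARN_PERIOD_US = 1_000_000; // period YARN assumes

    // e.g. 90% of one core over YARN's period -> quota of 900 000 us
    static int quotaForCpuShare(double cpuShare) {
        return (int) (YARN_PERIOD_US * cpuShare);
    }

    // Order matters: cpu.cfs_period_us before cpu.cfs_quota_us.
    static void applyCpuLimit(Path cgroupDir, double cpuShare) throws IOException {
        Files.write(cgroupDir.resolve("cpu.cfs_period_us"),
                Integer.toString(YARN_PERIOD_US).getBytes());
        Files.write(cgroupDir.resolve("cpu.cfs_quota_us"),
                Integer.toString(quotaForCpuShare(cpuShare)).getBytes());
    }

    public static void main(String[] args) throws IOException {
        // A temp directory stands in for a cgroup directory here.
        Path dir = Files.createTempDirectory("fake-cgroup");
        applyCpuLimit(dir, 0.9);
        System.out.println(Files.readAllLines(dir.resolve("cpu.cfs_period_us")).get(0));
        System.out.println(Files.readAllLines(dir.resolve("cpu.cfs_quota_us")).get(0));
    }
}
```

Note how the 112 500 value from the error above corresponds to 1.125 CPUs against the stale default 100 000 µs period; against YARN's 1 000 000 µs period the same share yields a quota the parent limit accepts.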
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884321#comment-17884321 ] ASF GitHub Bot commented on YARN-11733: --- brumi1024 commented on PR #7069: URL: https://github.com/apache/hadoop/pull/7069#issuecomment-2371595567 Thanks @p-szucs for the patch, LGTM. Merging to trunk.
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884291#comment-17884291 ] ASF GitHub Bot commented on YARN-11733: --- hadoop-yetus commented on PR #7069: URL: https://github.com/apache/hadoop/pull/7069#issuecomment-2371415509 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 50m 12s | | trunk passed | | +1 :green_heart: | compile | 1m 40s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 39s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 45s | | trunk passed | | +1 :green_heart: | javadoc | 0m 47s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 38s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 29s | | trunk passed | | +1 :green_heart: | shadedclient | 38m 12s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 35s | | the patch passed | | +1 :green_heart: | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 25s | | the patch passed | | +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 19s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 28s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed | | +1 :green_heart: | javadoc | 0m 38s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 27s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 25s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 29s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 169m 51s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7069 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 55805aae17e5 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 4c87c3db99c44628b07ce123c8fa43be5dc18bde | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/2/testReport/ | | Max. process+thread count | 527 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884276#comment-17884276 ] ASF GitHub Bot commented on YARN-11733: --- hadoop-yetus commented on PR #7069: URL: https://github.com/apache/hadoop/pull/7069#issuecomment-2371250866 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 49s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 51m 55s | | trunk passed | | +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 38s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 50s | | trunk passed | | +1 :green_heart: | javadoc | 0m 48s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 39s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 30s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 17s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 34s | | the patch passed | | +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 26s | | the patch passed | | +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 18s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 27s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 36s | | the patch passed | | +1 :green_heart: | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 31s | | the patch passed | | +1 :green_heart: | shadedclient | 41m 52s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 33s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 175m 4s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7069 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux effaca3fa0c9 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 68828951a523a75632637671dc9a8de6be9d6469 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/1/testReport/ | | Max. process+thread count | 614 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/1/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884211#comment-17884211 ] ASF GitHub Bot commented on YARN-11733: --- p-szucs opened a new pull request, #7069: URL: https://github.com/apache/hadoop/pull/7069 Change-Id: I09429c878c124be9d6a09e8f027ab89d34606f2f ### Description of PR When using cgroup v1, the cpu.cfs_period_us control gets the default value of 100 000 when a new cgroup is created, but when YARN calculates the limit, it uses 1 000 000 for that. Because of this we need to update cpu.cfs_period_us before cpu.cfs_quota_us, to keep the ratio between the two values and not exceed the limit defined at the parent level. ### How was this patch tested? Unit test ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? 
[jira] [Updated] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11733: -- Labels: pull-request-available (was: )
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884186#comment-17884186 ] ASF GitHub Bot commented on YARN-11732: --- hadoop-yetus commented on PR #7065: URL: https://github.com/apache/hadoop/pull/7065#issuecomment-2370643050 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 18m 16s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 49m 25s | | trunk passed | | +1 :green_heart: | compile | 1m 4s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 56s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 0s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 48s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 33s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 50s | | the patch passed | | +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 55s | | the patch passed | | +1 :green_heart: | compile | 0m 46s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 46s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 45s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 50s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | the patch passed | | +1 :green_heart: | shadedclient | 41m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 108m 50s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 273m 38s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7065/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7065 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux f342e868de9d 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 38b63adff112620f11ebbb0089d9d181cbd7b5fe | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7065/1/testReport/ | | Max. process+thread count | 937 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7065/1/console
[jira] [Updated] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11732: -- Labels: pull-request-available (was: ) > Potential NPE when calling SchedulerNode#reservedContainer for > CapacityScheduler > > > Key: YARN-11732 > URL: https://issues.apache.org/jira/browse/YARN-11732 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0, 3.2.4, 3.3.6, 3.5.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: pull-request-available > > I found some places that call *SchedulerNode#getReservedContainer* to get the > reservedContainer (returned value) but do not perform a sanity (not-null) check before > calling its internal methods, which risks raising a > NullPointerException if it is null. > Most of these places have the premise that the node had a reserved container a few > moments ago, but they may get null from > *SchedulerNode#getReservedContainer* the next moment, since the > reservedContainer can be updated to null concurrently by the scheduling and > monitoring (preemption) threads. So a not-null check should be done before > calling internal methods of the reservedContainer.
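The race the description refers to can be sketched with simplified stand-ins for SchedulerNode and RMContainer (these are not the actual Hadoop classes; names and fields here are illustrative only). The safe pattern is to read the reserved container once into a local variable and null-check that local, since a second read may observe null after a concurrent unreserve.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for a SchedulerNode, illustrating why the reserved
// container must be read once into a local and null-checked there: another
// thread (e.g. the preemption monitor) may clear it between two reads.
class Node {
    private final AtomicReference<String> reservedContainer = new AtomicReference<>();

    void reserve(String container) { reservedContainer.set(container); }
    void unreserve() { reservedContainer.set(null); }
    String getReservedContainer() { return reservedContainer.get(); }

    // Unsafe would be: if (getReservedContainer() != null) use
    // getReservedContainer() -- the second read can return null.
    // Safe pattern: single read, then null check on the local.
    int reservedNameLengthSafe() {
        String res = getReservedContainer(); // read exactly once
        return (res != null) ? res.length() : -1;
    }
}

public class ReservedContainerDemo {
    public static void main(String[] args) {
        Node node = new Node();
        node.reserve("container_001");
        System.out.println(node.reservedNameLengthSafe()); // 13
        node.unreserve(); // simulates a concurrent clear by another thread
        System.out.println(node.reservedNameLengthSafe()); // -1, no NPE
    }
}
```

This is the same shape as the fix in PR #7065: hoist `getReservedContainer()` into a local `resContainer` and fold the null check into the condition that dereferences it.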
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884102#comment-17884102 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 opened a new pull request, #7065: URL: https://github.com/apache/hadoop/pull/7065 ### Description of PR Details please refer to YARN-11732. Add sanity check before calling internal methods of reservedContainer. ### How was this patch tested? Not necessary. ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Potential NPE when calling SchedulerNode#reservedContainer for > CapacityScheduler > > > Key: YARN-11732 > URL: https://issues.apache.org/jira/browse/YARN-11732 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0, 3.2.4, 3.3.6, 3.5.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > > I found some places calling *SchedulerNode#getReservedContainer* to get > reservedContainer (returned value) but not do sanity(not-null) check before > calling internal methods of it, which can have a risk to raise > NullPointerException if it's null. > Most of these places have a premise that node has reserved container a few > moments ago, but may getting null by calling > *SchedulerNode#getReservedContainer* in the next moment, since the > reservedContainer can be updated to null concurrently in scheduling and > monitoring(preemption) thread. 
So a not-null check should be done before > calling internal methods of the reservedContainer. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
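The race described in YARN-11732 can be illustrated with a minimal, self-contained sketch. The `SchedulerNode` and `RMContainer` classes below are simplified stand-ins for the real YARN types, not the actual implementations: the point is that reading the volatile reference twice is unsafe (a preemption thread may null it between the check and the use), while copying it into a local variable once makes the not-null check reliable.

```java
// Simplified stand-in for YARN's RMContainer.
class RMContainer {
    Object getReservedSchedulerKey() { return new Object(); }
}

// Simplified stand-in for YARN's SchedulerNode; the reserved container can
// be cleared concurrently by the monitoring (preemption) thread.
class SchedulerNode {
    private volatile RMContainer reservedContainer;

    void setReservedContainer(RMContainer c) { reservedContainer = c; }
    RMContainer getReservedContainer() { return reservedContainer; }
}

class ReservationReader {
    // Unsafe: two separate reads of the volatile field. Another thread may
    // set it to null between the check and the dereference, raising an NPE.
    static Object unsafeRead(SchedulerNode node) {
        if (node.getReservedContainer() != null) {
            return node.getReservedContainer().getReservedSchedulerKey();
        }
        return null;
    }

    // Safe: read the field once into a local, then check the local before use.
    static Object safeRead(SchedulerNode node) {
        RMContainer res = node.getReservedContainer();
        if (res != null && res.getReservedSchedulerKey() != null) {
            return res.getReservedSchedulerKey();
        }
        return null;
    }
}
```

The merged patch follows the same shape as `safeRead`: capture the reference, then guard every dereference through the captured local.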
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883860#comment-17883860 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2368101194 @shameersss1 Thanks for the contribution! If there are no other comments in the next 2 days, we will merge this PR to the trunk branch. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 Spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released within seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear from the logs that the 12 extra allocated containers are being > ignored on the Spark side. In order to debug this further, additional log lines > were added to the > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Upd
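The log excerpt above shows the pending-container counter being incremented by the IPC handler thread and decremented one at a time by the scheduler event processor. As a hedged illustration of why this bookkeeping must be race-free (a simplified model, not YARN's actual AppSchedulingInfo logic): if several threads satisfy the same ask, a naive read-check-decrement can hand out more containers than requested, whereas an atomic claim cannot.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of a pending-container ask. The real AppSchedulingInfo
// is more involved; this only demonstrates the claim-one-atomically pattern.
class PendingAsk {
    private final AtomicInteger pending;

    PendingAsk(int count) { pending = new AtomicInteger(count); }

    // Atomically claim one container only while some are still pending, so
    // concurrent threads can never grant more containers than were asked for.
    boolean tryAllocate() {
        while (true) {
            int cur = pending.get();
            if (cur <= 0) {
                return false;          // nothing left to allocate
            }
            if (pending.compareAndSet(cur, cur - 1)) {
                return true;           // claimed exactly one container
            }
            // lost the race to another thread; retry with the fresh value
        }
    }

    int remaining() { return pending.get(); }
}
```

With this pattern, 32 allocation attempts against an ask of 20 grant exactly 20 containers; the 12 extra attempts fail instead of producing surplus containers that the AM then releases.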
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883741#comment-17883741 ] ASF GitHub Bot commented on YARN-11560: --- ayushtkn merged PR #6021: URL: https://github.com/apache/hadoop/pull/6021 > Fix NPE bug when multi-node enabled with schedule asynchronously > > > Key: YARN-11560 > URL: https://issues.apache.org/jira/browse/YARN-11560 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.3 >Reporter: wangzhongwei >Assignee: wangzhongwei >Priority: Blocker > Labels: pull-request-available > > When multiNodePlacementEnabled is set and the global scheduler is used, an NPE may happen when the > commit thread calls allocateFromReservedContainer with the param > reservedContainer, while the container may have been unreserved by the judgment > thread in the tryCommit->apply function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883683#comment-17883683 ] ASF GitHub Bot commented on YARN-11560: --- hadoop-yetus commented on PR #6021: URL: https://github.com/apache/hadoop/pull/6021#issuecomment-2367248090 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 7m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 36m 35s | | trunk passed | | +1 :green_heart: | compile | 0m 34s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 32s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 32s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 35s | | trunk passed | | +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 33s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 17s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 43s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 27s | | the patch passed | | +1 :green_heart: | compile | 0m 30s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 30s | | the patch passed | | +1 :green_heart: | compile | 0m 27s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 27s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 24s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 31s | | the patch passed | | +1 :green_heart: | javadoc | 0m 27s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 27s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 10s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 20s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 89m 52s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 26s | | The patch does not generate ASF License warnings. 
| | | | 184m 24s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6021/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6021 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 10d0b101643f 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 056f22135781508f8b465660591046919b3b1cfb | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6021/4/testReport/ | | Max. process+thread count | 951 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6021/4/console
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883660#comment-17883660 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 merged PR #7049: URL: https://github.com/apache/hadoop/pull/7049 > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment: > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux): > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for the locking mechanism. > * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM > nodes, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not reported until their corresponding NodeManagers (NMs) > send their first heartbeat. However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value of > -1. > This design creates several challenges: > * *Untracked NodeManagers*: During ResourceManager HA failover or an RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. Similar behaviour exists in HDFS, > where such nodes are marked as _"DEAD"_. > * *Monitoring Gaps*: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in the > ResourceManager's view of the total number of nodes. > * *Operational Impact*: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as _LOST_, _UNHEALTHY_, > _DECOMMISSIONED_, etc. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that is not registered and not > part of the exclude file by default at RM startup or HA failover. This > can be done by marking the node with a special port value of _-2_, signaling > that the node is considered LOST but has not yet reported. Whenever a > heartbeat is received for that nodeID, it will be > transitioned from _LOST_ to _RUNNING_, _UNHEALTHY_, or any other > required state. > h3. Key implementation points > * Mark Unreported Nodes as LOST: nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as _LOST_. This > can be achieved by modifying the _NodesListManager_ under the > refreshHostsReader method, invoked during failover or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their desired state when a heartbeat is received. > * Handle Node Heartbeat and Transition: when a node sends its first > heartbeat, the system should verify if the node is listed in > getInactiveRMNodes(). If the node exists in the _LOST_ > state, the RM should remove it from the inactive list, decrement the _LOST_
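The proposal above amounts to a set difference over the include, exclude, and registered host lists. A rough sketch follows; all names here (`LostNodeMarker`, `markUnreported`) are hypothetical stand-ins for illustration, not the actual NodesListManager API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper sketching the proposed refresh logic: include-file
// hosts that never registered and are not excluded get reported as LOST
// with the sentinel port -2.
class LostNodeMarker {
    static final int UNREPORTED_PORT = -2;  // sentinel: LOST, never registered

    // Returns host -> port for the nodes that should be marked LOST.
    static Map<String, Integer> markUnreported(Set<String> includeHosts,
                                               Set<String> excludeHosts,
                                               Set<String> registeredHosts) {
        Map<String, Integer> lost = new HashMap<>();
        for (String host : includeHosts) {
            // skip hosts that already registered or are explicitly excluded
            if (!registeredHosts.contains(host) && !excludeHosts.contains(host)) {
                lost.put(host, UNREPORTED_PORT);
            }
        }
        return lost;
    }
}
```

On the first heartbeat from such a host, the RM would then remove the sentinel entry and transition the node to its observed state, as described in the heartbeat-handling point above.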
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883663#comment-17883663 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2367142932 @arjunmohnot Thanks for the contribution! @zeekling Thanks for the review!
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883648#comment-17883648 ] ASF GitHub Bot commented on YARN-11560: --- granewang commented on code in PR #6021: URL: https://github.com/apache/hadoop/pull/6021#discussion_r1770714881 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -1737,6 +1737,10 @@ private CSAssignment allocateContainerOnSingleNode( private void allocateFromReservedContainer(FiCaSchedulerNode node, boolean withNodeHeartbeat, RMContainer reservedContainer) { +if(reservedContainer == null){ + LOG.warn("reservedContainer is null,that may be unreserved by the proposal judgment thread"); Review Comment: Thanks for your review and pr updated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
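The guard in the diff above is an early return taken when the reservation was already released between the scheduling proposal and its commit. A minimal stand-alone sketch of the same pattern, with stand-in types and `java.util.logging` in place of the project's SLF4J logger:

```java
import java.util.logging.Logger;

// Stand-in for the commit-side allocation path; not CapacityScheduler itself.
class Allocator {
    private static final Logger LOG = Logger.getLogger(Allocator.class.getName());

    // Returns false when the reservation vanished before the commit,
    // instead of dereferencing a null container and throwing an NPE.
    static boolean allocateFromReserved(Object reservedContainer) {
        if (reservedContainer == null) {
            LOG.warning("reservedContainer is null; it may have been "
                + "unreserved by the proposal judgment thread");
            return false;
        }
        // ... proceed with allocation from the still-valid reservation ...
        return true;
    }
}
```

The warn-and-return keeps the asynchronous commit thread alive; the unreserve that raced ahead of it is a legitimate state, not an error.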
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883616#comment-17883616 ] ASF GitHub Bot commented on YARN-11702: --- hadoop-yetus commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2366846061 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 17m 56s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 43s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 33m 26s | | trunk passed | | +1 :green_heart: | compile | 7m 37s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 7m 10s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 2m 0s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 17s | | trunk passed | | +1 :green_heart: | javadoc | 3m 9s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 3m 0s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 6m 19s | | trunk passed | | +1 :green_heart: | shadedclient | 37m 2s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 4s | | the patch passed | | +1 :green_heart: | compile | 6m 56s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 6m 56s | | the patch passed | | +1 :green_heart: | compile | 6m 59s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 6m 59s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 51s | | the patch passed | | +1 :green_heart: | mvnsite | 2m 58s | | the patch passed | | +1 :green_heart: | javadoc | 2m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 2m 46s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 6m 36s | | the patch passed | | +1 :green_heart: | shadedclient | 37m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 1m 10s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 5m 51s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 110m 15s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 1m 0s | | The patch does not generate ASF License warnings. 
| | | | 326m 14s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6990/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6990 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 82d88f39fd1a 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / a1f82433186262429d84139f9e61c7653383b5e8 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6990/5/testReport/ | | Max. process+thread count | 956 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883615#comment-17883615 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2366843257 > > Hi @slfan1989, @zeekling, If there are no further concerns, could I kindly request an approval so we can merge this change? Thank you for the review. > > @arjunmohnot Thank you for your contributions! If there are no further comments by the end of this week, we will merge into the trunk branch. Thank you @slfan1989 for your thoughtful feedback and the time spent reviewing these changes. Your support is truly appreciated! If everything looks good and there are no further comments, merging this PR at your convenience would be a great help! 🚀
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883586#comment-17883586 ] ASF GitHub Bot commented on YARN-11702: --- shameersss1 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2366614555 @aajisaka @slfan1989 - I have addressed the latest comments - Please review Thanks > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear from the logs that the extra 12 allocated containers were being > ignored on the Spark side. In order to debug this further, additional log lines > were added to the > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1
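The log excerpt above can be condensed into a small, purely illustrative simulation of the race: the AM re-sends its full pending ask on every heartbeat, and if that ask arrives before the AM has observed containers the RM already allocated, the RM-side pending count is inflated and surplus containers are handed out. All class and method names below are hypothetical, and the numbers are illustrative rather than the exact ones from the logs:

```java
// Hypothetical sketch of the over-allocation race in YARN-11702.
// This is not Hadoop code; numbers are illustrative.
class OverAllocationSketch {

    /** Returns the number of surplus containers the RM hands out. */
    public static int simulate() {
        final int desired = 20;
        int amLaunched = 0;   // executors the AM has launched so far
        int rmAllocated = 0;  // containers the RM has handed out

        // Heartbeat 1: AM asks for desired - amLaunched = 20 containers.
        int rmPending = desired - amLaunched;

        // RM allocates 8 of them before the next heartbeat arrives.
        rmPending -= 8;
        rmAllocated += 8;

        // Heartbeat 2 races with the allocation above: the AM has not yet
        // seen the 8 containers, so it re-sends the full ask of 20,
        // resetting the RM-side pending count.
        rmPending = desired - amLaunched;  // back to 20, though 8 are in flight
        amLaunched += 8;                   // AM now launches on the 8

        // RM satisfies the inflated pending count in full.
        rmAllocated += rmPending;          // 20 more, 28 in total

        // AM launches only what it still needs and releases the rest.
        int stillNeeded = desired - amLaunched;  // 12
        amLaunched += stillNeeded;
        return rmAllocated - desired;            // surplus, released in seconds
    }

    public static void main(String[] args) {
        System.out.println("surplus containers released: " + simulate());
    }
}
```

Here the AM receives 28 containers for a 20-executor job, mirroring (at a smaller scale) the 32-received/12-released pattern in the logs above; the direction of the fix discussed in the PR is to have the RM recognize asks it has already satisfied rather than re-applying each heartbeat's pending count blindly.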
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883585#comment-17883585 ] ASF GitHub Bot commented on YARN-11702: --- shameersss1 commented on code in PR #6990: URL: https://github.com/apache/hadoop/pull/6990#discussion_r1770451559 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java: ## @@ -1678,4 +1794,78 @@ private List getAppsFromQueue(String queueName) } return apps; } + + /** + * ContainerObjectType is a container object with the following properties. + * Namely allocationId, priority, executionType and resourceType. + */ + protected class ContainerObjectType extends Object { +private final long allocationId; +private final Priority priority; +private final ExecutionType executionType; +private final Resource resource; + +public ContainerObjectType(long allocationId, Priority priority, +ExecutionType executionType, Resource resource) { + this.allocationId = allocationId; + this.priority = priority; + this.executionType = executionType; + this.resource = resource; +} + +public long getAllocationId() { + return allocationId; +} + +public Priority getPriority() { + return priority; +} + +public ExecutionType getExecutionType() { + return executionType; +} + +public Resource getResource() { + return resource; +} + +@Override +public int hashCode() { + final int prime = 31; + int result = 1; + result = (int) (prime * result + allocationId); + result = prime * result + (priority == null ? 0 : priority.hashCode()); + result = prime * result + (executionType == null ? 0 : executionType.hashCode()); + result = prime * result + (resource == null ? 
0 : resource.hashCode()); + return result; +} + +@Override +public boolean equals(Object obj) { + if (obj == null) { Review Comment: ack ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java: ## @@ -1678,4 +1794,78 @@ private List getAppsFromQueue(String queueName) } return apps; } + + /** + * ContainerObjectType is a container object with the following properties. + * Namely allocationId, priority, executionType and resourceType. + */ + protected class ContainerObjectType extends Object { +private final long allocationId; +private final Priority priority; +private final ExecutionType executionType; +private final Resource resource; + +public ContainerObjectType(long allocationId, Priority priority, +ExecutionType executionType, Resource resource) { + this.allocationId = allocationId; + this.priority = priority; + this.executionType = executionType; + this.resource = resource; +} + +public long getAllocationId() { + return allocationId; +} + +public Priority getPriority() { + return priority; +} + +public ExecutionType getExecutionType() { + return executionType; +} + +public Resource getResource() { + return resource; +} + +@Override +public int hashCode() { + final int prime = 31; + int result = 1; Review Comment: ack > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the 
setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24
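The ContainerObjectType key quoted in the diff above can be sketched in a self-contained form that also addresses the two review points acknowledged with "ack": the original hashCode cast the long allocationId to int (discarding the high bits), and equals opened with a bare null check. In the simplified stand-in below, Priority, ExecutionType and Resource are replaced by plain int/String fields purely for illustration; only the equals/hashCode contract is the point:

```java
import java.util.Objects;

// Simplified stand-in for the ContainerObjectType key in the PR diff.
// Priority/ExecutionType/Resource are plain fields here for illustration.
final class ContainerKey {
    private final long allocationId;
    private final int priority;
    private final String executionType;
    private final int memoryMb;

    ContainerKey(long allocationId, int priority,
                 String executionType, int memoryMb) {
        this.allocationId = allocationId;
        this.priority = priority;
        this.executionType = executionType;
        this.memoryMb = memoryMb;
    }

    @Override
    public int hashCode() {
        // Objects.hash folds the long via Long.hashCode (both halves of the
        // long contribute) instead of a lossy (int) cast, and handles any
        // null fields uniformly.
        return Objects.hash(allocationId, priority, executionType, memoryMb);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof ContainerKey)) {
            return false;  // also covers obj == null
        }
        ContainerKey other = (ContainerKey) obj;
        return allocationId == other.allocationId
            && priority == other.priority
            && Objects.equals(executionType, other.executionType)
            && memoryMb == other.memoryMb;
    }
}
```

With a correct contract, two keys built from the same request fields compare equal and collide in a hash-based collection, which is what lets the scheduler match a newly arrived ask against containers it has already allocated.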
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883583#comment-17883583 ] ASF GitHub Bot commented on YARN-11702: --- shameersss1 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2366453607 > The scheduler will not discard resource application requests, so why do we need to apply multiple times? @shameersss1 Yes, but AMRMClient in Hadoop works in a way that the AM always sends its pending container request as part of every heartbeat. This is done for two reasons: 1. The AM can dynamically change the request (e.g. Tez has auto parallelism and container reuse). 2. To make sure the AM always gets what it wants. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 Spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. 
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. > {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883451#comment-17883451 ] ASF GitHub Bot commented on YARN-11560: --- ayushtkn commented on code in PR #6021: URL: https://github.com/apache/hadoop/pull/6021#discussion_r1769492431 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -1737,6 +1737,10 @@ private CSAssignment allocateContainerOnSingleNode( private void allocateFromReservedContainer(FiCaSchedulerNode node, boolean withNodeHeartbeat, RMContainer reservedContainer) { +if(reservedContainer == null){ + LOG.warn("reservedContainer is null,that may be unreserved by the proposal judgment thread"); Review Comment: nit add space after , > Fix NPE bug when multi-node enabled with schedule asynchronously > > > Key: YARN-11560 > URL: https://issues.apache.org/jira/browse/YARN-11560 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.3 >Reporter: wangzhongwei >Assignee: wangzhongwei >Priority: Blocker > > when multiNodePlacementEnabled is set and the global scheduler is used, an NPE may happen when the > commit thread calls allocateFromReservedContainer with the param > reservedContainer, while the container may be unreserved by the judgment > thread in the tryCommit->apply function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
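The race behind this NPE (and the sibling YARN-11732) follows one pattern: the reserved container reference is read from the node more than once, and the scheduling or preemption thread can null it out between the check and the use. A minimal, self-contained sketch of the defensive read-once pattern follows; SchedulerNode and RMContainer here are toy stand-ins, not the Hadoop classes:

```java
// Toy stand-ins illustrating the read-once-then-null-check pattern
// used by the fixes for YARN-11560 / YARN-11732.
class ReservedContainerSketch {

    static class RMContainer {
        final String schedulerKey;
        RMContainer(String schedulerKey) { this.schedulerKey = schedulerKey; }
        String getReservedSchedulerKey() { return schedulerKey; }
    }

    static class SchedulerNode {
        // Set by the scheduling thread, cleared by the preemption monitor:
        // may flip to null between any two reads.
        volatile RMContainer reservedContainer;
        RMContainer getReservedContainer() { return reservedContainer; }
    }

    /** Returns the reserved key, or null if nothing is (still) reserved. */
    static String reservedKeyOrNull(SchedulerNode node) {
        // Read the volatile reference exactly once into a local; a second
        // getReservedContainer() call could observe a concurrent unreserve
        // and return null, which is exactly how the NPE arises.
        RMContainer res = node.getReservedContainer();
        if (res != null && res.getReservedSchedulerKey() != null) {
            return res.getReservedSchedulerKey();
        }
        return null;
    }

    public static void main(String[] args) {
        SchedulerNode node = new SchedulerNode();
        System.out.println(reservedKeyOrNull(node));   // no reservation: null, no NPE
        node.reservedContainer = new RMContainer("key-1");
        System.out.println(reservedKeyOrNull(node));
    }
}
```

This is the shape of the change TaoYang526 describes: hold the result of getReservedContainer() in a local and null-check that local before touching any of its methods, rather than catching the NPE after the fact.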
[jira] [Updated] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11560: -- Labels: pull-request-available (was: ) > Fix NPE bug when multi-node enabled with schedule asynchronously > > > Key: YARN-11560 > URL: https://issues.apache.org/jira/browse/YARN-11560 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.3 >Reporter: wangzhongwei >Assignee: wangzhongwei >Priority: Blocker > Labels: pull-request-available > > when multiNodePlacementEnabled is set and the global scheduler is used, an NPE may happen when the > commit thread calls allocateFromReservedContainer with the param > reservedContainer, while the container may be unreserved by the judgment > thread in the tryCommit->apply function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883440#comment-17883440 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2364800576 > Hi @slfan1989, @zeekling, If there are no further concerns, could I kindly request an approval so we can merge this change? Thank you for the review. @arjunmohnot Thank you for your contributions! If there are no further comments by the end of this week, we will merge into the trunk branch. > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment{*}:{*} > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux){*}:{*} > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for locking mechanism. > * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM > node, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. 
Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not immediately reported until their corresponding NodeManagers (NMs) > send their first heartbeat. However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value > -1. > This design creates several challenges: > * {*}Untracked Nodemanagers{*}: During Resourcemanager HA failover or RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. While in HDFS > similar behaviour exists and is marked as {_}"DEAD"{_}. > * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in > Resourcemanager's state on the total no. of nodes. > * {*}Operational Impact{*}: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, or > {_}DECOMMISSIONED, etc{_}. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that are not registered and not > part of the exclude file by default at the RM startup or HA failover. This > can be done by marking the node with a special port value {_}-2{_}, signaling > that the node is considered LOST but has not yet been reported. 
Whenever a > heartbeat is received for that {color:#de350b}nodeID{color}, it will be > transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other > required desired state. > h3. Key implementation points > * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as {_}LOST{_}. This > can be achieved by modifying the _NodesListManager_ under the > {color:#de350b}refreshHostsReader{color} method, invoked during failover, or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their des
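The proposal above reduces to a small reconciliation step during refreshHostsReader: every host in the include file that is neither excluded nor present in the active node context is registered as LOST with the sentinel port -2, and a later heartbeat promotes it to its real state. A self-contained sketch of that logic follows; the class and method names are hypothetical and do not match the actual NodesListManager API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the reconciliation proposed for
// NodesListManager#refreshHostsReader in YARN-11730; not the Hadoop API.
class LostNodeSketch {
    static final int UNTRACKED_PORT = -2;  // sentinel: LOST, never reported

    enum NodeState { LOST, RUNNING }

    static class NodeEntry {
        NodeState state;
        int port;
        NodeEntry(NodeState state, int port) { this.state = state; this.port = port; }
    }

    final Map<String, NodeEntry> activeNodes = new HashMap<>();

    /** Called at RM startup, HA failover, or a manual node refresh. */
    void markUnreportedAsLost(Set<String> includeHosts, Set<String> excludeHosts) {
        for (String host : includeHosts) {
            // Included, not excluded, and absent from the active context:
            // surface it immediately as LOST with the untracked port.
            if (!excludeHosts.contains(host) && !activeNodes.containsKey(host)) {
                activeNodes.put(host, new NodeEntry(NodeState.LOST, UNTRACKED_PORT));
            }
        }
    }

    /** A heartbeat from a previously untracked host promotes it. */
    void onHeartbeat(String host, int realPort) {
        NodeEntry e = activeNodes.get(host);
        if (e != null && e.port == UNTRACKED_PORT) {
            e.state = NodeState.RUNNING;  // or UNHEALTHY etc., per the report
            e.port = realPort;
        }
    }
}
```

With this in place the node shows up in JMX as LOST from the moment of startup or failover, and the special port value makes untracked entries distinguishable from nodes that heartbeated and were later lost.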
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883353#comment-17883353 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2364153400 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 53s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 0s | | trunk passed | | +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 57s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 2s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 50s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 1s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 50s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 49s | | the patch passed | | +1 :green_heart: | compile | 0m 54s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 54s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 43s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 50s | | the patch passed | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 43s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 35s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 111m 22s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. 
| | | | 243m 9s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/7/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 7961632c7c16 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 8c46ee7bce648f5798cc45d950df423bb92a5122 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/7/testReport/ | | Max. process+thread count | 957 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/7/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883209#comment-17883209 ] ASF GitHub Bot commented on YARN-11708: --- susheelgupta7 commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1768322383 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java: ## @@ -44,8 +44,10 @@ import org.apache.hadoop.yarn.api.records.ResourceRequest; import org.apache.hadoop.yarn.api.records.SchedulingRequest; import org.apache.hadoop.yarn.event.EventHandler; +import org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext; import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer; import org.apache.hadoop.yarn.exceptions.YarnException; +import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue; Review Comment: Thanks for the review. Yes FairScheduler has its own FSQueue class, which is specific to its design. > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. 
> Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
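The bug pattern in the repro above is one of ordering: the lifetime check reads maximum-application-lifetime before the auto-created queue (with its template applied) exists, so the first application falls back to unlimited while every later one sees the template value. The sketch below contrasts the buggy and fixed orderings with hypothetical names; the actual patch routes this through queue creation from the placement context in the scheduler, not these classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the AQCv2 template-lifetime ordering bug in
// YARN-11708; names do not match the real CapacityScheduler API.
class AqcLifetimeSketch {
    static final long UNLIMITED = -1L;

    static class Queue {
        final long maxAppLifetimeSecs;
        Queue(long maxAppLifetimeSecs) { this.maxAppLifetimeSecs = maxAppLifetimeSecs; }
    }

    final Map<String, Queue> queues = new HashMap<>();
    final Map<String, Long> templateMaxLifetime = new HashMap<>();

    /** Buggy order: lifetime is read first, the queue is auto-created after. */
    long submitBuggy(String queueName) {
        Queue q = queues.get(queueName);               // null on first submit
        long lifetime = (q == null) ? UNLIMITED : q.maxAppLifetimeSecs;
        queues.computeIfAbsent(queueName, this::createFromTemplate);
        return lifetime;
    }

    /** Fixed order: create the queue from its template, then read it. */
    long submitFixed(String queueName) {
        Queue q = queues.computeIfAbsent(queueName, this::createFromTemplate);
        return q.maxAppLifetimeSecs;
    }

    private Queue createFromTemplate(String queueName) {
        return new Queue(templateMaxLifetime.getOrDefault(queueName, UNLIMITED));
    }
}
```

In the buggy ordering the first submission returns UNLIMITED (matching the "ExpiryTime : UNLIMITED" repro output) and only subsequent ones return 8; in the fixed ordering the template lifetime applies from the very first application.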
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883203#comment-17883203 ] ASF GitHub Bot commented on YARN-11708: --- brumi1024 commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1768296527 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java: ## @@ -44,8 +44,10 @@ import org.apache.hadoop.yarn.api.records.ResourceRequest; import org.apache.hadoop.yarn.api.records.SchedulingRequest; import org.apache.hadoop.yarn.event.EventHandler; +import org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext; import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer; import org.apache.hadoop.yarn.exceptions.YarnException; +import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue; Review Comment: YarnScheduler should not rely on one of its implementations' utility classes. ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java: ## @@ -1566,6 +1567,14 @@ public long checkAndGetApplicationLifetime(String queueName, long lifetime) { return lifetime; } + @Override + public CSQueue getOrCreateQueueFromPlacementContext(ApplicationId Review Comment: It shouldn't return a CSQueue, as Fair Scheduler or the Fifo Scheduler do not utilize Capacity Scheduler queues. 
> Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. > Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
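The review comments above make an interface-design point: the scheduler-agnostic YarnScheduler interface must not be typed against CSQueue, because FairScheduler and FifoScheduler have their own queue classes. A tiny illustration of the intended shape, with all names illustrative rather than the real YARN types:

```java
// Illustration of the review point on YARN-11708: the shared scheduler
// interface returns a common queue abstraction, never CSQueue.
// All names are illustrative stand-ins.
class QueueAbstractionSketch {

    interface Queue {                        // common abstraction
        long getMaxApplicationLifetime();
    }

    interface SchedulerLike {
        // Implementation-neutral return type: works equally for
        // Capacity, Fair and Fifo scheduler implementations.
        Queue getOrCreateQueue(String queuePath);
    }

    /** CapacityScheduler-flavoured queue stays an implementation detail. */
    static class CSQueueLike implements Queue {
        public long getMaxApplicationLifetime() { return 8L; }
    }

    static class CapacitySchedulerLike implements SchedulerLike {
        public Queue getOrCreateQueue(String queuePath) {
            return new CSQueueLike();        // callers only ever see Queue
        }
    }
}
```

Callers written against SchedulerLike never import a CapacityScheduler class, so the same lifetime logic can sit in the abstract scheduler without coupling it to one implementation.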
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883066#comment-17883066 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2361722415 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 54s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 1s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 14s | | trunk passed | | +1 :green_heart: | compile | 1m 4s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 57s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 58s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 2s | | trunk passed | | +1 :green_heart: | javadoc | 0m 58s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 50s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 3s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 49s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 43s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 53s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 31s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 109m 25s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 242m 40s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/6/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 767fb092bef3 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / b0274bdb9ef52b03c74c6b9c232854fb9a395ad9 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/6/testReport/ | | Max. process+thread count | 946 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/6/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882975#comment-17882975 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2360679473 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 4s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 50m 4s | | trunk passed | | +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 59s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 59s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 6s | | trunk passed | | +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 12s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 41s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 50s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 45s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/5/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 286 unchanged - 0 fixed = 289 total (was 286) | | +1 :green_heart: | mvnsite | 0m 53s | | the patch passed | | -1 :x: | javadoc | 0m 46s | [/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/5/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | -1 :x: | javadoc | 0m 42s | 
[/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/5/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 20s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 109m 43s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882783#comment-17882783 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2359151132 Hi @slfan1989, @zeekling, If there are no further concerns, could I kindly request an approval so we can merge this change? Thank you for the review. > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment{*}:{*} > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux){*}:{*} > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for locking mechanism. > * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM > node, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not immediately reported until their corresponding NodeManagers (NMs) > send their first heartbeat. 
However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value > -1. > This design creates several challenges: > * {*}Untracked Nodemanagers{*}: During Resourcemanager HA failover or RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. While in HDFS > similar behaviour exists and is marked as {_}"DEAD"{_}. > * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in > Resourcemanager's state on the total no. of nodes. > * {*}Operational Impact{*}: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, or > {_}DECOMMISSIONED, etc{_}. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that are not registered and not > part of the exclude file by default at the RM startup or HA failover. This > can be done by marking the node with a special port value {_}-2{_}, signaling > that the node is considered LOST but has not yet been reported. Whenever a > heartbeat is received for that {color:#de350b}nodeID{color}, it will be > transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other > required desired state. > h3. 
Key implementation points > * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as {_}LOST{_}. This > can be achieved by modifying the _NodesListManager_ under the > {color:#de350b}refreshHostsReader{color} method, invoked during failover, or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their desired state when the heartbeat is received. > * Handle Node Heartbeat and Transition: When a node sends its first > heartbeat, the system should ve
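The proposal above (mark include-file hosts with no registered NodeManager as LOST with sentinel port -2, then transition them on first heartbeat) can be sketched with minimal stand-in types; this is not the actual NodesListManager/refreshHostsReader code, just the state bookkeeping it describes:

```java
import java.util.*;

// Minimal stand-in sketch of the YARN-11730 proposal: on RM startup or a
// hosts refresh, any host in the include list that is neither registered
// nor excluded is recorded as LOST with the sentinel port -2; a later
// heartbeat moves it to RUNNING with its real port.
class UnregisteredNodeSketch {
    enum State { LOST, RUNNING }

    // host -> (port, state); port -2 marks an unreported/untracked node
    final Map<String, Map.Entry<Integer, State>> nodes = new HashMap<>();

    void refreshHosts(Set<String> include, Set<String> exclude,
                      Set<String> registered) {
        for (String host : include) {
            if (!registered.contains(host) && !exclude.contains(host)) {
                nodes.put(host, new AbstractMap.SimpleEntry<>(-2, State.LOST));
            }
        }
    }

    void onHeartbeat(String host, int port) {
        // First heartbeat from an untracked host: LOST -> RUNNING, real port.
        nodes.put(host, new AbstractMap.SimpleEntry<>(port, State.RUNNING));
    }

    State stateOf(String host) { return nodes.get(host).getValue(); }
    int portOf(String host)    { return nodes.get(host).getKey(); }
}
```

Excluded hosts stay out of this map (they are already shown as decommissioned with port -1), so the -2 sentinel only ever marks include-file hosts awaiting their first report.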
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882705#comment-17882705 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2358394779 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 59s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 24s | | trunk passed | | +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 57s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 3s | | trunk passed | | +1 :green_heart: | javadoc | 0m 58s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 51s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 2s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 25s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 49s | | the patch passed | | +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 55s | | the patch passed | | +1 :green_heart: | compile | 0m 49s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 49s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 44s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/4/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 286 unchanged - 0 fixed = 288 total (was 286) | | +1 :green_heart: | mvnsite | 0m 52s | | the patch passed | | -1 :x: | javadoc | 0m 45s | [/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/4/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | -1 :x: | javadoc | 0m 43s | 
[/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/4/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | +1 :green_heart: | spotbugs | 1m 57s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 110m 1s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882650#comment-17882650 ] ASF GitHub Bot commented on YARN-11708: --- K0K0V0K commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1764751050 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3366,14 +3367,47 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, @Override public long checkAndGetApplicationLifetime(String queueName, Review Comment: I think instead of add this create new queue logic to the method we could make the getOrCreateQueueFromPlacementContext public and call it before we call the checkAndGetApplicationLifetime method. If i see well this checkAndGetApplicationLifetime used just once in the code ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3366,14 +3367,47 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, @Override public long checkAndGetApplicationLifetime(String queueName, - long lifetimeRequestedByApp) { -readLock.lock(); + long lifetimeRequestedByApp, RMAppImpl app) { +CSQueue queue; + +writeLock.lock(); try { - CSQueue queue = getQueue(queueName); + queue = getQueue(queueName); + + // This handles the case where the first submitted app in aqc queue does not exist, + // addressing the issue related to YARN-11708. 
+ if (queue == null) { +queue = getOrCreateQueueFromPlacementContext(app.getApplicationId(), app.getUser(), + app.getQueue(), app.getApplicationPlacementContext(), false); + } + + if (queue == null) { +String message; +if (isAmbiguous(queueName)) { + message = "Application " + app.getApplicationId() Review Comment: I think here we have some code duplication. We can add the `"Application " + app.getApplicationId() + " submitted by user " + app.getUser()` part to the line 3385
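The reviewer's suggestion, resolve (or create) the queue before running the lifetime check rather than teaching checkAndGetApplicationLifetime to create queues, can be sketched with illustrative stand-in types (the method names mirror the discussion but this is not the CapacityScheduler API):

```java
import java.util.*;

// Hedged sketch of the "resolve queue first" refactor suggested in review.
// Stand-in types only; the real getOrCreateQueueFromPlacementContext and
// checkAndGetApplicationLifetime live in CapacityScheduler.
class QueueFirstSketch {
    final Map<String, Long> queueMaxLifetime = new HashMap<>();

    // Stand-in for getOrCreateQueueFromPlacementContext: creates the queue
    // from its template maximum if it does not exist yet.
    String getOrCreateQueue(String name, long templateMax) {
        queueMaxLifetime.putIfAbsent(name, templateMax);
        return name;
    }

    // Stand-in for checkAndGetApplicationLifetime: may assume the queue exists.
    long checkAndGetLifetime(String queue, long requested) {
        long max = queueMaxLifetime.get(queue);
        if (max <= 0) {
            return requested;
        }
        return (requested <= 0) ? max : Math.min(requested, max);
    }

    // The caller resolves the queue before the check, so even the first
    // submitted app sees the template maximum instead of unlimited lifetime.
    long submit(String queueName, long templateMax, long requested) {
        String q = getOrCreateQueue(queueName, templateMax);
        return checkAndGetLifetime(q, requested);
    }
}
```

This keeps the lifetime check free of queue-creation logic, which is attractive given that checkAndGetApplicationLifetime has a single call site.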
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882639#comment-17882639 ] ASF GitHub Bot commented on YARN-11708: --- K0K0V0K commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2357904421 Thanks for the update @susheelgupta7! May I ask you to fill in the PR description and "How was this patch tested?" parts?
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882577#comment-17882577 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357548110 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 1s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 22s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 22m 3s | | trunk passed | | +1 :green_heart: | compile | 4m 11s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 3m 53s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 58s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 41s | | trunk passed | | +1 :green_heart: | javadoc | 1m 46s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 35s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 32s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 22s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 9s | | the patch passed | | +1 :green_heart: | compile | 3m 52s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 3m 52s | | the patch passed | | +1 :green_heart: | compile | 3m 40s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 3m 40s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 27s | | the patch passed | | +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 38s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 55s | | the patch passed | | +1 :green_heart: | shadedclient | 23m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 0m 45s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 4m 43s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 89m 39s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. 
| | | | 215m 12s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/6/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7049 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 9fe524a117b6 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / d5174734cec6b8942857cca7dce4f2e87e3d9753 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/6/testReport/ | | Max. process+thread count | 918 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882576#comment-17882576 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357543733 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 35s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 21m 40s | | trunk passed | | +1 :green_heart: | compile | 3m 53s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 3m 48s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 0s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 47s | | trunk passed | | +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 31s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 43s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 7s | | the patch passed | | +1 :green_heart: | compile | 3m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 3m 53s | | the patch passed | | +1 :green_heart: | compile | 3m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 3m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 59s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 33s | | the patch passed | | +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 23s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 54s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 55s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 0m 40s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 4m 25s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 90m 18s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 214m 45s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7049 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 43b873ba1713 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 3b38551284356efd4c48d7ec2eeb8ff213b05ca2 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/5/testReport/ | | Max. process+thread count | 928 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882554#comment-17882554 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357315118 > Hi @slfan1989, the required changes have been made and CI checks passed; could you kindly review again, and possibly merge when you get a chance? Thank you for your time and support! LGTM.
> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
> Key: YARN-11730
> URL: https://issues.apache.org/jira/browse/YARN-11730
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, yarn
> Affects Versions: 3.4.0
> Environment: Tested on multiple environments:
> A. Docker Environment:
> * Base OS: Ubuntu 20.04
> * Java 8 installed from OpenJDK.
> * Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
> * Verified behavior using a Hadoop snapshot in a containerized environment.
> * Performed Namenode formatting and validated service interactions through exposed ports.
> * Repo reference: arjunmohnot/hadoop-yarn-docker (https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main)
> B. Bare-metal Distributed Setup (RedHat Linux):
> * Running Java 8 in a High-Availability (HA) configuration with Zookeeper for the locking mechanism.
> * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM nodes, including state retention and proper node state transitions.
> * Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
> Reporter: Arjun Mohnot
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the "include" file are not reported until their corresponding NodeManagers (NMs) send their first heartbeat. However, nodes in the "exclude" file are instantly reflected in the "Decommissioned Hosts" section with a port value of -1.
> This design creates several challenges:
> * Untracked NodeManagers: During ResourceManager HA failover or a standalone RM restart, some nodes may not report back even though they are listed in the "include" file. These nodes neither appear in the LOST state nor are they represented in the RM's JMX metrics. This leaves them in an untracked state, making it difficult to monitor their status. In HDFS, similar behaviour exists and such nodes are marked as "DEAD".
> * Monitoring Gaps: Nodes in the "include" file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, leaving the RM's view of the total number of nodes incomplete.
> * Operational Impact: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, and others where validation depends on nodes being reflected in JMX as LOST, UNHEALTHY, DECOMMISSIONED, etc. Nodes that don't report require hacky workarounds to determine their accurate status.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the LOST state to any node listed in the "include" file that is not registered and not part of the "exclude" file, by default at RM startup or HA failover. This can be done by marking the node with a special port value of -2, signaling that the node is considered LOST but has not yet reported. Whenever a heartbeat is received for that nodeID, it will be transitioned from LOST to RUNNING, UNHEALTHY, or any other required state.
> h3. Key implementation points
> * Mark Unreported Nodes as LOST: Nodes in the "include" file that are not part of the RM's active node context should be automatically marked as LOST. This can be achieved by modifying the NodesListManager under the refreshHostsReader method, invoked during failover or manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the LOST state, with port -2 indicating the node is untracked.
> * For non-HA setups, this process can be triggered during RM service startup to mark nodes as LOST initially; they will gradually transition to their desired state when the heartbeat is received.
> * Handle Node Heartbeat and Transition: When a node sends i
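The selection rule in the proposal quoted above (an include-file host that is neither excluded nor yet registered becomes LOST with placeholder port -2) can be sketched as a simple set computation. The class and method names below are hypothetical and do not correspond to the actual NodesListManager code:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper mirroring the proposed selection rule: a host in the
// "include" file that is neither excluded nor currently registered is treated
// as an unregistered LOST node (surfaced with the placeholder port -2).
class UnregisteredLostNodes {
  static final int UNREGISTERED_LOST_PORT = -2;

  static Set<String> find(Set<String> includeHosts,
                          Set<String> excludeHosts,
                          Set<String> registeredHosts) {
    Set<String> lost = new TreeSet<>(includeHosts);
    lost.removeAll(excludeHosts);    // excluded hosts show up as DECOMMISSIONED (port -1) instead
    lost.removeAll(registeredHosts); // hosts that already heartbeated are tracked normally
    return lost;
  }
}
```

Everything left in the returned set would be dispatched as a LOST event with port -2 at startup or failover, matching the behaviour described in the proposal.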
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882538#comment-17882538 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357185560

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 19s | | Docker mode activated. |
| _ Prechecks _ | | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| _ trunk Compile Tests _ | | | | |
| +0 :ok: | mvndep | 14m 55s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 20m 16s | | trunk passed |
| +1 :green_heart: | compile | 3m 39s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 3m 31s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 1m 3s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 58s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 58s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 55s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 52s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 37s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ | | | | |
| +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 14s | | the patch passed |
| +1 :green_heart: | compile | 3m 26s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 3m 26s | | the patch passed |
| +1 :green_heart: | compile | 3m 25s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 3m 25s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 54s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 47s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 42s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 4m 2s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 17s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ | | | | |
| +1 :green_heart: | unit | 0m 48s | | hadoop-yarn-api in the patch passed. |
| +1 :green_heart: | unit | 4m 42s | | hadoop-yarn-common in the patch passed. |
| +1 :green_heart: | unit | 90m 2s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 212m 20s | | |

| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7049 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux a65f08ea2011 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 3dac5d6df37371f533eee1c78e7fbc80593d9716 |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/4/testReport/ |
| Max. process+thread count | 931 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882513#comment-17882513 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1763997947

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:

@@ -1608,9 +1608,19 @@ protected void serviceStart() throws Exception {
     // Non HA case, start after RM services are started.
     if (!this.rmContext.isHAEnabled()) {
       transitionToActive();
+
+      // Refresh node state at the service startup to reflect the unregistered
+      // nodemanagers as LOST if the tracking for unregistered nodes flag is enabled.
+      // For HA setup, refreshNodes is already being called during the transition.
+      Configuration yarnConf = getConfig();
+      if (yarnConf.getBoolean(
+          YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+          YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+        this.rmContext.getNodesListManager().refreshNodes(yarnConf);

Review Comment: Hey @zeekling, thanks for your question! After reviewing potential edge cases and comparing the existing implementation, here's a summary of the different scenarios:

### Unregistered Lost Node Definition
- A node is marked as LOST when it is listed in the "include" file but not registered in the ResourceManager's node context, and is also not part of the "exclude" file during startup or HA failover. An unregistered lost node is indicated by a port value of -2.

### Case 1: Node Marked as LOST, Heartbeat Received
- When a node is marked as **LOST**, a lost event is dispatched, adding the node to the **active** and **inactive** node maps of the RMContext.
- If the node sends a heartbeat afterward, the transition method in `RMNodeImpl` checks for the same hostname with **port -2** (the LOST placeholder).
- If found, the nodeID is removed from the `rmContext` and re-registered with the NM's real port.
- It also decrements the LOST node counter and increments the ACTIVE node counter, ensuring clean state transitions.

### Case 2 (Rare Scenario): Race Condition
- A race condition may occur if the **ResourceTrackerService** starts before the RM processes the unregistered lost nodes, and a NodeManager (NM) sends its heartbeat quickly in parallel.
- Example:
  - When fetching nodes from `rmContext`, an NM (say **host1**) may not initially be present in the context.
  - Before this operation completes, **host1** may send a heartbeat and get registered with a valid port.
  - Meanwhile, the RM could still attempt to mark host1 as LOST with port -2, since it was not registered when the context was queried, resulting in two entries for the same host: one ACTIVE and one LOST.

### Details on Case 2
- For a node to register, the `ResourceTrackerService` must start during service startup. In HA mode, nodes only register once the RM becomes active.
- The current implementation for HA calls `refreshNodes` before the `transitionToActive` method, which rules out the race condition for HA setups since all unregistered nodes are dispatched first. For standalone RM setups there was a slight oversight; during testing, however, I did not encounter or replicate this issue, as the NM heartbeat can take time to arrive while the nodes were marked as LOST beforehand. With the recent changes, the `refreshNodes` operation now runs before the service starts, ensuring that unregistered nodes are consistently marked as LOST with **port -2** first. Only afterwards can NMs register themselves with a proper port during heartbeat reception, once the `ResourceTrackerService` starts its server (triggered during RM service start).

### Conclusion
The updated change should guarantee that nodes are properly marked as LOST **before** any heartbeats are processed. This eliminates the chance of falsely reporting nodes as LOST. I've validated this behavior through logs, which show that the lost event is dispatched first, and the heartbeats from NMs are received after service start. Let me know if this clears things up! 😊
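The Case-1 bookkeeping described above can be sketched as a small host-to-port map with -2 as the unregistered-LOST placeholder. `LostNodeTracker` and its fields are invented for illustration and do not correspond to the real `RMNodeImpl`/`RMContext` code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the LOST(-2) -> ACTIVE transition on heartbeat.
class LostNodeTracker {
  static final int UNREGISTERED_LOST_PORT = -2;

  // host -> port; port -2 marks an include-file host reported as LOST
  // before its NodeManager ever registered.
  final Map<String, Integer> nodes = new ConcurrentHashMap<>();
  int lostCount = 0;
  int activeCount = 0;

  // Startup/failover path: include-file host with no registration yet.
  void markUnregisteredLost(String host) {
    nodes.put(host, UNREGISTERED_LOST_PORT);
    lostCount++;
  }

  // Heartbeat path: swap the -2 placeholder for the NM's real port and
  // move the node from the LOST counter to the ACTIVE counter.
  void onHeartbeat(String host, int realPort) {
    Integer prev = nodes.put(host, realPort);
    if (prev == null || prev == UNREGISTERED_LOST_PORT) {
      if (prev != null) {
        lostCount--;   // leaving the placeholder LOST state
      }
      activeCount++;   // newly tracked as an active node
    }
    // repeated heartbeats from an already-active node change nothing
  }
}
```

A repeated heartbeat leaves both counters untouched, which matches the "clean state of transitions" the comment describes.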
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882422#comment-17882422 ] ASF GitHub Bot commented on YARN-11730: --- zeekling commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1763342594

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:

@@ -1608,9 +1608,19 @@ protected void serviceStart() throws Exception {
     // Non HA case, start after RM services are started.
     if (!this.rmContext.isHAEnabled()) {
       transitionToActive();
+
+      // Refresh node state at the service startup to reflect the unregistered
+      // nodemanagers as LOST if the tracking for unregistered nodes flag is enabled.
+      // For HA setup, refreshNodes is already being called during the transition.
+      Configuration yarnConf = getConfig();
+      if (yarnConf.getBoolean(
+          YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+          YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+        this.rmContext.getNodesListManager().refreshNodes(yarnConf);

Review Comment: When RM starts, is it possible that NM will falsely report the Lost state?
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882403#comment-17882403 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2355769123 Hi @slfan1989, the changes have been made and CI checks passed—could you kindly review when you get a chance? Thank you for your time and support! > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment{*}:{*} > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux){*}:{*} > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for locking mechanism. > * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM > node, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not immediately reported until their corresponding NodeManagers (NMs) > send their first heartbeat. 
However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value > -1. > This design creates several challenges: > * {*}Untracked NodeManagers{*}: During ResourceManager HA failover or RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. Similar > behaviour exists in HDFS, where such nodes are marked as {_}"DEAD"{_}. > * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in > the ResourceManager's view of the total number of nodes. > * {*}Operational Impact{*}: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, > {_}DECOMMISSIONED{_}, etc. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that is not registered and not > part of the exclude file, by default at RM startup or HA failover. This > can be done by marking the node with a special port value {_}-2{_}, signaling > that the node is considered LOST but has not yet been reported. Whenever a > heartbeat is received for that {color:#de350b}nodeID{color}, it will be > transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other > desired state. > h3.
Key implementation points > * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as {_}LOST{_}. This > can be achieved by modifying the _NodesListManager_ under the > {color:#de350b}refreshHostsReader{color} method, invoked during failover, or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their desired state when the heartbeat is received. > * Handle Node Heartbeat and Transition: When a node sends its first > heartbeat, the system should verify i
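The selection logic described above (include list minus registered nodes minus exclude list) can be sketched as a standalone Java snippet. Class and method names here are illustrative only, not from the actual patch; the real change lives in NodesListManager and dispatches an event per lost host rather than returning a list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LostNodeSketch {
  // A node qualifies as LOST-at-startup when it is in the include list,
  // has never registered with the RM, and is not in the exclude list.
  static List<String> nodesToMarkLost(Set<String> includes,
                                      Set<String> excludes,
                                      Set<String> registered) {
    List<String> lost = new ArrayList<>();
    for (String host : includes) {
      if (!registered.contains(host) && !excludes.contains(host)) {
        lost.add(host);
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    Set<String> includes = new HashSet<>(Arrays.asList("nm1", "nm2", "nm3"));
    Set<String> excludes = Collections.singleton("nm3");   // decommissioned
    Set<String> registered = Collections.singleton("nm1"); // has heartbeated
    // nm2 is included, unregistered, and not excluded -> mark it LOST
    System.out.println(nodesToMarkLost(includes, excludes, registered)); // prints [nm2]
  }
}
```

In the proposed patch this set difference is computed in refreshHostsReader and each resulting host is handed to the LOST-event dispatch path with port -2.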
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882399#comment-17882399 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2355749836 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 17s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 22s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 20m 25s | | trunk passed | | +1 :green_heart: | compile | 3m 43s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 3m 31s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 55s | | trunk passed | | +1 :green_heart: | javadoc | 2m 4s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 59s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 21m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 3m 30s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 3m 30s | | the patch passed | | +1 :green_heart: | compile | 3m 23s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 3m 23s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 56s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 37s | | the patch passed | | +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 38s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 25s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 0m 49s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 4m 40s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 89m 26s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 35s | | The patch does not generate ASF License warnings. 
| | | | 210m 46s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7049 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 67b9fa27fe59 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / ec01c95944ca730c822f92c5bd57b452b2addbf2 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/3/testReport/ | | Max. process+thread count | 944 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882379#comment-17882379 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 merged PR #7043: URL: https://github.com/apache/hadoop/pull/7043 > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster, as it is not capable of running any container and would just fail the jobs. > {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by
Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
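The fail-fast behaviour this issue requests can be illustrated with a minimal standalone probe. The class and helper names below are hypothetical; the merged change wires the equivalent check into the NodeManager lifecycle rather than a main method:

```java
import java.io.File;

public class ExecutorPreflight {
  // Hypothetical preflight probe: true only if the configured binary
  // exists and is executable by the NodeManager user.
  static boolean canRunExecutor(String path) {
    File exec = new File(path);
    return exec.isFile() && exec.canExecute();
  }

  public static void main(String[] args) {
    String path = "/var/lib/yarn-ce/bin/container-executor";
    if (!canRunExecutor(path)) {
      // The actual fix would trigger NM shutdown or self-blacklisting here,
      // instead of letting every localizer fail with "No such file or directory".
      System.out.println("container-executor unusable, refusing to serve containers: " + path);
    }
  }
}
```

The point of the check is timing: detecting the broken binary once, up front, is cheaper than discovering it via a -1 exit code on every startLocalizer call.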
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882380#comment-17882380 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 commented on PR #7043: URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2355587939 Thanks @slfan1989 for the review, merged to trunk. > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster, as it is not capable of running any container and would just fail the jobs.
> {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882366#comment-17882366 ] ASF GitHub Bot commented on YARN-11709: --- hadoop-yetus commented on PR #7043: URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2355397523 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 48s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 49m 19s | | trunk passed | | +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 40s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 46s | | trunk passed | | +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 41s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 29s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 5s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 35s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 29s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 101 unchanged - 1 fixed = 101 total (was 102) | | +1 :green_heart: | mvnsite | 0m 36s | | the patch passed | | +1 :green_heart: | javadoc | 0m 35s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 33s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 27s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 30s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 170m 14s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7043 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 6c4d375b47da 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / e26571b945692b692964e5c6be46f66bc43b2b60 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/4/testReport/ | | Max. process+thread count | 584 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/4/cons
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882338#comment-17882338 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1762881987 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: ## @@ -387,6 +388,115 @@ private void handleExcludeNodeList(boolean graceful, int timeout) { updateInactiveNodes(); } + /** + * Marks the unregistered nodes as LOST + * if the feature is enabled via a configuration flag. + * + * This method finds nodes that are present in the include list but are not + * registered with the ResourceManager. Such nodes are then marked as LOST. + * + * The steps are as follows: + * 1. Retrieve all hostnames of registered nodes from RM. + * 2. Identify the nodes present in the include list but are not registered + * 3. Remove nodes from the exclude list + * 4. Dispatch LOST events for filtered nodes to mark them as LOST. + * + * @param yarnConf Configuration object that holds the YARN configurations. + */ + private void markUnregisteredNodesAsLost(Configuration yarnConf) { +// Check if tracking unregistered nodes is enabled in the configuration +if (!yarnConf.getBoolean(YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES, +YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) { + LOG.debug("Unregistered node tracking is disabled. 
" + + "Skipping marking unregistered nodes as LOST."); + return; +} + +// Set to store all registered hostnames from both active and inactive lists +Set registeredHostNames = gatherRegisteredHostNames(); +// Event handler to dispatch LOST events +EventHandler eventHandler = this.rmContext.getDispatcher().getEventHandler(); + +// Identify nodes that are in the include list but are not registered +// and are not in the exclude list +List nodesToMarkLost = new ArrayList<>(); +HostDetails hostDetails = hostsReader.getHostDetails(); +Set includes = hostDetails.getIncludedHosts(); +Set excludes = hostDetails.getExcludedHosts(); + +for (String includedNode : includes) { + if (!registeredHostNames.contains(includedNode) && !excludes.contains(includedNode)) { +LOG.info("Lost node: " + includedNode); +nodesToMarkLost.add(includedNode); + } +} + +// Dispatch LOST events for the identified lost nodes +for (String lostNode : nodesToMarkLost) { + dispatchLostEvent(eventHandler, lostNode); +} + +// Log successful completion of marking unregistered nodes as LOST +LOG.info("Successfully marked unregistered nodes as LOST"); + } + + /** + * Gathers all registered hostnames from both active and inactive RMNodes. + * + * @return A set of registered hostnames. + */ + private Set gatherRegisteredHostNames() { +Set registeredHostNames = new HashSet<>(); +LOG.info("Getting all the registered hostnames"); + +// Gather all registered nodes (active) from RM into the set +for (RMNode node : this.rmContext.getRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +// Gather all inactive nodes from RM into the set +for (RMNode node : this.rmContext.getInactiveRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +return registeredHostNames; + } + + /** + * Dispatches a LOST event for a specified lost node. + * + * @param eventHandler The EventHandler used to dispatch the LOST event. 
+ * @param lostNode The hostname of the lost node for which the event is + * being dispatched. + */ + private void dispatchLostEvent(EventHandler eventHandler, String lostNode) { +// Generate a NodeId for the lost node with a special port -2 +NodeId nodeId = createLostNodeId(lostNode); +RMNodeEvent lostEvent = new RMNodeEvent(nodeId, RMNodeEventType.EXPIRE); +RMNodeImpl rmNode = new RMNodeImpl(nodeId, this.rmContext, lostNode, -2, -2, +new UnknownNode(lostNode), Resource.newInstance(0, 0), "unknown"); + +try { + // Dispatch the LOST event to signal the node is no longer active + eventHandler.handle(lostEvent); + + // After successful dispatch, update the node status in RMContext + // Set the node's timestamp for when it became untracked + rmNode.setUntrackedTimeStamp(Time.monotonicNow()); + + // Add the node to the active and inactive node maps in RMContext + this.rmContext.getRMNodes().put(nodeId, rmNode); + this.rmContext.getInactiveRMNodes().put(nodeId, rmNode); + + LOG.info("Successfully dispatched LOST event and deactivated node: " Review Comme
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882307#comment-17882307 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1762644412 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: ## @@ -387,6 +388,115 @@ private void handleExcludeNodeList(boolean graceful, int timeout) { updateInactiveNodes(); } + /** + * Marks the unregistered nodes as LOST + * if the feature is enabled via a configuration flag. + * + * This method finds nodes that are present in the include list but are not + * registered with the ResourceManager. Such nodes are then marked as LOST. + * + * The steps are as follows: + * 1. Retrieve all hostnames of registered nodes from RM. + * 2. Identify the nodes present in the include list but are not registered + * 3. Remove nodes from the exclude list + * 4. Dispatch LOST events for filtered nodes to mark them as LOST. + * + * @param yarnConf Configuration object that holds the YARN configurations. + */ + private void markUnregisteredNodesAsLost(Configuration yarnConf) { +// Check if tracking unregistered nodes is enabled in the configuration +if (!yarnConf.getBoolean(YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES, +YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) { + LOG.debug("Unregistered node tracking is disabled. 
" + + "Skipping marking unregistered nodes as LOST."); + return; +} + +// Set to store all registered hostnames from both active and inactive lists +Set registeredHostNames = gatherRegisteredHostNames(); +// Event handler to dispatch LOST events +EventHandler eventHandler = this.rmContext.getDispatcher().getEventHandler(); + +// Identify nodes that are in the include list but are not registered +// and are not in the exclude list +List nodesToMarkLost = new ArrayList<>(); +HostDetails hostDetails = hostsReader.getHostDetails(); +Set includes = hostDetails.getIncludedHosts(); +Set excludes = hostDetails.getExcludedHosts(); + +for (String includedNode : includes) { + if (!registeredHostNames.contains(includedNode) && !excludes.contains(includedNode)) { +LOG.info("Lost node: " + includedNode); +nodesToMarkLost.add(includedNode); + } +} + +// Dispatch LOST events for the identified lost nodes +for (String lostNode : nodesToMarkLost) { + dispatchLostEvent(eventHandler, lostNode); +} + +// Log successful completion of marking unregistered nodes as LOST +LOG.info("Successfully marked unregistered nodes as LOST"); + } + + /** + * Gathers all registered hostnames from both active and inactive RMNodes. + * + * @return A set of registered hostnames. + */ + private Set gatherRegisteredHostNames() { +Set registeredHostNames = new HashSet<>(); +LOG.info("Getting all the registered hostnames"); + +// Gather all registered nodes (active) from RM into the set +for (RMNode node : this.rmContext.getRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +// Gather all inactive nodes from RM into the set +for (RMNode node : this.rmContext.getInactiveRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +return registeredHostNames; + } + + /** + * Dispatches a LOST event for a specified lost node. + * + * @param eventHandler The EventHandler used to dispatch the LOST event. 
+ * @param lostNode The hostname of the lost node for which the event is + * being dispatched. + */ + private void dispatchLostEvent(EventHandler eventHandler, String lostNode) { +// Generate a NodeId for the lost node with a special port -2 +NodeId nodeId = createLostNodeId(lostNode); +RMNodeEvent lostEvent = new RMNodeEvent(nodeId, RMNodeEventType.EXPIRE); +RMNodeImpl rmNode = new RMNodeImpl(nodeId, this.rmContext, lostNode, -2, -2, +new UnknownNode(lostNode), Resource.newInstance(0, 0), "unknown"); + +try { + // Dispatch the LOST event to signal the node is no longer active + eventHandler.handle(lostEvent); + + // After successful dispatch, update the node status in RMContext + // Set the node's timestamp for when it became untracked + rmNode.setUntrackedTimeStamp(Time.monotonicNow()); + + // Add the node to the active and inactive node maps in RMContext + this.rmContext.getRMNodes().put(nodeId, rmNode); + this.rmContext.getInactiveRMNodes().put(nodeId, rmNode); + + LOG.info("Successfully dispatched LOST event and deactivated node: " Review Comme
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882306#comment-17882306 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2354879586 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 22m 55s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | -1 :x: | mvninstall | 2m 12s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. | | -1 :x: | compile | 0m 23s | [/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. 
| | -1 :x: | compile | 0m 24s | [/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. | | -0 :warning: | checkstyle | 0m 21s | [/buildtool-branch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/buildtool-branch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | The patch fails to run checkstyle in hadoop-yarn-server-resourcemanager | | -1 :x: | mvnsite | 0m 24s | [/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in trunk failed. | | -1 :x: | javadoc | 0m 23s | [/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. 
| | -1 :x: | javadoc | 0m 23s | [/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. | | -1 :x: | spotbugs | 0m 23s | [/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in trunk failed. | | +1 :green_heart: | shadedclient | 2m 45s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | -1 :x: | mvninstall | 0m 22s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-se
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882305#comment-17882305 ]

ASF GitHub Bot commented on YARN-11709:
---------------------------------------

slfan1989 commented on PR #7043:
URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2354877834

   LGTM +1.

> NodeManager should be shut down or blacklisted when it cannot run program
> "/var/lib/yarn-ce/bin/container-executor"
> -------------------------------------------------------------------------
>
>                 Key: YARN-11709
>                 URL: https://issues.apache.org/jira/browse/YARN-11709
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: container-executor
>            Reporter: Ferenc Erdelyi
>            Assignee: Benjamin Teke
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> When NodeManager encounters the below "No such file or directory" error
> reported against the "container-executor", it should give up participating
> in the cluster, as it is not capable of running any container and will
> otherwise just fail the jobs.
> {code:java}
> 2023-01-18 10:08:10,600 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_e159_1673543180101_9407_02_14 startLocalizer is : -1
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250)
> Caused by: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
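[Editor's note] The fail-fast behavior argued for in YARN-11709 can be sketched outside the NodeManager codebase. The helper below is hypothetical (the class, method names, and checks are illustrative, not the actual LinuxContainerExecutor API): it shows how a "Cannot run program ... No such file or directory" failure could be classified as fatal so the NodeManager shuts down instead of failing every job's containers one by one.

```java
import java.io.File;

// Hypothetical sketch of the fail-fast policy discussed in YARN-11709.
public class ContainerExecutorCheck {

  /** Returns true when the container-executor binary exists and is executable. */
  static boolean canRunContainerExecutor(String path) {
    File exe = new File(path);
    return exe.isFile() && exe.canExecute();
  }

  /**
   * Classifies an executor failure message. "error=2, No such file or
   * directory" means the binary itself is missing, which no per-container
   * retry can fix, so the NM should shut down (or be blacklisted) rather
   * than keep accepting containers.
   */
  static boolean isFatalExecutorError(String message) {
    return message != null
        && message.contains("Cannot run program")
        && message.contains("No such file or directory");
  }

  public static void main(String[] args) {
    String msg = "Cannot run program \"/var/lib/yarn-ce/bin/container-executor\""
        + ": error=2, No such file or directory";
    System.out.println("fatal=" + isFatalExecutorError(msg));
  }
}
```

A transient container failure (non-zero exit code) would not match this classification, which is the distinction the issue relies on: only environment-level errors justify taking the whole node out of service.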
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882304#comment-17882304 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

slfan1989 commented on code in PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#discussion_r1762632209


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java:
##########

@@ -387,6 +388,115 @@ private void handleExcludeNodeList(boolean graceful, int timeout) {
     updateInactiveNodes();
   }
 
+  /**
+   * Marks the unregistered nodes as LOST
+   * if the feature is enabled via a configuration flag.
+   *
+   * This method finds nodes that are present in the include list but are not
+   * registered with the ResourceManager. Such nodes are then marked as LOST.
+   *
+   * The steps are as follows:
+   * 1. Retrieve all hostnames of registered nodes from RM.
+   * 2. Identify the nodes present in the include list but are not registered
+   * 3. Remove nodes from the exclude list
+   * 4. Dispatch LOST events for filtered nodes to mark them as LOST.
+   *
+   * @param yarnConf Configuration object that holds the YARN configurations.
+   */
+  private void markUnregisteredNodesAsLost(Configuration yarnConf) {
+    // Check if tracking unregistered nodes is enabled in the configuration
+    if (!yarnConf.getBoolean(YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+        YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+      LOG.debug("Unregistered node tracking is disabled. " +
+          "Skipping marking unregistered nodes as LOST.");
+      return;
+    }
+
+    // Set to store all registered hostnames from both active and inactive lists
+    Set<String> registeredHostNames = gatherRegisteredHostNames();
+    // Event handler to dispatch LOST events
+    EventHandler<Event> eventHandler = this.rmContext.getDispatcher().getEventHandler();
+
+    // Identify nodes that are in the include list but are not registered
+    // and are not in the exclude list
+    List<String> nodesToMarkLost = new ArrayList<>();
+    HostDetails hostDetails = hostsReader.getHostDetails();
+    Set<String> includes = hostDetails.getIncludedHosts();
+    Set<String> excludes = hostDetails.getExcludedHosts();
+
+    for (String includedNode : includes) {
+      if (!registeredHostNames.contains(includedNode) && !excludes.contains(includedNode)) {
+        LOG.info("Lost node: " + includedNode);
+        nodesToMarkLost.add(includedNode);
+      }
+    }
+
+    // Dispatch LOST events for the identified lost nodes
+    for (String lostNode : nodesToMarkLost) {
+      dispatchLostEvent(eventHandler, lostNode);
+    }
+
+    // Log successful completion of marking unregistered nodes as LOST
+    LOG.info("Successfully marked unregistered nodes as LOST");
+  }
+
+  /**
+   * Gathers all registered hostnames from both active and inactive RMNodes.
+   *
+   * @return A set of registered hostnames.
+   */
+  private Set<String> gatherRegisteredHostNames() {
+    Set<String> registeredHostNames = new HashSet<>();
+    LOG.info("Getting all the registered hostnames");
+
+    // Gather all registered nodes (active) from RM into the set
+    for (RMNode node : this.rmContext.getRMNodes().values()) {
+      registeredHostNames.add(node.getHostName());
+    }
+
+    // Gather all inactive nodes from RM into the set
+    for (RMNode node : this.rmContext.getInactiveRMNodes().values()) {
+      registeredHostNames.add(node.getHostName());
+    }
+
+    return registeredHostNames;
+  }
+
+  /**
+   * Dispatches a LOST event for a specified lost node.
+   *
+   * @param eventHandler The EventHandler used to dispatch the LOST event.
+   * @param lostNode The hostname of the lost node for which the event is
+   *                 being dispatched.
+   */
+  private void dispatchLostEvent(EventHandler<Event> eventHandler, String lostNode) {
+    // Generate a NodeId for the lost node with a special port -2
+    NodeId nodeId = createLostNodeId(lostNode);
+    RMNodeEvent lostEvent = new RMNodeEvent(nodeId, RMNodeEventType.EXPIRE);
+    RMNodeImpl rmNode = new RMNodeImpl(nodeId, this.rmContext, lostNode, -2, -2,
+        new UnknownNode(lostNode), Resource.newInstance(0, 0), "unknown");
+
+    try {
+      // Dispatch the LOST event to signal the node is no longer active
+      eventHandler.handle(lostEvent);
+
+      // After successful dispatch, update the node status in RMContext
+      // Set the node's timestamp for when it became untracked
+      rmNode.setUntrackedTimeStamp(Time.monotonicNow());
+
+      // Add the node to the active and inactive node maps in RMContext
+      this.rmContext.getRMNodes().put(nodeId, rmNode);
+      this.rmContext.getInactiveRMNodes().put(nodeId, rmNode);
+
+      LOG.info("Successfully dispatched LOST event and deactivated node: "

Review Comment
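[Editor's note] The node-selection step in the diff above reduces to a set difference over the include, exclude, and registered host sets. A standalone sketch, with hypothetical class and method names rather than the actual NodesListManager code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch of the "which hosts should be marked LOST" selection:
// a host is a candidate when it is in the include list, not registered
// with the RM (neither active nor inactive), and not explicitly excluded.
public class LostNodeSelector {

  static List<String> computeNodesToMarkLost(Set<String> includes,
      Set<String> excludes, Set<String> registered) {
    List<String> lost = new ArrayList<>();
    for (String host : includes) {
      if (!registered.contains(host) && !excludes.contains(host)) {
        lost.add(host);
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    System.out.println(computeNodesToMarkLost(
        Set.of("nodeA", "nodeB", "nodeC"),  // include list
        Set.of("nodeC"),                    // exclude list
        Set.of("nodeA")));                  // registered with the RM
  }
}
```

Excluded hosts are skipped because they already surface as DECOMMISSIONED; only hosts that are expected (included) yet never heard from need the synthetic LOST marking.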
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882293#comment-17882293 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

hadoop-yetus commented on PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2354793253

   :confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 19s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 46s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 20m 2s | | trunk passed |
| +1 :green_heart: | compile | 3m 47s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 3m 29s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 1m 1s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 54s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 57s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 57s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 40s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 25s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 10s | | the patch passed |
| +1 :green_heart: | compile | 3m 23s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 3m 23s | | the patch passed |
| +1 :green_heart: | compile | 3m 25s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 3m 25s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 54s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 40s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 55s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 38s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 0m 46s | | hadoop-yarn-api in the patch passed. |
| +1 :green_heart: | unit | 4m 39s | | hadoop-yarn-common in the patch passed. |
| +1 :green_heart: | unit | 89m 44s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. |
| | | 212m 34s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7049 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux e05255413a5c 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 072bb20b1cbc1f06a5815aa338c4703cd390054c |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/2/testReport/ |
| Max. process+thread count | 926 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882208#comment-17882208 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

hadoop-yetus commented on PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2354217074

   :broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 6m 55s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 23s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 20m 14s | | trunk passed |
| +1 :green_heart: | compile | 3m 50s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 3m 24s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 1m 0s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 55s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 53s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 53s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 47s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 16s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 22s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 11s | | the patch passed |
| +1 :green_heart: | compile | 3m 25s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 3m 25s | | the patch passed |
| +1 :green_heart: | compile | 3m 24s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 3m 24s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 53s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt) | hadoop-yarn-project/hadoop-yarn: The patch generated 6 new + 248 unchanged - 0 fixed = 254 total (was 248) |
| +1 :green_heart: | mvnsite | 1m 43s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 39s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 49s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| -1 :x: | spotbugs | 1m 22s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/1/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | shadedclient | 21m 31s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 0m 44s | | hadoop-yarn-api in the patch passed. |
| +1 :green_heart: | unit | 4m 40s | | hadoop-yarn-common in the patch passed. |
| -1 :x: | unit | 89m 20s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/1/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. |
| | | 216m 55s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcema
[jira] [Updated] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-11730:
----------------------------------
    Labels: pull-request-available  (was: )

> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
>                 Key: YARN-11730
>                 URL: https://issues.apache.org/jira/browse/YARN-11730
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, yarn
>    Affects Versions: 3.4.0
>         Environment: Tested on multiple environments:
> A. Docker Environment{*}:{*}
> * Base OS: *Ubuntu 20.04*
> * *Java 8* installed from OpenJDK.
> * Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
> * Verified behavior using a Hadoop snapshot in a containerized environment.
> * Performed Namenode formatting and validated service interactions through exposed ports.
> * Repo reference: [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]
> B. Bare-metal Distributed Setup (RedHat Linux){*}:{*}
> * Running *Java 8* in a High-Availability (HA) configuration with *Zookeeper* for the locking mechanism.
> * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM nodes, including state retention and proper node state transitions.
> * Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
>            Reporter: Arjun Mohnot
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the _"include"_ file are not immediately reported until their corresponding NodeManagers (NMs) send their first heartbeat. However, nodes in the _"exclude"_ file are instantly reflected in the _"Decommissioned Hosts"_ section with a port value of -1.
> This design creates several challenges:
> * {*}Untracked Nodemanagers{*}: During ResourceManager HA failover or RM standalone restart, some nodes may not report back, even though they are listed in the _"include"_ file. These nodes neither appear in the _LOST_ state nor are they represented in the RM's JMX metrics. This results in an untracked state, making it difficult to monitor their status. A similar behaviour exists in HDFS, where such nodes are marked as {_}"DEAD"{_}.
> * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, leading to a lack of immediate visibility for these nodes in the ResourceManager's view of the total number of nodes.
> * {*}Operational Impact{*}: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, and others where validation depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, {_}DECOMMISSIONED{_}, etc. Nodes that don't report, however, require hacky workarounds to determine their accurate status.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the _LOST_ state to any node listed in the _"include"_ file by default at RM startup or HA failover. This can be done by marking the node with a special port value {_}-2{_}, signaling that the node is considered LOST but has not yet reported. Whenever a heartbeat is received for that {color:#de350b}nodeID{color}, it will be transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other desired state.
> h3. Key implementation points
> * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of the RM active node context should be automatically marked as {_}LOST{_}. This can be achieved by modifying the _NodesListManager_ under the {color:#de350b}refreshHostsReader{color} method, invoked during failover or manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating the node is untracked.
> * For non-HA setups, this process can be triggered during RM service startup to mark nodes as _LOST_ initially, and they will gradually transition to their desired state when the heartbeat is received.
> * Handle Node Heartbeat and Transition: When a node sends its first heartbeat, the system should verify if the node is listed in {color:#de350b}getInactiveRMNodes(){color}. If the node exists in the _LOST_ state, the RM should remove it from the inactive list, decrement the _LOST_ node count, and handle the transition back to the active node set.
> * This logic can be placed in the state transition method within {color:#de35
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882178#comment-17882178 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

arjunmohnot opened a new pull request, #7049:
URL: https://github.com/apache/hadoop/pull/7049

   ### Description of PR

   1. Overview

   When the ResourceManager starts, nodes listed in the "include" file are not immediately reported until their corresponding NodeManagers send their first heartbeat. However, nodes in the "exclude" file are instantly reflected in the "Decommissioned Hosts" section with a port value of -1.

   2. Challenges

   1. **Untracked NodeManagers**: During Resourcemanager HA failover or RM standalone restart, some nodes may not report back, even though they are listed in the _"include"_ file. These nodes neither appear in the _LOST_ state nor are they represented in the RM's JMX metrics. This results in an untracked state, making it difficult to monitor their status. A similar behaviour exists in HDFS, where such datanodes are marked as _"DEAD"_.
   2. **Monitoring Gaps**: Nodes in the "include" file are not visible until they send their first heartbeat, impacting real-time cluster monitoring when depending on the cluster metrics sink.
   3. **Operational Impact**: Unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, etc., requiring workarounds to determine accurate status for nodes that don't report.

   3. Proposed Solution

   To address these issues, the code automatically assigns the **_LOST_** state to nodes listed in the _"include"_ file that are not registered and not part of the exclude file at RM startup or during HA failover. This is indicated by a special port value of **-2**, marking the node as LOST but not yet reported. Once a heartbeat is received for that node, it will transition from LOST to RUNNING, UNHEALTHY, or any other desired state.

   4. Key Implementation Points

   1. **Mark Unreported Nodes as LOST**:
      - **Class Modified**: `NodesListManager`
      - **Method**: `refreshHostsReader`
      - **Functionality**:
        - Automatically marks nodes listed in the **"include"** file as **LOST** if they are not part of the RM active node context.
        - For non-HA setups, this process is triggered during **RM service startup**, ensuring unregistered nodes are initially set to **LOST**.
        - Port value **-2** indicates that the node is untracked.
   2. **Handle Node Heartbeat and Transition**:
      - **Class Modified**: `RMNodeImpl`
      - **Method**: State transition method
      - **Functionality**:
        - Upon receiving the first heartbeat from a node, the system checks if the node exists in the **LOST** state (if the nodeID has port -2 for that host) by verifying against `getInactiveRMNodes()`.
        - If the node is found in the **LOST** state:
          - Remove the node from the inactive node list.
          - Remove the node from the active node list to register it with a new nodeID having its required port.
          - Maintain the hostname in the RM context for proper host tracking.
          - Decrement the count of **LOST** nodes.
          - Re-register the node with the new nodeID and transition it back to the active node set, ensuring it recovers gracefully from the **LOST** state.
        - This logic ensures a smooth transition for nodes from **NEW** to **LOST** and back to active upon heartbeat reception.

   5. Flow Diagram

   ```yaml
   +--------------------------+
   | RM Startup / HA Failover |
   +--------------------------+
                |
                v
     Check Nodes in RM Context
                |
       +--------+---------------------------+
       |                                    |
   Not Registered & Not in         Registered or in
   Exclude File                    Exclude File
       |                                    |
       v                                    v
   Mark Node as LOST (port -2)     Node processed normally
       |
       v
   Wait for Heartbeat
       |
       v
   Receive Heartbeat
       |
       v
   Node State Check
       |
       +------------------+-----------------+
       |                                    |
   Previous NodeID Removed         Same Hostname With Port -2
       |                           Still Remains in the RM Context
       v                                    |
   [No Further Transition]                  |
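[Editor's note] The heartbeat-recovery step described in the PR above can be modeled with plain maps. This is a simplified, hypothetical sketch (real RM state lives in RMContext, RMNodeImpl, and NodeId objects), showing only the port -2 placeholder check and the LOST-counter decrement:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of "revive a placeholder LOST node on first heartbeat".
public class HeartbeatTransitionSketch {

  static final int UNTRACKED_PORT = -2; // placeholder port for unregistered hosts

  final Map<String, Integer> inactiveNodes = new HashMap<>(); // host -> port
  int numLostNodes;

  /** Returns true when the heartbeat revived a port -2 placeholder entry. */
  boolean onFirstHeartbeat(String host) {
    Integer port = inactiveNodes.get(host);
    if (port != null && port == UNTRACKED_PORT) {
      inactiveNodes.remove(host);  // drop the placeholder nodeID
      numLostNodes--;              // metrics stop counting it as LOST
      return true;                 // caller re-registers with the real port
    }
    return false;                  // unknown host, or a genuine LOST node
  }
}
```

A heartbeat from a host with no placeholder entry falls through unchanged, which matches the diagram's right-hand branch: already-registered or excluded hosts are processed normally.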
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882119#comment-17882119 ]

ASF GitHub Bot commented on YARN-11709:
---------------------------------------

hadoop-yetus commented on PR #7043:
URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2353412188

   :confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 48s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 49m 18s | | trunk passed |
| +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 40s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 46s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 28s | | trunk passed |
| +1 :green_heart: | shadedclient | 39m 48s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 1m 22s | | the patch passed |
| +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 1m 18s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 28s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 101 unchanged - 1 fixed = 101 total (was 102) |
| +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 27s | | the patch passed |
| +1 :green_heart: | shadedclient | 40m 7s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 24m 30s | | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 170m 0s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7043 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux dc7d10aaaf50 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / e2d0786c7c14222481f9995935fa8b6cb5bf5882 |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/2/testReport/ |
| Max. process+thread count | 533 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/2/cons
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882117#comment-17882117 ]

ASF GitHub Bot commented on YARN-11709:
---------------------------------------

hadoop-yetus commented on PR #7043:
URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2353393874

   :broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 11m 38s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 48m 31s | | trunk passed |
| +1 :green_heart: | compile | 1m 47s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 1m 44s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 44s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 54s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 57s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 45m 45s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| -1 :x: | mvninstall | 0m 35s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt) | hadoop-yarn-server-nodemanager in the patch failed. |
| -1 :x: | compile | 0m 24s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. |
| -1 :x: | javac | 0m 24s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. |
| -1 :x: | compile | 0m 23s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. |
| -1 :x: | javac | 0m 23s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 21s | [/buildtool-patch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/buildtool-patch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt) | The patc
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882095#comment-17882095 ] ASF GitHub Bot commented on YARN-11702: --- zeekling commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2353173990 The scheduler will not discard resource application requests, so why do we need to apply multiple times? > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 Spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear from the logs that the 12 extra allocated containers are being > ignored on the Spark side. In order to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement paths of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 De
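The over-allocation reported above can be illustrated with a minimal sketch of the window between scheduler-side decrements and the AM's next ask update. This is purely hypothetical code with illustrative names (`PendingAsk`, `tryAllocate`), not the actual `AppSchedulingInfo` logic: if the AM re-sends its stale total of 20 while 12 grants are already in flight, the pending count is reset upward and 32 containers end up allocated.

```java
// Hypothetical sketch of the over-allocation window. The scheduler thread
// decrements the pending count per grant, while the AM's next heartbeat
// overwrites it with a stale total computed before those grants were seen.
class PendingAsk {
    private int pending; // containers still owed to the AM

    synchronized void updateAsk(int total) { pending = total; } // AM re-sync
    synchronized boolean tryAllocate() {                        // scheduler grant
        if (pending <= 0) {
            return false;
        }
        pending--;
        return true;
    }
}

public class OverAllocationDemo {
    public static void main(String[] args) {
        PendingAsk ask = new PendingAsk();
        ask.updateAsk(20);                 // AM asks for 20 executors
        int allocated = 0;
        for (int i = 0; i < 12; i++) {     // scheduler grants 12 of them
            if (ask.tryAllocate()) {
                allocated++;
            }
        }
        // AM heartbeats before it has processed the 12 grants and re-sends the
        // stale total of 20 instead of the remaining 8:
        ask.updateAsk(20);
        while (ask.tryAllocate()) {
            allocated++;
        }
        System.out.println(allocated);     // 32 = 20 requested + 12 extra
    }
}
```

This matches the symptom in the report: 20 requested, 32 allocated, 12 released by Spark within seconds.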
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881674#comment-17881674 ] ASF GitHub Bot commented on YARN-11709: --- hadoop-yetus commented on PR #7043: URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2350004088 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 17m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 49m 42s | | trunk passed | | +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 41s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 45s | | trunk passed | | +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 28s | | trunk passed | | +1 :green_heart: | shadedclient | 39m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 35s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 18s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 28s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 2 new + 101 unchanged - 1 fixed = 103 total (was 102) | | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed | | -1 :x: | javadoc | 0m 35s | [/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/1/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 1 new + 195 unchanged - 0 fixed = 196 total (was 195) | | -1 :x: | javadoc | 0m 32s | 
[/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/1/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 1 new + 195 unchanged - 0 fixed = 196 total (was 195) | | +1 :green_heart: | spotbugs | 1m 29s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 16s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 31s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. | |
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881613#comment-17881613 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 opened a new pull request, #7043: URL: https://github.com/apache/hadoop/pull/7043 ### Description of PR The startLocalizer step didn't have the same error checks as the normal container launch. Updated the code to mark the NM unhealthy if the container-executor has config issues. ### How was this patch tested? ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster, as it is not capable of running any container and would otherwise just fail the jobs. 
> {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_ > 14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.j > ava:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
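The guard described in the PR can be sketched as a simple pre-flight check on the container-executor binary. The class and method names below are illustrative only, not the actual Hadoop NodeManager API; in the real patch, the result would feed into the NM's health status so the RM stops scheduling containers on the node.

```java
import java.io.File;

// Hedged sketch: detect a missing or non-executable container-executor and
// report it as a node-level fault instead of failing individual containers.
public class ExecutorCheck {
    // Returns null if the binary looks usable, otherwise a diagnostic string.
    static String checkBinary(String path) {
        File exec = new File(path);
        if (!exec.exists()) {
            return "container-executor not found at " + path;
        }
        if (!exec.canExecute()) {
            return "container-executor at " + path + " is not executable";
        }
        return null;
    }

    public static void main(String[] args) {
        String report = checkBinary("/var/lib/yarn-ce/bin/container-executor");
        if (report != null) {
            // In the real NodeManager this would mark the node unhealthy,
            // which is what the PR does for config issues in startLocalizer.
            System.out.println("UNHEALTHY: " + report);
        }
    }
}
```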
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881593#comment-17881593 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2349142064 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 56s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 32s | | trunk passed | | +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 55s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 1s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 51s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 2s | | trunk passed | | +1 :green_heart: | shadedclient | 36m 49s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 49s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 42s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/2/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 4 new + 96 unchanged - 0 fixed = 100 total (was 96) | | +1 :green_heart: | mvnsite | 0m 51s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 58s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 34s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 123m 15s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/2/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 257m 41s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.yarn.server.resourcemanager.TestApplicationACLs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAmbiguousLeafs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerApps | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spo
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881525#comment-17881525 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2348693336 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 17m 25s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 46s | | trunk passed | | +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 57s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 56s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 1s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 34s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 0m 52s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 52s | | the patch passed | | +1 :green_heart: | compile | 0m 49s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 49s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 43s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 4 new + 96 unchanged - 0 fixed = 100 total (was 96) | | +1 :green_heart: | mvnsite | 0m 51s | | the patch passed | | +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 48s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 122m 0s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/1/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. 
| | | | 271m 4s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.yarn.server.resourcemanager.TestApplicationACLs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAmbiguousLeafs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerApps | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spo
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881516#comment-17881516 ] ASF GitHub Bot commented on YARN-11708: --- susheel-gupta commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1758603557 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3364,12 +3364,29 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, } } - @Override + @Override public long checkAndGetApplicationLifetime(String queueName, long lifetimeRequestedByApp) { +CSQueue queue = getQueue(queueName); + +// This handles the case where the queue does not exist, +// addressing the issue related to YARN-11708. +if (queue == null) { + QueuePath queuePath = new QueuePath(queueName); + + writeLock.lock(); Review Comment: Thanks for reviewing, yes it should be. > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. 
> Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code}
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881508#comment-17881508 ] ASF GitHub Bot commented on YARN-11708: --- K0K0V0K commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1758562684 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3364,12 +3364,29 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, } } - @Override + @Override public long checkAndGetApplicationLifetime(String queueName, long lifetimeRequestedByApp) { +CSQueue queue = getQueue(queueName); + +// This handles the case where the queue does not exist, +// addressing the issue related to YARN-11708. +if (queue == null) { + QueuePath queuePath = new QueuePath(queueName); + + writeLock.lock(); + try { +queue = queueManager.createQueue(queuePath); + } catch (YarnException | IOException e) { +LOG.error("Failed to create queue '{}': ", queueName, e); Review Comment: ```suggestion LOG.error("Failed to create queue " + queueName, e); ``` Because the current log line won't log the exception, as it will think that it is the 2nd param in the message ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3364,12 +3364,29 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, } } - @Override + @Override public long checkAndGetApplicationLifetime(String queueName, long lifetimeRequestedByApp) { +CSQueue queue = getQueue(queueName); + +// This handles the case where the queue does not exist, +// addressing the issue related to YARN-11708. 
+if (queue == null) { + QueuePath queuePath = new QueuePath(queueName); + + writeLock.lock(); Review Comment: This write lock should be before the get queue call, to properly handle the race condition, right? > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. > Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code}
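The race the reviewer points at, looking up the queue outside the lock and then creating it under the lock, is usually handled with a check-lock-recheck pattern: re-reading under the write lock ensures that two threads that both saw null do not both create the queue. Below is a minimal standalone sketch of that pattern; `getOrCreate` and the plain `Object` queue are stand-ins, not the real `CSQueue`/`queueManager.createQueue` API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch of check-lock-recheck queue creation. Without the re-check
// under the write lock, two threads can both observe a missing queue and
// both attempt to create it.
public class QueueCreation {
    private final Map<String, Object> queues = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    Object getOrCreate(String name) {
        Object q = queues.get(name);
        if (q != null) {
            return q;                 // fast path: queue already exists
        }
        lock.writeLock().lock();
        try {
            q = queues.get(name);     // re-check: another thread may have won
            if (q == null) {
                q = new Object();     // stand-in for queueManager.createQueue(...)
                queues.put(name, q);
            }
            return q;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Every caller then observes a single queue instance per name, so template properties such as maximum-application-lifetime are applied consistently from the first lookup onward.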
[jira] [Updated] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11708: -- Labels: pull-request-available (was: ) > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. > Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code}
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881478#comment-17881478 ] ASF GitHub Bot commented on YARN-11708: --- susheelgupta7 opened a new pull request, #7041: URL: https://github.com/apache/hadoop/pull/7041 …s doesn't apply on the first submitted app ### Description of PR ### How was this patch tested? ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. 
> Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
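The repro above suggests the template lifetime is only read once the queue already exists, so the queue-creating application misses it. A stand-alone toy model of that ordering problem (nothing below is real CapacityScheduler code; the property map and method names are invented purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: the AQCv2 template's maximum-application-lifetime must be
// copied onto the queue *before* the first application's timeout is
// evaluated, otherwise that app sees the default (unlimited, -1) value.
public class TemplateLifetimeSketch {
  static final long UNLIMITED = -1L;

  // Buggy order: read the queue property before applying the template,
  // which is what the first (queue-creating) app effectively observes.
  static long resolveLifetime(Map<String, Long> queueProps, Map<String, Long> templateProps) {
    return queueProps.getOrDefault("maximum-application-lifetime", UNLIMITED);
  }

  // Fixed order: apply template values during auto-creation, then read.
  static long resolveLifetimeFixed(Map<String, Long> queueProps, Map<String, Long> templateProps) {
    templateProps.forEach(queueProps::putIfAbsent);
    return queueProps.getOrDefault("maximum-application-lifetime", UNLIMITED);
  }

  public static void main(String[] args) {
    Map<String, Long> template = new HashMap<>();
    template.put("maximum-application-lifetime", 8L);
    System.out.println(resolveLifetime(new HashMap<>(), template));      // -1 (first app unlimited)
    System.out.println(resolveLifetimeFixed(new HashMap<>(), template)); // 8
  }
}
```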
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880352#comment-17880352 ] ASF GitHub Bot commented on YARN-11664: --- steveloughran commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2338234175 removing the package-info class would be the simpler solution, but we need to understand how the regression got in. your PR seemed to take, but everything after it broke. > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. All test classes in Yarn module is not considered > > > A quick search in Yarn module revealed following HDFS dependencies > 1. Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. 
Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
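The last suggestion in the quoted description — catching a hadoop-common parent type rather than the HDFS-specific DSQuotaExceededException — can be demonstrated with stub classes that mirror the hierarchy (the classes below are stand-ins defined locally, not the real Hadoop classes):

```java
import java.io.IOException;

// Stub hierarchy: the HDFS-specific quota exception extends a generic
// capacity-exceeded exception that lives in hadoop-common.
class ClusterStorageCapacityExceededException extends IOException {
  ClusterStorageCapacityExceededException(String msg) { super(msg); }
}

class DSQuotaExceededException extends ClusterStorageCapacityExceededException {
  DSQuotaExceededException(String msg) { super(msg); }
}

public class CatchParentSketch {
  // By checking only the parent type, YARN code needs no HDFS classes on
  // its classpath, yet still recognizes the HDFS subclass when HDFS is
  // the backing filesystem.
  static String classify(IOException e) {
    if (e instanceof ClusterStorageCapacityExceededException) {
      return "capacity-exceeded";
    }
    return "other";
  }

  public static void main(String[] args) {
    System.out.println(classify(new DSQuotaExceededException("quota"))); // capacity-exceeded
    System.out.println(classify(new IOException("disk error")));         // other
  }
}
```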
[jira] [Commented] (YARN-11729) Broken 'AM Node Web UI' link on App details page
[ https://issues.apache.org/jira/browse/YARN-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880281#comment-17880281 ] ASF GitHub Bot commented on YARN-11729: --- K0K0V0K opened a new pull request, #7030: URL: https://github.com/apache/hadoop/pull/7030 ### Description of PR - the current link ends with a '/' - with this ending the RM won't open the link - to fix the issue we remove the last '/' char from the URL ### How was this patch tested? - manually ran the example job and clicked on the generated link ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Broken 'AM Node Web UI' link on App details page > > > Key: YARN-11729 > URL: https://issues.apache.org/jira/browse/YARN-11729 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > > h6. Description: > Generated 'AM Node Web UI' link cannot be interpreted by RM. > h6. Reproduction > - Run MapReduce pi example job > - Open the app details page > - Click on AM Node Web UI > - Page won't load > h6. Fix: > The problem is the URL finishes with a '/' so RM cannot open the node page.
> To fix this we should modify the UI code to generate the URL without the last > '/' char -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
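The described fix is a one-line URL normalization. The yarn-ui-v2 code is JavaScript, but the idea can be sketched in Java as follows (the helper name is illustrative, not the actual UI code):

```java
public class NodeLink {
  // Drop a single trailing '/' so the RM can resolve the node page;
  // URLs without a trailing slash are returned unchanged.
  static String stripTrailingSlash(String url) {
    if (url != null && url.endsWith("/")) {
      return url.substring(0, url.length() - 1);
    }
    return url;
  }

  public static void main(String[] args) {
    System.out.println(stripTrailingSlash("http://nm-host:8042/node/")); // http://nm-host:8042/node
  }
}
```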
[jira] [Updated] (YARN-11729) Broken 'AM Node Web UI' link on App details page
[ https://issues.apache.org/jira/browse/YARN-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11729: -- Labels: pull-request-available (was: ) > Broken 'AM Node Web UI' link on App details page > > > Key: YARN-11729 > URL: https://issues.apache.org/jira/browse/YARN-11729 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > > h6. Description: > Generated 'AM Node Web UI' link cannot be interpreted by RM. > h6. Reproduction > - Run MapReduce pi example job > - Open the app details page > - Click on AM Node Web UI > - Page won't load > h6. Fix: > The problem is the URL finishes with a '/' so RM cannot open the node page. > To fix this we should modify the UI code to generate the URL without the last > '/' char -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880194#comment-17880194 ] ASF GitHub Bot commented on YARN-11664: --- shameersss1 commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2337134702 > I think this is triggering a regression in enforcer > > ``` > [INFO] Adding ignore: * > [WARNING] Rule 1: org.apache.maven.plugins.enforcer.BanDuplicateClasses failed with message: > Duplicate classes found: > > Found in: > org.apache.hadoop:hadoop-client-minicluster:jar:3.5.0-SNAPSHOT:compile > org.apache.hadoop:hadoop-client-api:jar:3.5.0-SNAPSHOT:compile > Duplicate classes: > org/apache/hadoop/hdfs/protocol/datatransfer/package-info.class > ``` > > I'm going to revert the PR and we'll have to move that IOStreamPair class to a new package after all. pity Sure, @steveloughran - Instead of moving IOStreamPair to a new package , Can we ignore this specific `org/apache/hadoop/hdfs/protocol/datatransfer/package-info.class` class from BanDuplicateClasses enforcer? Anyhow package-info.class is not a critical class. > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. 
All test classes in the Yarn module are not considered > > > A quick search in the Yarn module revealed the following HDFS dependencies > 1. Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
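The alternative floated above — allowing this one duplicate instead of relocating the class — would be expressed in the enforcer rule configuration roughly as below. This is an illustrative fragment only: `BanDuplicateClasses` comes from the extra-enforcer-rules extension, and where exactly the rule is configured in the Hadoop build is not shown here.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <configuration>
    <rules>
      <banDuplicateClasses>
        <ignoreClasses>
          <!-- package-info carries no executable logic, so duplicating it
               across the client jars is harmless -->
          <ignoreClass>org.apache.hadoop.hdfs.protocol.datatransfer.package-info</ignoreClass>
        </ignoreClasses>
      </banDuplicateClasses>
    </rules>
  </configuration>
</plugin>
```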
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880158#comment-17880158 ] ASF GitHub Bot commented on YARN-11702: --- zeekling commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2336729314 Why are multiple requests for Containers sent? This is the key to the problem. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 SchedulerR
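The decrement trail above shows pending containers counted down one allocation at a time while the AM's ask arrives as a total. One way to picture the over-allocation is a pending-ask counter that has the total ask added to it again instead of being reconciled with what was already allocated. A purely illustrative toy model (this is not the real AppSchedulingInfo logic):

```java
public class PendingSketch {
  int pending = 0;
  int allocated = 0;

  // AM heartbeat carries the *total* ask; naively adding it again
  // after a partial allocation inflates pending.
  void naiveUpdate(int totalAsk) { pending += totalAsk; }

  // Reconciled update: pending is set to what is still outstanding.
  void fixedUpdate(int totalAsk) { pending = Math.max(0, totalAsk - allocated); }

  void allocate(int n) { allocated += n; pending -= n; }

  public static void main(String[] args) {
    PendingSketch s = new PendingSketch();
    s.naiveUpdate(20);  // AM wants 20 -> pending = 20
    s.allocate(8);      // 8 granted  -> pending = 12, allocated = 8
    s.naiveUpdate(20);  // same ask applied again -> pending = 32: over-allocation
    System.out.println(s.pending + " " + s.allocated); // 32 8
  }
}
```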
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880111#comment-17880111 ] ASF GitHub Bot commented on YARN-11709: --- zeekling commented on code in PR #6960: URL: https://github.com/apache/hadoop/pull/6960#discussion_r1749152353 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java: ## @@ -451,8 +451,10 @@ public void startLocalizer(LocalizerStartContext ctx) } catch (PrivilegedOperationException e) { int exitCode = e.getExitCode(); - LOG.warn("Exit code from container {} startLocalizer is : {}", - locId, exitCode, e); + LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent " + + "further containers to get scheduled on the node and cause application failures. " + + "Exit code from the container " + locId + "startLocalizer is : " + exitCode, e); + nmContext.getNodeStatusUpdater().reportException(e); Review Comment: ok > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster as it is not capable to run any container, but just fail the jobs. 
> {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_ > 14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.j > ava:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
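The change under review routes the unrecoverable error to the node health mechanism instead of only logging a WARN. A self-contained sketch of that pattern (the `NodeStatusUpdater` stub below is invented for illustration; the real interface belongs to the NodeManager):

```java
import java.io.IOException;

public class LocalizerFailureSketch {
  // Minimal stand-in for the NM's status updater: once an exception is
  // reported, the node advertises itself as unhealthy so the RM stops
  // scheduling containers on it.
  interface NodeStatusUpdater { void reportException(Exception e); }

  static class RecordingUpdater implements NodeStatusUpdater {
    boolean healthy = true;
    public void reportException(Exception e) { healthy = false; }
  }

  static void startLocalizer(NodeStatusUpdater updater, boolean executorMissing) {
    try {
      if (executorMissing) {
        throw new IOException(
            "Cannot run program \"/var/lib/yarn-ce/bin/container-executor\": error=2");
      }
      // ... normal localization would proceed here ...
    } catch (IOException e) {
      // Surface the failure instead of merely logging it, so the node
      // is marked unhealthy rather than failing job after job.
      updater.reportException(e);
    }
  }

  public static void main(String[] args) {
    RecordingUpdater u = new RecordingUpdater();
    startLocalizer(u, true);
    System.out.println(u.healthy); // false
  }
}
```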
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880012#comment-17880012 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 merged PR #7028: URL: https://github.com/apache/hadoop/pull/7028 > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster as it is not capable to run any container, but just fail the jobs. > {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_ > 14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.j > ava:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by 
Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879564#comment-17879564 ] ASF GitHub Bot commented on YARN-11664: --- steveloughran commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2331703195 I think this is triggering a regression in enforcer ``` [INFO] Adding ignore: * [WARNING] Rule 1: org.apache.maven.plugins.enforcer.BanDuplicateClasses failed with message: Duplicate classes found: Found in: org.apache.hadoop:hadoop-client-minicluster:jar:3.5.0-SNAPSHOT:compile org.apache.hadoop:hadoop-client-api:jar:3.5.0-SNAPSHOT:compile Duplicate classes: org/apache/hadoop/hdfs/protocol/datatransfer/package-info.class ``` I'm going to revert the PR and we'll have to move that IOStreamPair class to a new package after all. pity > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. All test classes in Yarn module is not considered > > > A quick search in Yarn module revealed following HDFS dependencies > 1. 
Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879211#comment-17879211 ] ASF GitHub Bot commented on YARN-11664: --- steveloughran commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2328853748 merged to trunk; will take a PR to branch-3.4 > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. All test classes in Yarn module is not considered > > > A quick search in Yarn module revealed following HDFS dependencies > 1. Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. 
Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879174#comment-17879174 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2328506556 @aajisaka Sorry I missed some messages. I will review this PR. Please give me 1-2 days. cc: @shameersss1 > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContaine
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879120#comment-17879120 ] ASF GitHub Bot commented on YARN-11702: --- aajisaka commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2328090305 Thank you @shameersss1. I'll merge this in this weekend if there's no objection. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 Schedule
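The increment/decrement bookkeeping visible in those AppSchedulingInfo log lines (pending incremented by 20 on the ask, decremented by 1 per allocation) can be modeled with a toy counter. This is a sketch only — all names are invented, it is not the Hadoop implementation, and it is not a claim about the actual root cause of YARN-11702 — but it illustrates how 32 allocations against a 20-container ask shows up in this kind of accounting:

```java
// Toy model of the pending-container bookkeeping seen in the
// AppSchedulingInfo logs above. All names are hypothetical; this is NOT
// Hadoop code and not an assertion about the actual YARN-11702 root cause.
public class PendingModel {

    private int pending;

    // "Updates PendingContainers: 0 Incremented by: 20"
    void request(int n) {
        pending += n;
    }

    // "Allocate Updates PendingContainers: N Decremented by: 1"
    // Guarded variant: refuses to allocate once nothing is pending.
    boolean tryAllocate() {
        if (pending <= 0) {
            return false;
        }
        pending--;
        return true;
    }

    public static void main(String[] args) {
        PendingModel app = new PendingModel();
        app.request(20);                    // the 20-executor ask
        int granted = 0;
        for (int i = 0; i < 32; i++) {      // 32 scheduling opportunities
            if (app.tryAllocate()) {
                granted++;
            }
        }
        // With the guard, only the 20 requested containers are granted;
        // skipping the pending check would hand out all 32.
        System.out.println("granted=" + granted);
    }
}
```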
[jira] [Updated] (YARN-6261) YARN queue mapping fails for users with no group
[ https://issues.apache.org/jira/browse/YARN-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-6261: - Labels: pull-request-available (was: ) > YARN queue mapping fails for users with no group > > > Key: YARN-6261 > URL: https://issues.apache.org/jira/browse/YARN-6261 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Pierre Villard >Assignee: Pierre Villard >Priority: Major > Labels: pull-request-available > > *Issue:* > Since Hadoop group mapping can be overridden (to get groups from an AD for > example), it is possible to be in a situation where a user does not have any > group (because the user is not in the AD but only defined locally): > {noformat} > $ hdfs groups zeppelin > zeppelin: > {noformat} > In this case, if the YARN Queue Mapping is configured and contains at least > one mapping of {{MappingType.GROUP}}, it won't be possible to get a queue for > the job submitted by such a user and the job won't be submitted at all. > *Expected result:* > In case a user does not have any group and no mapping is defined for this > user, the default queue should be assigned whatever the queue mapping > definition is. > *Workaround:* > A workaround is to define a group mapping of {{MappingType.USER}} for the > given user before defining any mapping of {{MappingType.GROUP}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
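The workaround described above — declaring a `u:` (user) mapping ahead of any `g:` (group) mapping — can be sketched as a capacity-scheduler.xml fragment. This is an illustration only: the `analysts` group and `etl` queue are hypothetical, and it relies on CapacityScheduler evaluating queue mappings in order with the first match winning, so the user entry is matched before a group lookup is ever attempted for that user:

```xml
<!-- Hypothetical capacity-scheduler.xml fragment. Mappings are evaluated
     left to right, first match wins: the u: entry routes the group-less
     user before the g: entry triggers a group lookup. -->
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <value>u:zeppelin:default,g:analysts:etl</value>
</property>
```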
[jira] [Commented] (YARN-6261) YARN queue mapping fails for users with no group
[ https://issues.apache.org/jira/browse/YARN-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878579#comment-17878579 ] ASF GitHub Bot commented on YARN-6261: -- pvillard31 closed pull request #198: YARN-6261 - Catch user with no group when getting queue from mapping … URL: https://github.com/apache/hadoop/pull/198 > YARN queue mapping fails for users with no group > > > Key: YARN-6261 > URL: https://issues.apache.org/jira/browse/YARN-6261 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Pierre Villard >Assignee: Pierre Villard >Priority: Major > > *Issue:* > Since Hadoop group mapping can be overridden (to get groups from an AD for > example), it is possible to be in a situation where a user does not have any > group (because the user is not in the AD but only defined locally): > {noformat} > $ hdfs groups zeppelin > zeppelin: > {noformat} > In this case, if the YARN Queue Mapping is configured and contains at least > one mapping of {{MappingType.GROUP}}, it won't be possible to get a queue for > the job submitted by such a user and the job won't be submitted at all. > *Expected result:* > In case a user does not have any group and no mapping is defined for this > user, the default queue should be assigned whatever the queue mapping > definition is. > *Workaround:* > A workaround is to define a group mapping of {{MappingType.USER}} for the > given user before defining any mapping of {{MappingType.GROUP}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877077#comment-17877077 ] ASF GitHub Bot commented on YARN-10345: --- brumi1024 commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312945414 Thanks @K0K0V0K for the update, merged to trunk. > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877076#comment-17877076 ] ASF GitHub Bot commented on YARN-10345: --- brumi1024 merged PR #7013: URL: https://github.com/apache/hadoop/pull/7013 > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877073#comment-17877073 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312931479 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 32s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 24s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 49s | | trunk passed | | +1 :green_heart: | compile | 1m 45s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 10s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 12s | | trunk passed | | +1 :green_heart: | javadoc | 1m 13s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 3s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 53s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 41s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 32s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 48s | | the patch passed | | +1 :green_heart: | compile | 1m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 36s | | the patch passed | | +1 :green_heart: | compile | 1m 31s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 31s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 0s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 55s | | the patch passed | | +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 33m 54s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 43s | | hadoop-mapreduce-client-app in the patch passed. | | +1 :green_heart: | unit | 4m 32s | | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. 
| | | | 152m 22s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 6fce02db40db 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 5406efead5107d238dec4a05cae63f7f1b38ca62 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/4/testReport/ | | Max. process+thread count | 742 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-clien
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877017#comment-17877017 ] ASF GitHub Bot commented on YARN-10345: --- brumi1024 commented on code in PR #7013: URL: https://github.com/apache/hadoop/pull/7013#discussion_r1732826984 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/AMWebServices.java: ## @@ -113,9 +113,17 @@ private void init() { response.setContentType(null); } - /** - * convert a job id string to an actual job and handle all the error checking. - */ + public static Job getJobFromContainerIdString(String cid, AppContext appCtx) + throws NotFoundException { +//example container_e06_1724414851587_0004_01_01 +String[] parts = cid.split("_"); +return getJobFromJobIdString("job_" + parts[2] + "_" + parts[3], appCtx); Review Comment: Nit: the string "job" and the separators could be replaced with the public static constants from the JobID class: https://github.com/apache/hadoop/blob/f3c3d9e0c6eae02dd21f875097ef76d85025ffe4/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobID.java#L51 > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. 
> *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
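The container-id-to-job-id derivation under review can be sketched as a standalone snippet. The `JOB`/`SEPARATOR` constants here merely mirror the ones the reviewer points at in `org.apache.hadoop.mapreduce.JobID`; they are redeclared locally so the sketch compiles without Hadoop on the classpath, and it assumes the epoch-prefixed container-id form (`container_eNN_...`) shown in the patch's example:

```java
// Standalone sketch of the parsing in getJobFromContainerIdString.
// JOB and SEPARATOR are local stand-ins for the JobID/ID constants the
// reviewer suggests using; this is not the actual patch code.
public class ContainerIdToJobId {

    static final String JOB = "job";   // mirrors JobID.JOB
    static final char SEPARATOR = '_'; // mirrors ID.SEPARATOR

    // e.g. container_e06_1724414851587_0004_01_000001
    //      parts[2] = cluster timestamp, parts[3] = job sequence number
    static String jobIdOf(String containerId) {
        String[] parts = containerId.split(String.valueOf(SEPARATOR));
        return JOB + SEPARATOR + parts[2] + SEPARATOR + parts[3];
    }

    public static void main(String[] args) {
        // -> job_1724414851587_0004
        System.out.println(jobIdOf("container_e06_1724414851587_0004_01_000001"));
    }
}
```

Note that the split indices assume the epoch segment (`e06`) is present; container ids are illustrative, not taken from a real cluster.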
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876995#comment-17876995 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312427765 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 35s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 36s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 38m 4s | | trunk passed | | +1 :green_heart: | compile | 1m 55s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 46s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 17s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 8s | | trunk passed | | +1 :green_heart: | javadoc | 1m 6s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 8s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 32s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 1m 43s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 43s | | the patch passed | | +1 :green_heart: | compile | 1m 36s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 36s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 7s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 54s | | the patch passed | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 5s | | the patch passed | | +1 :green_heart: | shadedclient | 36m 6s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 43s | | hadoop-mapreduce-client-app in the patch passed. | | +1 :green_heart: | unit | 4m 30s | | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 167m 7s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 000dd57f74d3 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2846e029c86045f586f67a581d8c1fadcae7093d | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/3/testReport/ | | Max. process+thread count | 745 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-clien
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876990#comment-17876990 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312389988 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 32s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 45s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 48s | | trunk passed | | +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 11s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 12s | | trunk passed | | +1 :green_heart: | javadoc | 1m 11s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 1s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 53s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 40s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 1m 34s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 34s | | the patch passed | | +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 32s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 54s | | the patch passed | | +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 57s | | the patch passed | | +1 :green_heart: | shadedclient | 33m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 43s | | hadoop-mapreduce-client-app in the patch passed. | | +1 :green_heart: | unit | 4m 30s | | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. 
| | | | 151m 31s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 0366c7518436 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2846e029c86045f586f67a581d8c1fadcae7093d | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/2/testReport/ | | Max. process+thread count | 712 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-clien
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876806#comment-17876806 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2310735488 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 11m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 42s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 29s | | trunk passed | | +1 :green_heart: | compile | 1m 42s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 12s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 10s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 4s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 54s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 59s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 1m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 36s | | the patch passed | | +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 28s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 0s | [/results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/1/artifact/out/results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client.txt) | hadoop-mapreduce-project/hadoop-mapreduce-client: The patch generated 2 new + 8 unchanged - 0 fixed = 10 total (was 8) | | +1 :green_heart: | mvnsite | 0m 54s | | the patch passed | | +1 :green_heart: | javadoc | 0m 48s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 1s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 42s | | hadoop-mapreduce-client-app in the patch passed. | | -1 :x: | unit | 4m 40s | [/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-hs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/1/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-hs.txt) | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 164m 9s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesLogs | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux ffd8f5dea4e1 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool |
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-10345: -- Labels: pull-request-available (was: ) > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876749#comment-17876749 ] ASF GitHub Bot commented on YARN-10345: --- K0K0V0K opened a new pull request, #7013: URL: https://github.com/apache/hadoop/pull/7013 ### Description of PR - the following REST APIs did not have authorization - - /ws/v1/history/containerlogs/{containerid}/(unknown) - - /ws/v1/history/containers/{containerid}/logs - after this fix they have ACL authorization ### How was this patch tested? Setup: - mapreduce.cluster.acls.enabled = true on history server - submit example pi job with user1 (called job1) - - pi -Dmapreduce.job.queuename=root.default -Dmapreduce.job.acl-view-job=* 10 100 - submit example pi job with user1 (called job2) - - pi -Dmapreduce.job.queuename=root.default -Dmapreduce.job.acl-view-job=' ' 10 100 Before this commit - another user (aka user2) could see job1 and job2 logs via the problematic APIs After this commit - user2 cannot see job2 logs ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? 
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876168#comment-17876168 ]

ASF GitHub Bot commented on YARN-11664:
---------------------------------------

shameersss1 commented on PR #6631:
URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2306276633

@steveloughran - Javadoc is fixed. The UT failures, Spotbugs and ASF license warnings seem unrelated.

> Remove HDFS Binaries/Jars Dependency From YARN
> ----------------------------------------------
>
>                 Key: YARN-11664
>                 URL: https://issues.apache.org/jira/browse/YARN-11664
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> In principle Hadoop YARN is independent of HDFS; it can work with any
> filesystem. Currently there is some code dependency on HDFS in YARN. This
> dependency requires YARN to bring some HDFS binaries/jars onto its class
> path. The idea behind this jira is to remove this dependency so that YARN
> can run without HDFS binaries/jars.
>
> *Scope*
> 1. Non-test classes are considered.
> 2. Some test classes that come in as transitive dependencies are considered.
>
> *Out of scope*
> 1. Other test classes in the YARN module are not considered.
>
> A quick search in the YARN module revealed the following HDFS dependencies:
>
> 1. Constants
> {code:java}
> import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier;
> import org.apache.hadoop.hdfs.DFSConfigKeys;
> {code}
>
> 2. Exception
> {code:java}
> import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;
> {code}
>
> 3. Utility
> {code:java}
> import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;
> {code}
>
> Both YARN and HDFS depend on the *hadoop-common* module, so:
> * Constants and utility classes can be moved to *hadoop-common*.
> * Instead of DSQuotaExceededException, use the parent exception
> ClusterStorageCapacityExceededException.
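The exception swap proposed in the last bullet works because a Java catch clause also matches subclasses: code that catches only the parent type never needs the subclass's module on its class path. Below is a toy model of that decoupling; the two exception classes are local stand-ins that merely mirror the names of hadoop-common's ClusterStorageCapacityExceededException and HDFS's DSQuotaExceededException, and `QuotaAwareUploader` is a hypothetical caller.

```java
// Toy stand-in for the parent exception that lives in hadoop-common.
class ClusterStorageCapacityExceededException extends RuntimeException {
  ClusterStorageCapacityExceededException(String msg) { super(msg); }
}

// Toy stand-in for the HDFS-specific subclass; with the proposed change,
// YARN code no longer references this type directly.
class DSQuotaExceededException extends ClusterStorageCapacityExceededException {
  DSQuotaExceededException(String msg) { super(msg); }
}

class QuotaAwareUploader {
  // "YARN-side" code: it catches only the parent type, so it compiles and
  // runs without any reference to the HDFS-specific subclass.
  static String upload(Runnable write) {
    try {
      write.run();
      return "ok";
    } catch (ClusterStorageCapacityExceededException e) {
      // A DSQuotaExceededException thrown by an HDFS-backed FileSystem
      // still lands here, because the catch matches subclasses too.
      return "quota-exceeded: " + e.getMessage();
    }
  }
}
```

The same pattern applies to the constants and utility classes: once they move to hadoop-common, YARN's imports resolve against hadoop-common alone and the HDFS jars become unnecessary at runtime.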
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876134#comment-17876134 ]

ASF GitHub Bot commented on YARN-11664:
---------------------------------------

hadoop-yetus commented on PR #6631:
URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2305875091

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 31s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 30s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 32m 39s | | trunk passed |
| +1 :green_heart: | compile | 17m 40s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 16m 9s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 4m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 6m 24s | | trunk passed |
| +1 :green_heart: | javadoc | 5m 14s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 5m 39s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| -1 :x: | spotbugs | 1m 13s | [/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core-warnings.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core in trunk has 1 extant spotbugs warnings. |
| +1 :green_heart: | shadedclient | 34m 23s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 4m 5s | | the patch passed |
| +1 :green_heart: | compile | 16m 56s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 16m 56s | | the patch passed |
| +1 :green_heart: | compile | 16m 12s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 16m 12s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 4m 23s | | the patch passed |
| +1 :green_heart: | mvnsite | 6m 24s | | the patch passed |
| +1 :green_heart: | javadoc | 5m 15s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 5m 31s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 12m 49s | | the patch passed |
| -1 :x: | shadedclient | 34m 44s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 19m 38s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 2m 48s | | hadoop-hdfs-client in the patch passed. |
| -1 :x: | unit | 120m 29s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | unit | 3m 31s | | hadoop-yarn-common in the patch passed. |
| -1 :x: | unit | 0m 50s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt) | hadoop-yarn-services-core in the patch failed. |
| -1 :x: | asflicense | 1m 8s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/results-asflicense.txt) | The patch generated 149 ASF License warnings. |