[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885650#comment-17885650 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on code in PR #7065: URL: https://github.com/apache/hadoop/pull/7065#discussion_r1779890901

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java:
## @@ -757,10 +757,9 @@ private void completeOustandingUpdatesWhichAreReserved(
       RMContainer rmContainer, ContainerStatus containerStatus,
       RMContainerEventType event) {
     N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
-    if (schedulerNode != null &&
-        schedulerNode.getReservedContainer() != null) {
+    if (schedulerNode != null) {
       RMContainer resContainer = schedulerNode.getReservedContainer();
-      if (resContainer.getReservedSchedulerKey() != null) {
+      if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {

Review Comment: Thanks @zeekling for the review. I'm not sure why your environment still reported an NPE after a change like that: `resContainer.getReservedSchedulerKey()` can no longer throw an NPE, because the preceding not-null check short-circuits when `resContainer` is null. Could you please attach some details of the NPE and your changes? For this change, I think NPEs should be fixed rather than caught, in general.
> Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
> --
>
> Key: YARN-11732
> URL: https://issues.apache.org/jira/browse/YARN-11732
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.4.0, 3.2.4, 3.3.6, 3.5.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Labels: pull-request-available
>
> I found some places that call *SchedulerNode#getReservedContainer* to get the reservedContainer (returned value) but do not perform a sanity (not-null) check before calling its internal methods, which risks raising a NullPointerException if it is null.
> Most of these places rely on the premise that the node had a reserved container a few moments ago, but the next call to *SchedulerNode#getReservedContainer* may return null, since the reservedContainer can be set to null concurrently by the scheduling and monitoring (preemption) threads. So a not-null check should be done before calling internal methods of the reservedContainer.
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
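The race described above can be sketched with toy stand-ins (`MiniSchedulerNode` and `MiniRMContainer` are hypothetical names for illustration, not the real scheduler classes): reading the reservation twice is racy, because a preemption thread may clear it between the two reads, while snapshotting it once into a local variable makes the not-null check reliable.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for RMContainer, reduced to an id.
class MiniRMContainer {
    private final String containerId;
    MiniRMContainer(String id) { this.containerId = id; }
    String getContainerId() { return containerId; }
}

// Hypothetical stand-in for SchedulerNode: the reserved container can be
// cleared concurrently by a monitoring (preemption) thread.
class MiniSchedulerNode {
    private final AtomicReference<MiniRMContainer> reserved = new AtomicReference<>();

    MiniRMContainer getReservedContainer() { return reserved.get(); }
    void setReservedContainer(MiniRMContainer c) { reserved.set(c); }

    // UNSAFE: two reads of the shared reference; the second read may observe
    // null even though the first one did not, raising an NPE under contention.
    String describeUnsafe() {
        if (getReservedContainer() != null) {
            return "reserved=" + getReservedContainer().getContainerId();
        }
        return "none";
    }

    // SAFE: snapshot the reference once into a local, then null-check the
    // snapshot; the local cannot change underneath us.
    String describeSafe() {
        MiniRMContainer snapshot = getReservedContainer();
        if (snapshot != null) {
            return "reserved=" + snapshot.getContainerId();
        }
        return "none";
    }
}
```

The single-threaded behavior of both methods is identical; the difference only shows up under concurrent clearing, which is why the snapshot variant is the safer default.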
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885649#comment-17885649 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on code in PR #7065: URL: https://github.com/apache/hadoop/pull/7065#discussion_r1779890901

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java:
## @@ -757,10 +757,9 @@ private void completeOustandingUpdatesWhichAreReserved(
       RMContainer rmContainer, ContainerStatus containerStatus,
       RMContainerEventType event) {
     N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
-    if (schedulerNode != null &&
-        schedulerNode.getReservedContainer() != null) {
+    if (schedulerNode != null) {
       RMContainer resContainer = schedulerNode.getReservedContainer();
-      if (resContainer.getReservedSchedulerKey() != null) {
+      if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {

Review Comment: Thanks @zeekling for the review. I'm not sure why your environment still reported an NPE after a change like that: `resContainer.getReservedSchedulerKey()` can no longer throw an NPE when `resContainer` is null. Could you please attach some details of the NPE and your changes? For this change, I think NPEs should be fixed rather than caught, in general.
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885644#comment-17885644 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on PR #7065: URL: https://github.com/apache/hadoop/pull/7065#issuecomment-2381073760

@Hexiaoqiao Thanks for the review. I considered adding test cases, but all of these changes address race conditions confined to a single method (private visibility, or entirely local to the method), such as:

`if (node.getReservedContainer() != null) { LOG.info("... container=" + node.getReservedContainer().getContainerId()); }`

It's hard to reproduce the NPE deterministically in test cases.
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885593#comment-17885593 ] ASF GitHub Bot commented on YARN-11732: --- zeekling commented on code in PR #7065: URL: https://github.com/apache/hadoop/pull/7065#discussion_r1779503372

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java:
## @@ -757,10 +757,9 @@ private void completeOustandingUpdatesWhichAreReserved(
       RMContainer rmContainer, ContainerStatus containerStatus,
       RMContainerEventType event) {
     N schedulerNode = getSchedulerNode(rmContainer.getNodeId());
-    if (schedulerNode != null &&
-        schedulerNode.getReservedContainer() != null) {
+    if (schedulerNode != null) {
       RMContainer resContainer = schedulerNode.getReservedContainer();
-      if (resContainer.getReservedSchedulerKey() != null) {
+      if (resContainer != null && resContainer.getReservedSchedulerKey() != null) {

Review Comment: I recommend adding a try/catch block instead. I made this change in a production environment before, but it still reported a NullPointerException.
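To illustrate the trade-off discussed in this review thread with toy code (`ReservationReader` is a hypothetical class, not the actual scheduler API): catching the NullPointerException silences the symptom but also swallows NPEs from any unrelated bug in the guarded block, whereas reading the shared reference once into a local removes the race at its source.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy holder for a reservation id that another thread may clear at any time.
class ReservationReader {
    private final AtomicReference<String> reservedId = new AtomicReference<>();

    void reserve(String id) { reservedId.set(id); }
    void unreserve() { reservedId.set(null); }

    // Anti-pattern: relies on catching the NPE; this also hides NPEs caused
    // by genuinely broken code elsewhere in the try block.
    String readWithCatch() {
        try {
            return "reserved=" + reservedId.get().toString();
        } catch (NullPointerException e) {
            return "none";
        }
    }

    // Preferred: read the shared reference exactly once, then null-check the
    // local snapshot; no window remains for a concurrent clear to cause an NPE.
    String readWithSnapshot() {
        String snapshot = reservedId.get();
        return snapshot != null ? "reserved=" + snapshot : "none";
    }
}
```

This is why the fix in the PR hoists `getReservedContainer()` into a local `resContainer` and null-checks that, rather than wrapping the call in a try/catch.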
[jira] [Commented] (YARN-11719) The job is stuck in the new state.
[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885577#comment-17885577 ] ASF GitHub Bot commented on YARN-11719: --- hadoop-yetus commented on PR #7077: URL: https://github.com/apache/hadoop/pull/7077#issuecomment-2380640345

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 16m 57s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 44m 48s | | trunk passed |
| +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 56s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 2s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 2m 0s | | trunk passed |
| +1 :green_heart: | shadedclient | 36m 26s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 50s | | the patch passed |
| +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 0m 55s | | the patch passed |
| +1 :green_heart: | compile | 0m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 0m 48s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 41s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7077/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 32 unchanged - 0 fixed = 33 total (was 32) |
| +1 :green_heart: | mvnsite | 0m 50s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 2m 0s | | the patch passed |
| +1 :green_heart: | shadedclient | 35m 39s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 109m 46s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. |
| | | 259m 19s | | |

| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7077/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7077 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c52bb8302dea 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 79eb1bd99287b784ec6b7cc44cf9fa22c1cea2bb |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Result
[jira] [Updated] (YARN-11719) The job is stuck in the new state.
[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11719: -- Labels: pull-request-available (was: )

> The job is stuck in the new state.
> --
>
> Key: YARN-11719
> URL: https://issues.apache.org/jira/browse/YARN-11719
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.1
> Reporter: zeekling
> Priority: Major
> Labels: pull-request-available
>
> After I restarted the Router in the production environment, several jobs remained in the NEW state, and I found the related logs:
>
> {code:java}
> 2024-08-30 00:12:41,380 | WARN | DelegationTokenRenewer #667 | Unable to add the application to the delegation token renewer. | DelegationTokenRenewer.java:1215
> java.io.IOException: Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nsfed, Ident: (token for admintest: HDFS_DELEGATION_TOKEN owner=admintest@9FCE074E_691F_480F_98F5_58C1CA310829.COM, renewer=mapred, realUser=, issueDate=1724947875776, maxDate=1725552675776, sequenceNumber=156, masterKeyId=116)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:641)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.access$2200(DelegationTokenRenewer.java:86)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:1211)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:1188)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.InterruptedIOException: Retry interrupted
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:141)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:112)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366)
> at com.sun.proxy.$Proxy96.renewDelegationToken(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient$Renewer.renew(DFSClient.java:849)
> at org.apache.hadoop.security.token.Token.renew(Token.java:498)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:771)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$1.run(DelegationTokenRenewer.java:768)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1890)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.renewToken(DelegationTokenRenewer.java:767)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.handleAppSubmitEvent(DelegationTokenRenewer.java:627)
> ... 8 more
> Caused by: java.lang.InterruptedException: sleep interrupted
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:135)
> ... 20 more
> 2024-08-30 00:12:41,380 | WARN | DelegationTokenRenewer #667 | AsyncDispatcher thread interrupted | AsyncDispatcher.java:437
> java.lang.InterruptedException
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1233)
> at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:434)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.handleDTRenewerAppSubmitEvent(DelegationTokenRenewer.java:1221)
> at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer$DelegationTokenRenewerRunnable.run(DelegationTokenRenewer.java:1188)
>
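The stack trace above shows the renewal retry loop dying on an `InterruptedIOException` ("Retry interrupted"), after which the app's submit event is never re-driven and the job stays in NEW. A minimal sketch of the underlying pattern (class and method names here are illustrative, not the actual `DelegationTokenRenewer` API): a retrying renew attempt that distinguishes an interrupt from a genuine renewal failure, restoring the interrupt flag so the caller can decide to re-process the submission rather than silently dropping it.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.Callable;

// Illustrative retry helper (hypothetical, not Hadoop's RetryInvocationHandler):
// transient IOExceptions are retried with a fixed backoff, while an interrupt
// is surfaced as InterruptedException with the thread's interrupt flag restored.
class RenewAttempt {
    static <T> T renewWithRetries(Callable<T> renew, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return renew.call();
            } catch (InterruptedIOException e) {
                // Do not swallow the interrupt: restore the flag and let the
                // caller decide whether to re-drive the whole submission.
                Thread.currentThread().interrupt();
                throw new InterruptedException("token renewal interrupted");
            } catch (IOException e) {
                last = e;               // transient failure: back off and retry
                Thread.sleep(backoffMs);
            } catch (Exception e) {
                throw new IOException("token renewal failed", e);
            }
        }
        if (last == null) {
            throw new IOException("no renewal attempts were made");
        }
        throw last;
    }
}
```

The key point for this bug report is the interrupt branch: if the interrupt is converted into a generic failure (or caught and ignored), the submit event that should move the app out of NEW is lost.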
[jira] [Commented] (YARN-11719) The job is stuck in the new state.
[ https://issues.apache.org/jira/browse/YARN-11719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885553#comment-17885553 ] ASF GitHub Bot commented on YARN-11719: --- zeekling opened a new pull request, #7077: URL: https://github.com/apache/hadoop/pull/7077

### Description of PR

PR for https://issues.apache.org/jira/browse/YARN-11719

### How was this patch tested?

### For code changes:

- [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884818#comment-17884818 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 commented on PR #7065: URL: https://github.com/apache/hadoop/pull/7065#issuecomment-2375553315 @szilard-nemeth @brumi1024 Could you please help to review this PR?
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884495#comment-17884495 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2372879525 @shameersss1 Thanks for the contribution! @aajisaka @zeekling Thanks for the review!

> Fix Yarn over allocating containers
> ---
>
> Key: YARN-11702
> URL: https://issues.apache.org/jira/browse/YARN-11702
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, fairscheduler, scheduler, yarn
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> *Replication Steps:*
> Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)
>
> {code:java}
> spark.executor.memory 1024M
> spark.driver.memory 2048M
> spark.executor.cores 1
> spark.executor.instances 20
> spark.dynamicAllocation.enabled false{code}
>
> Based on this setup, there should be 20 Spark executors, but from the ResourceManager (RM) UI I could see that 32 executors were allocated and 12 of them were released within seconds. On analyzing the Spark ApplicationMaster (AM) logs, the following lines were observed.
>
> {code:java}
> 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom resources: vCores:2147483647>
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them.
> 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, launching executors on 4 of them.
> 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, launching executors on 0 of them.
> {code}
> It was clear from the logs that the 12 extra allocated containers were being ignored on the Spark side. In order to debug this further, additional log lines were added to the [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] class in the increment and decrement of container requests, to expose additional information about each request.
>
> {code:java}
> 2024-06-24 14:10:14,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,111 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01
> 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingConta
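The log lines above show the pending count being incremented by the ask size (20) and decremented once per allocation. A toy counter (`PendingAsk` here is a hypothetical illustration, not YARN's actual `AppSchedulingInfo` bookkeeping) shows the guard that the scheduler needs: once the pending count reaches zero, further allocations must be refused, otherwise extra containers are handed out and immediately released by the AM, as observed in this issue.

```java
// Toy pending-ask counter mirroring the logged behavior: incremented by the
// requested count, decremented once per successful allocation, and guarded
// against allocating past the outstanding demand.
class PendingAsk {
    private int pending;

    void increment(int n) { pending += n; }

    // Returns true only while there is still outstanding demand;
    // this check is what prevents over-allocation.
    boolean tryAllocate() {
        if (pending <= 0) {
            return false;
        }
        pending--;
        return true;
    }

    int getPending() { return pending; }
}
```

In the real scheduler this accounting is per SchedulerRequestKey and must be consistent across the IPC handler thread (incrementing the ask) and the scheduler event thread (decrementing on allocation), which is where the over-allocation window opened.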
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884494#comment-17884494 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 merged PR #6990: URL: https://github.com/apache/hadoop/pull/6990
In order to debug this further, additional log lines > were added to the > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 
SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_000
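The log pattern above (a request for 20 executors followed by 32 allocations) is consistent with grants being committed against a pending-container count whose decrements are applied asynchronously by the scheduler event processor. The following is a hypothetical sketch of that window, not the actual Hadoop code; the method name and the lag model are illustrative assumptions only.

```java
// Hypothetical sketch (not Hadoop code): if the decrement of the visible
// pending-container count lags behind by `lag` grants, the scheduler keeps
// granting against a stale count and extra containers slip through.
public class OverAllocationSketch {
    // Total containers granted when decrements trail the grants by `lag`.
    static int allocatedWithLag(int requested, int lag) {
        int pending = requested;
        int allocated = 0;
        while (pending > 0) {
            allocated++;
            if (lag > 0) {
                lag--;      // decrement delayed: pending count stays stale
            } else {
                pending--;  // decrement finally lands
            }
        }
        return allocated;
    }

    public static void main(String[] args) {
        // 20 requested plus 12 grants whose decrement had not landed yet.
        System.out.println(allocatedWithLag(20, 12)); // 32, matching the RM UI
    }
}
```

With a lag of 12 this reproduces the observed 32 allocations for a 20-container request; with no lag it grants exactly 20.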
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884322#comment-17884322 ] ASF GitHub Bot commented on YARN-11733: --- brumi1024 merged PR #7069: URL: https://github.com/apache/hadoop/pull/7069 > Fix the order of updating CPU controls with cgroup v1 > - > > Key: YARN-11733 > URL: https://issues.apache.org/jira/browse/YARN-11733 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Peter Szucs >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > After YARN-11674 (Update CpuResourceHandler implementation for cgroup v2 > support) the order of updating the cpu.cfs_period_us and cpu.cfs_quota_us > controls has changed, which can cause the errors below when launching > containers with CPU limits on cgroup v1: > {code:java} > PrintWriter unable to write to > /var/cgroupv1/cpu/hadoop-yarn/container_e02_1727079571170_0040_02_01/cpu.cfs_quota_us > with value: 112500{code} > > *Reproduction:* > I set CPU limits in yarn-site.xml for cgroup: > {code:java} > yarn.nodemanager.resource.percentage-physical-cpu-limit: 90 > yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage: > true{code} > After that the limits were applied on the hadoop-yarn root hierarchy: > {code:java} > root@pszucs-test-2 hadoop-yarn]# cat cpu.cfs_period_us 100 > root@pszucs-test-2 hadoop-yarn]# cat cpu.cfs_quota_us 90 > {code} > When I tried to launch a container it gave me the following error: > {code:java} > PrintWriter unable to write to > /var/cgroupv1/cpu/hadoop-yarn/container_e02_1727079571170_0040_02_01/cpu.cfs_quota_us > with value: 112500{code} > It is because the container tries to exceed the limit defined at a higher level > with the 112 500 value for cfs_quota_us. 
If I try to create a test cgroup > manually and update this control, it only lets me do that up to the value > of 90 000 as well: > {code:java} > [root@pszucs-test-2 hadoop-yarn]# cat test/cpu.cfs_period_us > 10 > [root@pszucs-test-2 hadoop-yarn]# echo "90001" > test/cpu.cfs_quota_us > -bash: echo: write error: Invalid argument > [root@pszucs-test-2 hadoop-yarn]# echo "9" > test/cpu.cfs_quota_us{code} > > *Solution:* > The cause of this issue is that the cfs_period_us control gets the default > value of 100 000 when a new cgroup is created, but when YARN calculates the > limit, it uses 1 000 000 for that. Because of this we need to update > cpu.cfs_period_us before cpu.cfs_quota_us, to keep the ratio between the two > values and not exceed the limit defined at the parent level. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
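The ordering constraint described above can be sketched as follows. This is an illustration only, not the actual Hadoop `CpuResourceHandler` code: the class, method names, and the temp-directory stand-in for a cgroup directory are all hypothetical. The point it shows is that when YARN computes a quota against a 1 000 000 µs period, cpu.cfs_period_us must be written before cpu.cfs_quota_us, because a freshly created cgroup defaults to a 100 000 µs period and the kernel rejects a quota that would exceed the parent's quota/period ratio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch (hypothetical names, not Hadoop code): write the
// period first, then the quota computed against that same period, so the
// kernel's ratio check against the parent limit is applied consistently.
public class CfsQuotaWriter {
    static final int YARN_PERIOD_US = 1_000_000; // period YARN assumes

    // e.g. 90% of one core over YARN's period -> quota of 900 000 us
    static int quotaForCpuShare(double cpuShare) {
        return (int) (YARN_PERIOD_US * cpuShare);
    }

    // Order matters: cpu.cfs_period_us before cpu.cfs_quota_us.
    static void applyCpuLimit(Path cgroupDir, double cpuShare) throws IOException {
        Files.write(cgroupDir.resolve("cpu.cfs_period_us"),
                Integer.toString(YARN_PERIOD_US).getBytes());
        Files.write(cgroupDir.resolve("cpu.cfs_quota_us"),
                Integer.toString(quotaForCpuShare(cpuShare)).getBytes());
    }

    public static void main(String[] args) throws IOException {
        // A temp directory stands in for a cgroup directory here.
        Path dir = Files.createTempDirectory("fake-cgroup");
        applyCpuLimit(dir, 0.9);
        System.out.println(Files.readAllLines(dir.resolve("cpu.cfs_period_us")).get(0));
        System.out.println(Files.readAllLines(dir.resolve("cpu.cfs_quota_us")).get(0));
    }
}
```

Note how the 112 500 value from the error above corresponds to 1.125 CPUs against the stale default 100 000 µs period; against YARN's 1 000 000 µs period the same share yields a quota the parent limit accepts.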
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884321#comment-17884321 ] ASF GitHub Bot commented on YARN-11733: --- brumi1024 commented on PR #7069: URL: https://github.com/apache/hadoop/pull/7069#issuecomment-2371595567 Thanks @p-szucs for the patch, LGTM. Merging to trunk.
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884291#comment-17884291 ] ASF GitHub Bot commented on YARN-11733: --- hadoop-yetus commented on PR #7069: URL: https://github.com/apache/hadoop/pull/7069#issuecomment-2371415509 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 50m 12s | | trunk passed | | +1 :green_heart: | compile | 1m 40s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 39s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 45s | | trunk passed | | +1 :green_heart: | javadoc | 0m 47s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 38s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 29s | | trunk passed | | +1 :green_heart: | shadedclient | 38m 12s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 35s | | the patch passed | | +1 :green_heart: | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 25s | | the patch passed | | +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 19s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 28s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed | | +1 :green_heart: | javadoc | 0m 38s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 27s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 25s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 29s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 169m 51s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7069 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 55805aae17e5 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 4c87c3db99c44628b07ce123c8fa43be5dc18bde | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/2/testReport/ | | Max. process+thread count | 527 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884276#comment-17884276 ] ASF GitHub Bot commented on YARN-11733: --- hadoop-yetus commented on PR #7069: URL: https://github.com/apache/hadoop/pull/7069#issuecomment-2371250866 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 49s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 51m 55s | | trunk passed | | +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 38s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 50s | | trunk passed | | +1 :green_heart: | javadoc | 0m 48s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 39s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 30s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 17s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 34s | | the patch passed | | +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 26s | | the patch passed | | +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 18s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 27s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 36s | | the patch passed | | +1 :green_heart: | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 31s | | the patch passed | | +1 :green_heart: | shadedclient | 41m 52s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 33s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 175m 4s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7069 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux effaca3fa0c9 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 68828951a523a75632637671dc9a8de6be9d6469 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/1/testReport/ | | Max. process+thread count | 614 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7069/1/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This
[jira] [Commented] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884211#comment-17884211 ] ASF GitHub Bot commented on YARN-11733: --- p-szucs opened a new pull request, #7069: URL: https://github.com/apache/hadoop/pull/7069 Change-Id: I09429c878c124be9d6a09e8f027ab89d34606f2f ### Description of PR When using cgroup v1, the cpu.cfs_period_us control gets the default value of 100 000 when a new cgroup is created, but when YARN calculates the limit, it uses 1 000 000 for that. Because of this we need to update cpu.cfs_period_us before cpu.cfs_quota_us, to keep the ratio between the two values and not exceed the limit defined at the parent level. ### How was this patch tested? Unit test ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? 
[jira] [Updated] (YARN-11733) Fix the order of updating CPU controls with cgroup v1
[ https://issues.apache.org/jira/browse/YARN-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11733: -- Labels: pull-request-available (was: )
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884186#comment-17884186 ] ASF GitHub Bot commented on YARN-11732: --- hadoop-yetus commented on PR #7065: URL: https://github.com/apache/hadoop/pull/7065#issuecomment-2370643050 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 18m 16s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 49m 25s | | trunk passed | | +1 :green_heart: | compile | 1m 4s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 56s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 0s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 48s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 33s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 50s | | the patch passed | | +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 55s | | the patch passed | | +1 :green_heart: | compile | 0m 46s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 46s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 45s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 50s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | the patch passed | | +1 :green_heart: | shadedclient | 41m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 108m 50s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 273m 38s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7065/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7065 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux f342e868de9d 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 38b63adff112620f11ebbb0089d9d181cbd7b5fe | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7065/1/testReport/ | | Max. process+thread count | 937 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7065/1/console
[jira] [Updated] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11732: -- Labels: pull-request-available (was: ) > Potential NPE when calling SchedulerNode#reservedContainer for > CapacityScheduler > > > Key: YARN-11732 > URL: https://issues.apache.org/jira/browse/YARN-11732 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0, 3.2.4, 3.3.6, 3.5.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Labels: pull-request-available > > I found some places that call *SchedulerNode#getReservedContainer* to get the > reservedContainer (returned value) but do not perform a sanity (not-null) check before > calling its internal methods, which risks raising a > NullPointerException if it is null. > Most of these places have the premise that the node had a reserved container a few > moments ago, but they may get null from > *SchedulerNode#getReservedContainer* the next moment, since the > reservedContainer can be updated to null concurrently by the scheduling and > monitoring (preemption) threads. So a not-null check should be done before > calling internal methods of the reservedContainer.
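The race the description refers to can be sketched with simplified stand-ins for SchedulerNode and RMContainer (these are not the actual Hadoop classes; names and fields here are illustrative only). The safe pattern is to read the reserved container once into a local variable and null-check that local, since a second read may observe null after a concurrent unreserve.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for a SchedulerNode, illustrating why the reserved
// container must be read once into a local and null-checked there: another
// thread (e.g. the preemption monitor) may clear it between two reads.
class Node {
    private final AtomicReference<String> reservedContainer = new AtomicReference<>();

    void reserve(String container) { reservedContainer.set(container); }
    void unreserve() { reservedContainer.set(null); }
    String getReservedContainer() { return reservedContainer.get(); }

    // Unsafe would be: if (getReservedContainer() != null) use
    // getReservedContainer() -- the second read can return null.
    // Safe pattern: single read, then null check on the local.
    int reservedNameLengthSafe() {
        String res = getReservedContainer(); // read exactly once
        return (res != null) ? res.length() : -1;
    }
}

public class ReservedContainerDemo {
    public static void main(String[] args) {
        Node node = new Node();
        node.reserve("container_001");
        System.out.println(node.reservedNameLengthSafe()); // 13
        node.unreserve(); // simulates a concurrent clear by another thread
        System.out.println(node.reservedNameLengthSafe()); // -1, no NPE
    }
}
```

This is the same shape as the fix in PR #7065: hoist `getReservedContainer()` into a local `resContainer` and fold the null check into the condition that dereferences it.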
[jira] [Commented] (YARN-11732) Potential NPE when calling SchedulerNode#reservedContainer for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884102#comment-17884102 ] ASF GitHub Bot commented on YARN-11732: --- TaoYang526 opened a new pull request, #7065: URL: https://github.com/apache/hadoop/pull/7065 ### Description of PR Details please refer to YARN-11732. Add sanity check before calling internal methods of reservedContainer. ### How was this patch tested? Not necessary. ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Potential NPE when calling SchedulerNode#reservedContainer for > CapacityScheduler > > > Key: YARN-11732 > URL: https://issues.apache.org/jira/browse/YARN-11732 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.4.0, 3.2.4, 3.3.6, 3.5.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > > I found some places calling *SchedulerNode#getReservedContainer* to get > reservedContainer (returned value) but not do sanity(not-null) check before > calling internal methods of it, which can have a risk to raise > NullPointerException if it's null. > Most of these places have a premise that node has reserved container a few > moments ago, but may getting null by calling > *SchedulerNode#getReservedContainer* in the next moment, since the > reservedContainer can be updated to null concurrently in scheduling and > monitoring(preemption) thread. 
So a not-null check should be done before > calling internal methods of the reservedContainer. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
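The race described in YARN-11732 can be illustrated with a minimal, self-contained sketch. The `SchedulerNode` and `RMContainer` classes below are simplified stand-ins for the real YARN types, not the actual implementations: the point is that reading the volatile reference twice is unsafe (a preemption thread may null it between the check and the use), while copying it into a local variable once makes the not-null check reliable.

```java
// Simplified stand-in for YARN's RMContainer.
class RMContainer {
    Object getReservedSchedulerKey() { return new Object(); }
}

// Simplified stand-in for YARN's SchedulerNode; the reserved container can
// be cleared concurrently by the monitoring (preemption) thread.
class SchedulerNode {
    private volatile RMContainer reservedContainer;

    void setReservedContainer(RMContainer c) { reservedContainer = c; }
    RMContainer getReservedContainer() { return reservedContainer; }
}

class ReservationReader {
    // Unsafe: two separate reads of the volatile field. Another thread may
    // set it to null between the check and the dereference, raising an NPE.
    static Object unsafeRead(SchedulerNode node) {
        if (node.getReservedContainer() != null) {
            return node.getReservedContainer().getReservedSchedulerKey();
        }
        return null;
    }

    // Safe: read the field once into a local, then check the local before use.
    static Object safeRead(SchedulerNode node) {
        RMContainer res = node.getReservedContainer();
        if (res != null && res.getReservedSchedulerKey() != null) {
            return res.getReservedSchedulerKey();
        }
        return null;
    }
}
```

The merged patch follows the same shape as `safeRead`: capture the reference, then guard every dereference through the captured local.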
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883860#comment-17883860 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2368101194 @shameersss1 Thanks for the contribution! If there are no other comments in the next 2 days, we will merge this PR to the trunk branch. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 Spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released within seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear from the logs that the 12 extra allocated containers are being > ignored on the Spark side. In order to debug this further, additional log lines > were added to the > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Upd
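The log excerpt above shows the pending-container counter being incremented by the IPC handler thread and decremented one at a time by the scheduler event processor. As a hedged illustration of why this bookkeeping must be race-free (a simplified model, not YARN's actual AppSchedulingInfo logic): if several threads satisfy the same ask, a naive read-check-decrement can hand out more containers than requested, whereas an atomic claim cannot.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of a pending-container ask. The real AppSchedulingInfo
// is more involved; this only demonstrates the claim-one-atomically pattern.
class PendingAsk {
    private final AtomicInteger pending;

    PendingAsk(int count) { pending = new AtomicInteger(count); }

    // Atomically claim one container only while some are still pending, so
    // concurrent threads can never grant more containers than were asked for.
    boolean tryAllocate() {
        while (true) {
            int cur = pending.get();
            if (cur <= 0) {
                return false;          // nothing left to allocate
            }
            if (pending.compareAndSet(cur, cur - 1)) {
                return true;           // claimed exactly one container
            }
            // lost the race to another thread; retry with the fresh value
        }
    }

    int remaining() { return pending.get(); }
}
```

With this pattern, 32 allocation attempts against an ask of 20 grant exactly 20 containers; the 12 extra attempts fail instead of producing surplus containers that the AM then releases.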
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883741#comment-17883741 ] ASF GitHub Bot commented on YARN-11560: --- ayushtkn merged PR #6021: URL: https://github.com/apache/hadoop/pull/6021 > Fix NPE bug when multi-node enabled with schedule asynchronously > > > Key: YARN-11560 > URL: https://issues.apache.org/jira/browse/YARN-11560 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.3 >Reporter: wangzhongwei >Assignee: wangzhongwei >Priority: Blocker > Labels: pull-request-available > > When multiNodePlacementEnabled is set and the global scheduler is used, an NPE may happen when the > commit thread calls allocateFromReservedContainer with the param > reservedContainer, while the container may have been unreserved by the judgment > thread in the tryCommit->apply function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883683#comment-17883683 ] ASF GitHub Bot commented on YARN-11560: --- hadoop-yetus commented on PR #6021: URL: https://github.com/apache/hadoop/pull/6021#issuecomment-2367248090 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 7m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 36m 35s | | trunk passed | | +1 :green_heart: | compile | 0m 34s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 32s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 32s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 35s | | trunk passed | | +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 33s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 17s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 43s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 27s | | the patch passed | | +1 :green_heart: | compile | 0m 30s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 30s | | the patch passed | | +1 :green_heart: | compile | 0m 27s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 27s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 24s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 31s | | the patch passed | | +1 :green_heart: | javadoc | 0m 27s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 27s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 10s | | the patch passed | | +1 :green_heart: | shadedclient | 20m 20s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 89m 52s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 26s | | The patch does not generate ASF License warnings. 
| | | | 184m 24s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6021/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6021 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 10d0b101643f 5.15.0-116-generic #126-Ubuntu SMP Mon Jul 1 10:14:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 056f22135781508f8b465660591046919b3b1cfb | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6021/4/testReport/ | | Max. process+thread count | 951 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6021/4/console
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883660#comment-17883660 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 merged PR #7049: URL: https://github.com/apache/hadoop/pull/7049 > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment: > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux): > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for the locking mechanism. > * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM > nodes, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not reported until their corresponding NodeManagers (NMs) > send their first heartbeat. However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value of > -1. > This design creates several challenges: > * *Untracked NodeManagers*: During ResourceManager HA failover or an RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. Similar behaviour exists in HDFS, > where such nodes are marked as _"DEAD"_. > * *Monitoring Gaps*: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in the > ResourceManager's view of the total number of nodes. > * *Operational Impact*: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as _LOST_, _UNHEALTHY_, > _DECOMMISSIONED_, etc. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that is not registered and not > part of the exclude file by default at RM startup or HA failover. This > can be done by marking the node with a special port value of _-2_, signaling > that the node is considered LOST but has not yet reported. Whenever a > heartbeat is received for that nodeID, it will be > transitioned from _LOST_ to _RUNNING_, _UNHEALTHY_, or any other > required state. > h3. Key implementation points > * Mark Unreported Nodes as LOST: nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as _LOST_. This > can be achieved by modifying the _NodesListManager_ under the > refreshHostsReader method, invoked during failover or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their desired state when a heartbeat is received. > * Handle Node Heartbeat and Transition: when a node sends its first > heartbeat, the system should verify if the node is listed in > getInactiveRMNodes(). If the node exists in the _LOST_ > state, the RM should remove it from the inactive list, decrement the _LOST_
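The proposal above amounts to a set difference over the include, exclude, and registered host lists. A rough sketch follows; all names here (`LostNodeMarker`, `markUnreported`) are hypothetical stand-ins for illustration, not the actual NodesListManager API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper sketching the proposed refresh logic: include-file
// hosts that never registered and are not excluded get reported as LOST
// with the sentinel port -2.
class LostNodeMarker {
    static final int UNREPORTED_PORT = -2;  // sentinel: LOST, never registered

    // Returns host -> port for the nodes that should be marked LOST.
    static Map<String, Integer> markUnreported(Set<String> includeHosts,
                                               Set<String> excludeHosts,
                                               Set<String> registeredHosts) {
        Map<String, Integer> lost = new HashMap<>();
        for (String host : includeHosts) {
            // skip hosts that already registered or are explicitly excluded
            if (!registeredHosts.contains(host) && !excludeHosts.contains(host)) {
                lost.put(host, UNREPORTED_PORT);
            }
        }
        return lost;
    }
}
```

On the first heartbeat from such a host, the RM would then remove the sentinel entry and transition the node to its observed state, as described in the heartbeat-handling point above.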
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883663#comment-17883663 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2367142932 @arjunmohnot Thanks for the contribution! @zeekling Thanks for the review!
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883648#comment-17883648 ] ASF GitHub Bot commented on YARN-11560: --- granewang commented on code in PR #6021: URL: https://github.com/apache/hadoop/pull/6021#discussion_r1770714881 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -1737,6 +1737,10 @@ private CSAssignment allocateContainerOnSingleNode( private void allocateFromReservedContainer(FiCaSchedulerNode node, boolean withNodeHeartbeat, RMContainer reservedContainer) { +if(reservedContainer == null){ + LOG.warn("reservedContainer is null,that may be unreserved by the proposal judgment thread"); Review Comment: Thanks for your review and pr updated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
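The guard in the diff above is an early return taken when the reservation was already released between the scheduling proposal and its commit. A minimal stand-alone sketch of the same pattern, with stand-in types and `java.util.logging` in place of the project's SLF4J logger:

```java
import java.util.logging.Logger;

// Stand-in for the commit-side allocation path; not CapacityScheduler itself.
class Allocator {
    private static final Logger LOG = Logger.getLogger(Allocator.class.getName());

    // Returns false when the reservation vanished before the commit,
    // instead of dereferencing a null container and throwing an NPE.
    static boolean allocateFromReserved(Object reservedContainer) {
        if (reservedContainer == null) {
            LOG.warning("reservedContainer is null; it may have been "
                + "unreserved by the proposal judgment thread");
            return false;
        }
        // ... proceed with allocation from the still-valid reservation ...
        return true;
    }
}
```

The warn-and-return keeps the asynchronous commit thread alive; the unreserve that raced ahead of it is a legitimate state, not an error.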
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883616#comment-17883616 ] ASF GitHub Bot commented on YARN-11702: --- hadoop-yetus commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2366846061 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 17m 56s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 43s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 33m 26s | | trunk passed | | +1 :green_heart: | compile | 7m 37s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 7m 10s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 2m 0s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 17s | | trunk passed | | +1 :green_heart: | javadoc | 3m 9s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 3m 0s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 6m 19s | | trunk passed | | +1 :green_heart: | shadedclient | 37m 2s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 4s | | the patch passed | | +1 :green_heart: | compile | 6m 56s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 6m 56s | | the patch passed | | +1 :green_heart: | compile | 6m 59s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 6m 59s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 51s | | the patch passed | | +1 :green_heart: | mvnsite | 2m 58s | | the patch passed | | +1 :green_heart: | javadoc | 2m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 2m 46s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 6m 36s | | the patch passed | | +1 :green_heart: | shadedclient | 37m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 1m 10s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 5m 51s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 110m 15s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 1m 0s | | The patch does not generate ASF License warnings. 
| | | | 326m 14s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6990/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6990 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 82d88f39fd1a 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / a1f82433186262429d84139f9e61c7653383b5e8 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6990/5/testReport/ | | Max. process+thread count | 956 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883615#comment-17883615 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2366843257 > > Hi @slfan1989, @zeekling, If there are no further concerns, could I kindly request an approval so we can merge this change? Thank you for the review. > > @arjunmohnot Thank you for your contributions! If there are no further comments by the end of this week, we will merge into the trunk branch. Thank you @slfan1989 for your thoughtful feedback and the time spent reviewing these changes. Your support is truly appreciated! If everything looks good and there are no further comments, merging this PR at your convenience would be a great help! 🚀
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883586#comment-17883586 ] ASF GitHub Bot commented on YARN-11702: --- shameersss1 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2366614555 @aajisaka @slfan1989 - I have addressed the latest comments - Please review Thanks > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear from the logs that the extra 12 allocated containers were being > ignored on the Spark side. In order to debug this further, additional log lines > were added to the > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1
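The log excerpt above can be condensed into a small, purely illustrative simulation of the race: the AM re-sends its full pending ask on every heartbeat, and if that ask arrives before the AM has observed containers the RM already allocated, the RM-side pending count is inflated and surplus containers are handed out. All class and method names below are hypothetical, and the numbers are illustrative rather than the exact ones from the logs:

```java
// Hypothetical sketch of the over-allocation race in YARN-11702.
// This is not Hadoop code; numbers are illustrative.
class OverAllocationSketch {

    /** Returns the number of surplus containers the RM hands out. */
    public static int simulate() {
        final int desired = 20;
        int amLaunched = 0;   // executors the AM has launched so far
        int rmAllocated = 0;  // containers the RM has handed out

        // Heartbeat 1: AM asks for desired - amLaunched = 20 containers.
        int rmPending = desired - amLaunched;

        // RM allocates 8 of them before the next heartbeat arrives.
        rmPending -= 8;
        rmAllocated += 8;

        // Heartbeat 2 races with the allocation above: the AM has not yet
        // seen the 8 containers, so it re-sends the full ask of 20,
        // resetting the RM-side pending count.
        rmPending = desired - amLaunched;  // back to 20, though 8 are in flight
        amLaunched += 8;                   // AM now launches on the 8

        // RM satisfies the inflated pending count in full.
        rmAllocated += rmPending;          // 20 more, 28 in total

        // AM launches only what it still needs and releases the rest.
        int stillNeeded = desired - amLaunched;  // 12
        amLaunched += stillNeeded;
        return rmAllocated - desired;            // surplus, released in seconds
    }

    public static void main(String[] args) {
        System.out.println("surplus containers released: " + simulate());
    }
}
```

Here the AM receives 28 containers for a 20-executor job, mirroring (at a smaller scale) the 32-received/12-released pattern in the logs above; the direction of the fix discussed in the PR is to have the RM recognize asks it has already satisfied rather than re-applying each heartbeat's pending count blindly.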
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883585#comment-17883585 ] ASF GitHub Bot commented on YARN-11702: --- shameersss1 commented on code in PR #6990: URL: https://github.com/apache/hadoop/pull/6990#discussion_r1770451559 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java: ## @@ -1678,4 +1794,78 @@ private List getAppsFromQueue(String queueName) } return apps; } + + /** + * ContainerObjectType is a container object with the following properties. + * Namely allocationId, priority, executionType and resourceType. + */ + protected class ContainerObjectType extends Object { +private final long allocationId; +private final Priority priority; +private final ExecutionType executionType; +private final Resource resource; + +public ContainerObjectType(long allocationId, Priority priority, +ExecutionType executionType, Resource resource) { + this.allocationId = allocationId; + this.priority = priority; + this.executionType = executionType; + this.resource = resource; +} + +public long getAllocationId() { + return allocationId; +} + +public Priority getPriority() { + return priority; +} + +public ExecutionType getExecutionType() { + return executionType; +} + +public Resource getResource() { + return resource; +} + +@Override +public int hashCode() { + final int prime = 31; + int result = 1; + result = (int) (prime * result + allocationId); + result = prime * result + (priority == null ? 0 : priority.hashCode()); + result = prime * result + (executionType == null ? 0 : executionType.hashCode()); + result = prime * result + (resource == null ? 
0 : resource.hashCode()); + return result; +} + +@Override +public boolean equals(Object obj) { + if (obj == null) { Review Comment: ack ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java: ## @@ -1678,4 +1794,78 @@ private List getAppsFromQueue(String queueName) } return apps; } + + /** + * ContainerObjectType is a container object with the following properties. + * Namely allocationId, priority, executionType and resourceType. + */ + protected class ContainerObjectType extends Object { +private final long allocationId; +private final Priority priority; +private final ExecutionType executionType; +private final Resource resource; + +public ContainerObjectType(long allocationId, Priority priority, +ExecutionType executionType, Resource resource) { + this.allocationId = allocationId; + this.priority = priority; + this.executionType = executionType; + this.resource = resource; +} + +public long getAllocationId() { + return allocationId; +} + +public Priority getPriority() { + return priority; +} + +public ExecutionType getExecutionType() { + return executionType; +} + +public Resource getResource() { + return resource; +} + +@Override +public int hashCode() { + final int prime = 31; + int result = 1; Review Comment: ack > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the 
setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24
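The ContainerObjectType key quoted in the diff above can be sketched in a self-contained form that also addresses the two review points acknowledged with "ack": the original hashCode cast the long allocationId to int (discarding the high bits), and equals opened with a bare null check. In the simplified stand-in below, Priority, ExecutionType and Resource are replaced by plain int/String fields purely for illustration; only the equals/hashCode contract is the point:

```java
import java.util.Objects;

// Simplified stand-in for the ContainerObjectType key in the PR diff.
// Priority/ExecutionType/Resource are plain fields here for illustration.
final class ContainerKey {
    private final long allocationId;
    private final int priority;
    private final String executionType;
    private final int memoryMb;

    ContainerKey(long allocationId, int priority,
                 String executionType, int memoryMb) {
        this.allocationId = allocationId;
        this.priority = priority;
        this.executionType = executionType;
        this.memoryMb = memoryMb;
    }

    @Override
    public int hashCode() {
        // Objects.hash folds the long via Long.hashCode (both halves of the
        // long contribute) instead of a lossy (int) cast, and handles any
        // null fields uniformly.
        return Objects.hash(allocationId, priority, executionType, memoryMb);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof ContainerKey)) {
            return false;  // also covers obj == null
        }
        ContainerKey other = (ContainerKey) obj;
        return allocationId == other.allocationId
            && priority == other.priority
            && Objects.equals(executionType, other.executionType)
            && memoryMb == other.memoryMb;
    }
}
```

With a correct contract, two keys built from the same request fields compare equal and collide in a hash-based collection, which is what lets the scheduler match a newly arrived ask against containers it has already allocated.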
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883583#comment-17883583 ] ASF GitHub Bot commented on YARN-11702: --- shameersss1 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2366453607 > The scheduler will not discard resource application requests, so why do we need to apply multiple times? @shameersss1 Yes, but AMRMClient in Hadoop works in a way that the AM always sends its pending container request as part of every heartbeat. This is done for two reasons: 1. The AM can dynamically change the request (e.g. Tez has auto parallelism and container reuse). 2. To make sure the AM always gets what it wants. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 Spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. 
> 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. > {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by
[jira] [Commented] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883451#comment-17883451 ] ASF GitHub Bot commented on YARN-11560: --- ayushtkn commented on code in PR #6021: URL: https://github.com/apache/hadoop/pull/6021#discussion_r1769492431 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -1737,6 +1737,10 @@ private CSAssignment allocateContainerOnSingleNode( private void allocateFromReservedContainer(FiCaSchedulerNode node, boolean withNodeHeartbeat, RMContainer reservedContainer) { +if(reservedContainer == null){ + LOG.warn("reservedContainer is null,that may be unreserved by the proposal judgment thread"); Review Comment: nit add space after , > Fix NPE bug when multi-node enabled with schedule asynchronously > > > Key: YARN-11560 > URL: https://issues.apache.org/jira/browse/YARN-11560 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.3 >Reporter: wangzhongwei >Assignee: wangzhongwei >Priority: Blocker > > when multiNodePlacementEnabled is set and the global scheduler is used, an NPE may happen when the > commit thread calls allocateFromReservedContainer with the param > reservedContainer, while the container may be unreserved by the judgment > thread in the tryCommit->apply function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
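The race behind this NPE (and the sibling YARN-11732) follows one pattern: the reserved container reference is read from the node more than once, and the scheduling or preemption thread can null it out between the check and the use. A minimal, self-contained sketch of the defensive read-once pattern follows; SchedulerNode and RMContainer here are toy stand-ins, not the Hadoop classes:

```java
// Toy stand-ins illustrating the read-once-then-null-check pattern
// used by the fixes for YARN-11560 / YARN-11732.
class ReservedContainerSketch {

    static class RMContainer {
        final String schedulerKey;
        RMContainer(String schedulerKey) { this.schedulerKey = schedulerKey; }
        String getReservedSchedulerKey() { return schedulerKey; }
    }

    static class SchedulerNode {
        // Set by the scheduling thread, cleared by the preemption monitor:
        // may flip to null between any two reads.
        volatile RMContainer reservedContainer;
        RMContainer getReservedContainer() { return reservedContainer; }
    }

    /** Returns the reserved key, or null if nothing is (still) reserved. */
    static String reservedKeyOrNull(SchedulerNode node) {
        // Read the volatile reference exactly once into a local; a second
        // getReservedContainer() call could observe a concurrent unreserve
        // and return null, which is exactly how the NPE arises.
        RMContainer res = node.getReservedContainer();
        if (res != null && res.getReservedSchedulerKey() != null) {
            return res.getReservedSchedulerKey();
        }
        return null;
    }

    public static void main(String[] args) {
        SchedulerNode node = new SchedulerNode();
        System.out.println(reservedKeyOrNull(node));   // no reservation: null, no NPE
        node.reservedContainer = new RMContainer("key-1");
        System.out.println(reservedKeyOrNull(node));
    }
}
```

This is the shape of the change TaoYang526 describes: hold the result of getReservedContainer() in a local and null-check that local before touching any of its methods, rather than catching the NPE after the fact.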
[jira] [Updated] (YARN-11560) Fix NPE bug when multi-node enabled with schedule asynchronously
[ https://issues.apache.org/jira/browse/YARN-11560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11560: -- Labels: pull-request-available (was: ) > Fix NPE bug when multi-node enabled with schedule asynchronously > > > Key: YARN-11560 > URL: https://issues.apache.org/jira/browse/YARN-11560 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.3.3 >Reporter: wangzhongwei >Assignee: wangzhongwei >Priority: Blocker > Labels: pull-request-available > > when multiNodePlacementEnabled is set and the global scheduler is used, an NPE may happen when the > commit thread calls allocateFromReservedContainer with the param > reservedContainer, while the container may be unreserved by the judgment > thread in the tryCommit->apply function -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883440#comment-17883440 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2364800576 > Hi @slfan1989, @zeekling, If there are no further concerns, could I kindly request an approval so we can merge this change? Thank you for the review. @arjunmohnot Thank you for your contributions! If there are no further comments by the end of this week, we will merge into the trunk branch. > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment{*}:{*} > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux){*}:{*} > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for locking mechanism. > * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM > node, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. 
Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not immediately reported until their corresponding NodeManagers (NMs) > send their first heartbeat. However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value > -1. > This design creates several challenges: > * {*}Untracked Nodemanagers{*}: During Resourcemanager HA failover or RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. While in HDFS > similar behaviour exists and is marked as {_}"DEAD"{_}. > * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in > Resourcemanager's state on the total no. of nodes. > * {*}Operational Impact{*}: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, or > {_}DECOMMISSIONED, etc{_}. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that are not registered and not > part of the exclude file by default at the RM startup or HA failover. This > can be done by marking the node with a special port value {_}-2{_}, signaling > that the node is considered LOST but has not yet been reported. 
Whenever a > heartbeat is received for that {color:#de350b}nodeID{color}, it will be > transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other > required desired state. > h3. Key implementation points > * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as {_}LOST{_}. This > can be achieved by modifying the _NodesListManager_ under the > {color:#de350b}refreshHostsReader{color} method, invoked during failover, or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their des
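The proposal above reduces to a small reconciliation step during refreshHostsReader: every host in the include file that is neither excluded nor present in the active node context is registered as LOST with the sentinel port -2, and a later heartbeat promotes it to its real state. A self-contained sketch of that logic follows; the class and method names are hypothetical and do not match the actual NodesListManager API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the reconciliation proposed for
// NodesListManager#refreshHostsReader in YARN-11730; not the Hadoop API.
class LostNodeSketch {
    static final int UNTRACKED_PORT = -2;  // sentinel: LOST, never reported

    enum NodeState { LOST, RUNNING }

    static class NodeEntry {
        NodeState state;
        int port;
        NodeEntry(NodeState state, int port) { this.state = state; this.port = port; }
    }

    final Map<String, NodeEntry> activeNodes = new HashMap<>();

    /** Called at RM startup, HA failover, or a manual node refresh. */
    void markUnreportedAsLost(Set<String> includeHosts, Set<String> excludeHosts) {
        for (String host : includeHosts) {
            // Included, not excluded, and absent from the active context:
            // surface it immediately as LOST with the untracked port.
            if (!excludeHosts.contains(host) && !activeNodes.containsKey(host)) {
                activeNodes.put(host, new NodeEntry(NodeState.LOST, UNTRACKED_PORT));
            }
        }
    }

    /** A heartbeat from a previously untracked host promotes it. */
    void onHeartbeat(String host, int realPort) {
        NodeEntry e = activeNodes.get(host);
        if (e != null && e.port == UNTRACKED_PORT) {
            e.state = NodeState.RUNNING;  // or UNHEALTHY etc., per the report
            e.port = realPort;
        }
    }
}
```

With this in place the node shows up in JMX as LOST from the moment of startup or failover, and the special port value makes untracked entries distinguishable from nodes that heartbeated and were later lost.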
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883353#comment-17883353 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2364153400 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 53s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 0s | | trunk passed | | +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 57s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 2s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 50s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 1s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 50s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 49s | | the patch passed | | +1 :green_heart: | compile | 0m 54s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 54s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 43s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 50s | | the patch passed | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 43s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 35s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 111m 22s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. 
| | | | 243m 9s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/7/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 7961632c7c16 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 8c46ee7bce648f5798cc45d950df423bb92a5122 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/7/testReport/ | | Max. process+thread count | 957 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/7/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883209#comment-17883209 ] ASF GitHub Bot commented on YARN-11708: --- susheelgupta7 commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1768322383 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java: ## @@ -44,8 +44,10 @@ import org.apache.hadoop.yarn.api.records.ResourceRequest; import org.apache.hadoop.yarn.api.records.SchedulingRequest; import org.apache.hadoop.yarn.event.EventHandler; +import org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext; import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer; import org.apache.hadoop.yarn.exceptions.YarnException; +import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue; Review Comment: Thanks for the review. Yes FairScheduler has its own FSQueue class, which is specific to its design. > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. 
> Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
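The bug pattern in the repro above is one of ordering: the lifetime check reads maximum-application-lifetime before the auto-created queue (with its template applied) exists, so the first application falls back to unlimited while every later one sees the template value. The sketch below contrasts the buggy and fixed orderings with hypothetical names; the actual patch routes this through queue creation from the placement context in the scheduler, not these classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the AQCv2 template-lifetime ordering bug in
// YARN-11708; names do not match the real CapacityScheduler API.
class AqcLifetimeSketch {
    static final long UNLIMITED = -1L;

    static class Queue {
        final long maxAppLifetimeSecs;
        Queue(long maxAppLifetimeSecs) { this.maxAppLifetimeSecs = maxAppLifetimeSecs; }
    }

    final Map<String, Queue> queues = new HashMap<>();
    final Map<String, Long> templateMaxLifetime = new HashMap<>();

    /** Buggy order: lifetime is read first, the queue is auto-created after. */
    long submitBuggy(String queueName) {
        Queue q = queues.get(queueName);               // null on first submit
        long lifetime = (q == null) ? UNLIMITED : q.maxAppLifetimeSecs;
        queues.computeIfAbsent(queueName, this::createFromTemplate);
        return lifetime;
    }

    /** Fixed order: create the queue from its template, then read it. */
    long submitFixed(String queueName) {
        Queue q = queues.computeIfAbsent(queueName, this::createFromTemplate);
        return q.maxAppLifetimeSecs;
    }

    private Queue createFromTemplate(String queueName) {
        return new Queue(templateMaxLifetime.getOrDefault(queueName, UNLIMITED));
    }
}
```

In the buggy ordering the first submission returns UNLIMITED (matching the "ExpiryTime : UNLIMITED" repro output) and only subsequent ones return 8; in the fixed ordering the template lifetime applies from the very first application.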
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883203#comment-17883203 ] ASF GitHub Bot commented on YARN-11708: --- brumi1024 commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1768296527 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/YarnScheduler.java: ## @@ -44,8 +44,10 @@ import org.apache.hadoop.yarn.api.records.ResourceRequest; import org.apache.hadoop.yarn.api.records.SchedulingRequest; import org.apache.hadoop.yarn.event.EventHandler; +import org.apache.hadoop.yarn.server.resourcemanager.placement.ApplicationPlacementContext; import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer; import org.apache.hadoop.yarn.exceptions.YarnException; +import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue; Review Comment: YarnScheduler should not rely on one of its implementations' utility classes. ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java: ## @@ -1566,6 +1567,14 @@ public long checkAndGetApplicationLifetime(String queueName, long lifetime) { return lifetime; } + @Override + public CSQueue getOrCreateQueueFromPlacementContext(ApplicationId Review Comment: It shouldn't return a CSQueue, as Fair Scheduler or the Fifo Scheduler do not utilize Capacity Scheduler queues. 
> Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. > Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
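The review comments above make an interface-design point: the scheduler-agnostic YarnScheduler interface must not be typed against CSQueue, because FairScheduler and FifoScheduler have their own queue classes. A tiny illustration of the intended shape, with all names illustrative rather than the real YARN types:

```java
// Illustration of the review point on YARN-11708: the shared scheduler
// interface returns a common queue abstraction, never CSQueue.
// All names are illustrative stand-ins.
class QueueAbstractionSketch {

    interface Queue {                        // common abstraction
        long getMaxApplicationLifetime();
    }

    interface SchedulerLike {
        // Implementation-neutral return type: works equally for
        // Capacity, Fair and Fifo scheduler implementations.
        Queue getOrCreateQueue(String queuePath);
    }

    /** CapacityScheduler-flavoured queue stays an implementation detail. */
    static class CSQueueLike implements Queue {
        public long getMaxApplicationLifetime() { return 8L; }
    }

    static class CapacitySchedulerLike implements SchedulerLike {
        public Queue getOrCreateQueue(String queuePath) {
            return new CSQueueLike();        // callers only ever see Queue
        }
    }
}
```

Callers written against SchedulerLike never import a CapacityScheduler class, so the same lifetime logic can sit in the abstract scheduler without coupling it to one implementation.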
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883066#comment-17883066 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2361722415 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 54s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 1s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 14s | | trunk passed | | +1 :green_heart: | compile | 1m 4s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 57s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 58s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 2s | | trunk passed | | +1 :green_heart: | javadoc | 0m 58s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 50s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 3s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 49s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 43s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 53s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 31s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 109m 25s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 242m 40s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/6/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 767fb092bef3 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / b0274bdb9ef52b03c74c6b9c232854fb9a395ad9 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/6/testReport/ | | Max. process+thread count | 946 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/6/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882975#comment-17882975 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2360679473 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 4s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 50m 4s | | trunk passed | | +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 59s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 59s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 6s | | trunk passed | | +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 12s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 41s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 50s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 45s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/5/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 286 unchanged - 0 fixed = 289 total (was 286) | | +1 :green_heart: | mvnsite | 0m 53s | | the patch passed | | -1 :x: | javadoc | 0m 46s | [/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/5/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | -1 :x: | javadoc | 0m 42s | 
[/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/5/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 20s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 109m 43s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882783#comment-17882783 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2359151132 Hi @slfan1989, @zeekling, If there are no further concerns, could I kindly request an approval so we can merge this change? Thank you for the review. > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment{*}:{*} > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux){*}:{*} > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for locking mechanism. > * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM > node, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not immediately reported until their corresponding NodeManagers (NMs) > send their first heartbeat. 
However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value > -1. > This design creates several challenges: > * {*}Untracked Nodemanagers{*}: During Resourcemanager HA failover or RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. While in HDFS > similar behaviour exists and is marked as {_}"DEAD"{_}. > * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in > Resourcemanager's state on the total no. of nodes. > * {*}Operational Impact{*}: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, or > {_}DECOMMISSIONED, etc{_}. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that are not registered and not > part of the exclude file by default at the RM startup or HA failover. This > can be done by marking the node with a special port value {_}-2{_}, signaling > that the node is considered LOST but has not yet been reported. Whenever a > heartbeat is received for that {color:#de350b}nodeID{color}, it will be > transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other > required desired state. > h3. 
Key implementation points > * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as {_}LOST{_}. This > can be achieved by modifying the _NodesListManager_ under the > {color:#de350b}refreshHostsReader{color} method, invoked during failover, or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their desired state when the heartbeat is received. > * Handle Node Heartbeat and Transition: When a node sends its first > heartbeat, the system should ve
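The proposal above (mark include-file hosts with no registered NodeManager as LOST with sentinel port -2, then transition them on first heartbeat) can be sketched with minimal stand-in types; this is not the actual NodesListManager/refreshHostsReader code, just the state bookkeeping it describes:

```java
import java.util.*;

// Minimal stand-in sketch of the YARN-11730 proposal: on RM startup or a
// hosts refresh, any host in the include list that is neither registered
// nor excluded is recorded as LOST with the sentinel port -2; a later
// heartbeat moves it to RUNNING with its real port.
class UnregisteredNodeSketch {
    enum State { LOST, RUNNING }

    // host -> (port, state); port -2 marks an unreported/untracked node
    final Map<String, Map.Entry<Integer, State>> nodes = new HashMap<>();

    void refreshHosts(Set<String> include, Set<String> exclude,
                      Set<String> registered) {
        for (String host : include) {
            if (!registered.contains(host) && !exclude.contains(host)) {
                nodes.put(host, new AbstractMap.SimpleEntry<>(-2, State.LOST));
            }
        }
    }

    void onHeartbeat(String host, int port) {
        // First heartbeat from an untracked host: LOST -> RUNNING, real port.
        nodes.put(host, new AbstractMap.SimpleEntry<>(port, State.RUNNING));
    }

    State stateOf(String host) { return nodes.get(host).getValue(); }
    int portOf(String host)    { return nodes.get(host).getKey(); }
}
```

Excluded hosts stay out of this map (they are already shown as decommissioned with port -1), so the -2 sentinel only ever marks include-file hosts awaiting their first report.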
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882705#comment-17882705 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2358394779 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 59s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 24s | | trunk passed | | +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 57s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 3s | | trunk passed | | +1 :green_heart: | javadoc | 0m 58s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 51s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 2s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 25s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 49s | | the patch passed | | +1 :green_heart: | compile | 0m 55s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 55s | | the patch passed | | +1 :green_heart: | compile | 0m 49s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 49s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 44s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/4/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 286 unchanged - 0 fixed = 288 total (was 286) | | +1 :green_heart: | mvnsite | 0m 52s | | the patch passed | | -1 :x: | javadoc | 0m 45s | [/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/4/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | -1 :x: | javadoc | 0m 43s | 
[/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/4/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) | | +1 :green_heart: | spotbugs | 1m 57s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 110m 1s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882650#comment-17882650 ] ASF GitHub Bot commented on YARN-11708: --- K0K0V0K commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1764751050 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3366,14 +3367,47 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, @Override public long checkAndGetApplicationLifetime(String queueName, Review Comment: I think instead of add this create new queue logic to the method we could make the getOrCreateQueueFromPlacementContext public and call it before we call the checkAndGetApplicationLifetime method. If i see well this checkAndGetApplicationLifetime used just once in the code ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3366,14 +3367,47 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, @Override public long checkAndGetApplicationLifetime(String queueName, - long lifetimeRequestedByApp) { -readLock.lock(); + long lifetimeRequestedByApp, RMAppImpl app) { +CSQueue queue; + +writeLock.lock(); try { - CSQueue queue = getQueue(queueName); + queue = getQueue(queueName); + + // This handles the case where the first submitted app in aqc queue does not exist, + // addressing the issue related to YARN-11708. 
+ if (queue == null) { +queue = getOrCreateQueueFromPlacementContext(app.getApplicationId(), app.getUser(), + app.getQueue(), app.getApplicationPlacementContext(), false); + } + + if (queue == null) { +String message; +if (isAmbiguous(queueName)) { + message = "Application " + app.getApplicationId() Review Comment: I think here we have some code duplication. We can add the `"Application " + app.getApplicationId() + " submitted by user " + app.getUser()` part to the line 3385
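The reviewer's suggestion, resolve (or create) the queue before running the lifetime check rather than teaching checkAndGetApplicationLifetime to create queues, can be sketched with illustrative stand-in types (the method names mirror the discussion but this is not the CapacityScheduler API):

```java
import java.util.*;

// Hedged sketch of the "resolve queue first" refactor suggested in review.
// Stand-in types only; the real getOrCreateQueueFromPlacementContext and
// checkAndGetApplicationLifetime live in CapacityScheduler.
class QueueFirstSketch {
    final Map<String, Long> queueMaxLifetime = new HashMap<>();

    // Stand-in for getOrCreateQueueFromPlacementContext: creates the queue
    // from its template maximum if it does not exist yet.
    String getOrCreateQueue(String name, long templateMax) {
        queueMaxLifetime.putIfAbsent(name, templateMax);
        return name;
    }

    // Stand-in for checkAndGetApplicationLifetime: may assume the queue exists.
    long checkAndGetLifetime(String queue, long requested) {
        long max = queueMaxLifetime.get(queue);
        if (max <= 0) {
            return requested;
        }
        return (requested <= 0) ? max : Math.min(requested, max);
    }

    // The caller resolves the queue before the check, so even the first
    // submitted app sees the template maximum instead of unlimited lifetime.
    long submit(String queueName, long templateMax, long requested) {
        String q = getOrCreateQueue(queueName, templateMax);
        return checkAndGetLifetime(q, requested);
    }
}
```

This keeps the lifetime check free of queue-creation logic, which is attractive given that checkAndGetApplicationLifetime has a single call site.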
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882639#comment-17882639 ] ASF GitHub Bot commented on YARN-11708: --- K0K0V0K commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2357904421 Thanks for the update @susheelgupta7! May I ask you to fill in the PR description and "How was this patch tested?" parts?
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882577#comment-17882577 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357548110 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 1s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 22s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 22m 3s | | trunk passed | | +1 :green_heart: | compile | 4m 11s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 3m 53s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 58s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 41s | | trunk passed | | +1 :green_heart: | javadoc | 1m 46s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 35s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 32s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 22s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 9s | | the patch passed | | +1 :green_heart: | compile | 3m 52s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 3m 52s | | the patch passed | | +1 :green_heart: | compile | 3m 40s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 3m 40s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 27s | | the patch passed | | +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 38s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 55s | | the patch passed | | +1 :green_heart: | shadedclient | 23m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 0m 45s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 4m 43s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 89m 39s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. 
| | | | 215m 12s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/6/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7049 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 9fe524a117b6 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / d5174734cec6b8942857cca7dce4f2e87e3d9753 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/6/testReport/ | | Max. process+thread count | 918 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882576#comment-17882576 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357543733 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 35s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 21m 40s | | trunk passed | | +1 :green_heart: | compile | 3m 53s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 3m 48s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 0s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 47s | | trunk passed | | +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 31s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 43s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 7s | | the patch passed | | +1 :green_heart: | compile | 3m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 3m 53s | | the patch passed | | +1 :green_heart: | compile | 3m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 3m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 59s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 33s | | the patch passed | | +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 23s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 54s | | the patch passed | | +1 :green_heart: | shadedclient | 22m 55s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 0m 40s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 4m 25s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 90m 18s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 214m 45s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7049 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 43b873ba1713 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 3b38551284356efd4c48d7ec2eeb8ff213b05ca2 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/5/testReport/ | | Max. process+thread count | 928 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882554#comment-17882554 ] ASF GitHub Bot commented on YARN-11730: --- slfan1989 commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357315118 > Hi @slfan1989, the required changes have been made and CI checks passed; could you kindly review again, and possibly merge when you get a chance? Thank you for your time and support! LGTM.
> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
> Key: YARN-11730
> URL: https://issues.apache.org/jira/browse/YARN-11730
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager, yarn
> Affects Versions: 3.4.0
> Environment: Tested on multiple environments:
> A. Docker Environment:
> * Base OS: Ubuntu 20.04
> * Java 8 installed from OpenJDK.
> * Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
> * Verified behavior using a Hadoop snapshot in a containerized environment.
> * Performed Namenode formatting and validated service interactions through exposed ports.
> * Repo reference: arjunmohnot/hadoop-yarn-docker (https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main)
> B. Bare-metal Distributed Setup (RedHat Linux):
> * Running Java 8 in a High-Availability (HA) configuration with Zookeeper for the locking mechanism.
> * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM nodes, including state retention and proper node state transitions.
> * Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
> Reporter: Arjun Mohnot
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the "include" file are not reported until their corresponding NodeManagers (NMs) send their first heartbeat. However, nodes in the "exclude" file are instantly reflected in the "Decommissioned Hosts" section with a port value of -1.
> This design creates several challenges:
> * Untracked NodeManagers: During ResourceManager HA failover or a standalone RM restart, some nodes may not report back even though they are listed in the "include" file. These nodes neither appear in the LOST state nor are they represented in the RM's JMX metrics. This leaves them in an untracked state, making it difficult to monitor their status. In HDFS, similar behaviour exists and such nodes are marked as "DEAD".
> * Monitoring Gaps: Nodes in the "include" file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, leaving the RM's view of the total number of nodes incomplete.
> * Operational Impact: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, and others where validation depends on nodes being reflected in JMX as LOST, UNHEALTHY, DECOMMISSIONED, etc. Nodes that don't report require hacky workarounds to determine their accurate status.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the LOST state to any node listed in the "include" file that is not registered and not part of the "exclude" file, by default at RM startup or HA failover. This can be done by marking the node with a special port value of -2, signaling that the node is considered LOST but has not yet reported. Whenever a heartbeat is received for that nodeID, it will be transitioned from LOST to RUNNING, UNHEALTHY, or any other required state.
> h3. Key implementation points
> * Mark Unreported Nodes as LOST: Nodes in the "include" file that are not part of the RM's active node context should be automatically marked as LOST. This can be achieved by modifying the NodesListManager under the refreshHostsReader method, invoked during failover or manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the LOST state, with port -2 indicating the node is untracked.
> * For non-HA setups, this process can be triggered during RM service startup to mark nodes as LOST initially; they will gradually transition to their desired state when the heartbeat is received.
> * Handle Node Heartbeat and Transition: When a node sends i
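The selection rule in the proposal quoted above (an include-file host that is neither excluded nor yet registered becomes LOST with placeholder port -2) can be sketched as a simple set computation. The class and method names below are hypothetical and do not correspond to the actual NodesListManager code:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper mirroring the proposed selection rule: a host in the
// "include" file that is neither excluded nor currently registered is treated
// as an unregistered LOST node (surfaced with the placeholder port -2).
class UnregisteredLostNodes {
  static final int UNREGISTERED_LOST_PORT = -2;

  static Set<String> find(Set<String> includeHosts,
                          Set<String> excludeHosts,
                          Set<String> registeredHosts) {
    Set<String> lost = new TreeSet<>(includeHosts);
    lost.removeAll(excludeHosts);    // excluded hosts show up as DECOMMISSIONED (port -1) instead
    lost.removeAll(registeredHosts); // hosts that already heartbeated are tracked normally
    return lost;
  }
}
```

Everything left in the returned set would be dispatched as a LOST event with port -2 at startup or failover, matching the behaviour described in the proposal.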
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882538#comment-17882538 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2357185560

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 19s | | Docker mode activated. |
| _ Prechecks _ | | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| _ trunk Compile Tests _ | | | | |
| +0 :ok: | mvndep | 14m 55s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 20m 16s | | trunk passed |
| +1 :green_heart: | compile | 3m 39s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 3m 31s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 1m 3s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 58s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 58s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 55s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 52s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 37s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ | | | | |
| +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 14s | | the patch passed |
| +1 :green_heart: | compile | 3m 26s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 3m 26s | | the patch passed |
| +1 :green_heart: | compile | 3m 25s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 3m 25s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 54s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 47s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 42s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 4m 2s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 17s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ | | | | |
| +1 :green_heart: | unit | 0m 48s | | hadoop-yarn-api in the patch passed. |
| +1 :green_heart: | unit | 4m 42s | | hadoop-yarn-common in the patch passed. |
| +1 :green_heart: | unit | 90m 2s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 212m 20s | | |

| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7049 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux a65f08ea2011 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 3dac5d6df37371f533eee1c78e7fbc80593d9716 |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/4/testReport/ |
| Max. process+thread count | 931 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882513#comment-17882513 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1763997947

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:

@@ -1608,9 +1608,19 @@ protected void serviceStart() throws Exception {
     // Non HA case, start after RM services are started.
     if (!this.rmContext.isHAEnabled()) {
       transitionToActive();
+
+      // Refresh node state at the service startup to reflect the unregistered
+      // nodemanagers as LOST if the tracking for unregistered nodes flag is enabled.
+      // For HA setup, refreshNodes is already being called during the transition.
+      Configuration yarnConf = getConfig();
+      if (yarnConf.getBoolean(
+          YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+          YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+        this.rmContext.getNodesListManager().refreshNodes(yarnConf);

Review Comment: Hey @zeekling, thanks for your question! After reviewing potential edge cases and comparing the existing implementation, here's a summary of the different scenarios:

### Unregistered Lost Node Definition
- A node is marked as LOST when it is listed in the "include" file but not registered in the ResourceManager's node context, and is also not part of the "exclude" file during startup or HA failover. An unregistered lost node is indicated by a port value of -2.

### Case 1: Node Marked as LOST, Heartbeat Received
- When a node is marked as **LOST**, a lost event is dispatched, adding the node to the **active** and **inactive** node maps of the RMContext.
- If the node sends a heartbeat afterward, the transition method in `RMNodeImpl` checks for the same hostname with **port -2** (the LOST placeholder).
- If found, the nodeID is removed from the `rmContext` and re-registered with the NM's real port.
- It also decrements the LOST node counter and increments the ACTIVE node counter, ensuring clean state transitions.

### Case 2 (Rare Scenario): Race Condition
- A race condition may occur if the **ResourceTrackerService** starts before the RM processes the unregistered lost nodes, and a NodeManager (NM) sends its heartbeat quickly in parallel.
- Example:
  - When fetching nodes from `rmContext`, an NM (say **host1**) may not initially be present in the context.
  - Before this operation completes, **host1** may send a heartbeat and get registered with a valid port.
  - Meanwhile, the RM could still attempt to mark host1 as LOST with port -2, since it was not registered when the context was queried, resulting in two entries for the same host: one ACTIVE and one LOST.

### Details on Case 2
- For a node to register, the `ResourceTrackerService` must start during service startup. In HA mode, nodes only register once the RM becomes active.
- The current implementation for HA calls `refreshNodes` before the `transitionToActive` method, which rules out the race condition for HA setups since all unregistered nodes are dispatched first. For standalone RM setups there was a slight oversight; during testing, however, I did not encounter or replicate this issue, as the NM heartbeat can take time to arrive while the nodes were marked as LOST beforehand. With the recent changes, the `refreshNodes` operation now runs before the service starts, ensuring that unregistered nodes are consistently marked as LOST with **port -2** first. Only afterwards can NMs register themselves with a proper port during heartbeat reception, once the `ResourceTrackerService` starts its server (triggered during RM service start).

### Conclusion
The updated change should guarantee that nodes are properly marked as LOST **before** any heartbeats are processed. This eliminates the chance of falsely reporting nodes as LOST. I've validated this behavior through logs, which show that the lost event is dispatched first, and the heartbeats from NMs are received after service start. Let me know if this clears things up! 😊
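The Case-1 bookkeeping described above can be sketched as a small host-to-port map with -2 as the unregistered-LOST placeholder. `LostNodeTracker` and its fields are invented for illustration and do not correspond to the real `RMNodeImpl`/`RMContext` code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the LOST(-2) -> ACTIVE transition on heartbeat.
class LostNodeTracker {
  static final int UNREGISTERED_LOST_PORT = -2;

  // host -> port; port -2 marks an include-file host reported as LOST
  // before its NodeManager ever registered.
  final Map<String, Integer> nodes = new ConcurrentHashMap<>();
  int lostCount = 0;
  int activeCount = 0;

  // Startup/failover path: include-file host with no registration yet.
  void markUnregisteredLost(String host) {
    nodes.put(host, UNREGISTERED_LOST_PORT);
    lostCount++;
  }

  // Heartbeat path: swap the -2 placeholder for the NM's real port and
  // move the node from the LOST counter to the ACTIVE counter.
  void onHeartbeat(String host, int realPort) {
    Integer prev = nodes.put(host, realPort);
    if (prev == null || prev == UNREGISTERED_LOST_PORT) {
      if (prev != null) {
        lostCount--;   // leaving the placeholder LOST state
      }
      activeCount++;   // newly tracked as an active node
    }
    // repeated heartbeats from an already-active node change nothing
  }
}
```

A repeated heartbeat leaves both counters untouched, which matches the "clean state of transitions" the comment describes.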
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882422#comment-17882422 ] ASF GitHub Bot commented on YARN-11730: --- zeekling commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1763342594

## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:

@@ -1608,9 +1608,19 @@ protected void serviceStart() throws Exception {
     // Non HA case, start after RM services are started.
     if (!this.rmContext.isHAEnabled()) {
       transitionToActive();
+
+      // Refresh node state at the service startup to reflect the unregistered
+      // nodemanagers as LOST if the tracking for unregistered nodes flag is enabled.
+      // For HA setup, refreshNodes is already being called during the transition.
+      Configuration yarnConf = getConfig();
+      if (yarnConf.getBoolean(
+          YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+          YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+        this.rmContext.getNodesListManager().refreshNodes(yarnConf);

Review Comment: When RM starts, is it possible that NM will falsely report the Lost state?
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882403#comment-17882403 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2355769123 Hi @slfan1989, the changes have been made and CI checks passed—could you kindly review when you get a chance? Thank you for your time and support! > Resourcemanager node reporting enhancement for unregistered hosts > - > > Key: YARN-11730 > URL: https://issues.apache.org/jira/browse/YARN-11730 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Affects Versions: 3.4.0 > Environment: Tested on multiple environments: > A. Docker Environment{*}:{*} > * Base OS: *Ubuntu 20.04* > * *Java 8* installed from OpenJDK. > * Docker image includes Hadoop binaries, user configurations, and ports for > YARN services. > * Verified behavior using a Hadoop snapshot in a containerized environment. > * Performed Namenode formatting and validated service interactions through > exposed ports. > * Repo reference: > [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main] > B. Bare-metal Distributed Setup (RedHat Linux){*}:{*} > * Running *Java 8* in a High-Availability (HA) configuration with > *Zookeeper* for locking mechanism. > * Two ResourceManagers (RM) in HA: Failover tested between HA1 and HA2 RM > node, including state retention and proper node state transitions. > * Verified node state transitions during RM failover, ensuring nodes moved > between LOST, ACTIVE, and other states as expected. >Reporter: Arjun Mohnot >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > h3. Issue Overview > When the ResourceManager (RM) starts, nodes listed in the _"include"_ file > are not immediately reported until their corresponding NodeManagers (NMs) > send their first heartbeat. 
However, nodes in the _"exclude"_ file are > instantly reflected in the _"Decommissioned Hosts"_ section with a port value > -1. > This design creates several challenges: > * {*}Untracked NodeManagers{*}: During ResourceManager HA failover or RM > standalone restart, some nodes may not report back, even though they are > listed in the _"include"_ file. These nodes neither appear in the _LOST_ > state nor are they represented in the RM's JMX metrics. This results in an > untracked state, making it difficult to monitor their status. Similar > behaviour exists in HDFS, where such nodes are marked as {_}"DEAD"{_}. > * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until > they send their first heartbeat. This delay impacts real-time cluster > monitoring, leading to a lack of immediate visibility for these nodes in > the ResourceManager's view of the total number of nodes. > * {*}Operational Impact{*}: These unreported nodes cause operational > difficulties, particularly in automated workflows such as OS Upgrade > Automation (OSUA), node recovery automation, and others where validation > depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, > {_}DECOMMISSIONED{_}, etc. Nodes that don't report, however, require hacky > workarounds to determine their accurate status. > h3. Proposed Solution > To address these issues, we propose automatically assigning the _LOST_ state > to any node listed in the _"include"_ file that is not registered and not > part of the exclude file, by default at RM startup or HA failover. This > can be done by marking the node with a special port value {_}-2{_}, signaling > that the node is considered LOST but has not yet been reported. Whenever a > heartbeat is received for that {color:#de350b}nodeID{color}, it will be > transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other > desired state. > h3.
Key implementation points > * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of > the RM active node context should be automatically marked as {_}LOST{_}. This > can be achieved by modifying the _NodesListManager_ under the > {color:#de350b}refreshHostsReader{color} method, invoked during failover, or > manual node refresh operations. This logic should ensure that all > unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating > the node is untracked. > * For non-HA setups, this process can be triggered during RM service startup > to mark nodes as _LOST_ initially, and they will gradually transition to > their desired state when the heartbeat is received. > * Handle Node Heartbeat and Transition: When a node sends its first > heartbeat, the system should verify i
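The selection logic described above (include list minus registered nodes minus exclude list) can be sketched as a standalone Java snippet. Class and method names here are illustrative only, not from the actual patch; the real change lives in NodesListManager and dispatches an event per lost host rather than returning a list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LostNodeSketch {
  // A node qualifies as LOST-at-startup when it is in the include list,
  // has never registered with the RM, and is not in the exclude list.
  static List<String> nodesToMarkLost(Set<String> includes,
                                      Set<String> excludes,
                                      Set<String> registered) {
    List<String> lost = new ArrayList<>();
    for (String host : includes) {
      if (!registered.contains(host) && !excludes.contains(host)) {
        lost.add(host);
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    Set<String> includes = new HashSet<>(Arrays.asList("nm1", "nm2", "nm3"));
    Set<String> excludes = Collections.singleton("nm3");   // decommissioned
    Set<String> registered = Collections.singleton("nm1"); // has heartbeated
    // nm2 is included, unregistered, and not excluded -> mark it LOST
    System.out.println(nodesToMarkLost(includes, excludes, registered)); // prints [nm2]
  }
}
```

In the proposed patch this set difference is computed in refreshHostsReader and each resulting host is handed to the LOST-event dispatch path with port -2.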
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882399#comment-17882399 ] ASF GitHub Bot commented on YARN-11730: --- hadoop-yetus commented on PR #7049: URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2355749836 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 17s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 22s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 20m 25s | | trunk passed | | +1 :green_heart: | compile | 3m 43s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 3m 31s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 3s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 55s | | trunk passed | | +1 :green_heart: | javadoc | 2m 4s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 59s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 21m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 3m 30s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 3m 30s | | the patch passed | | +1 :green_heart: | compile | 3m 23s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 3m 23s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 56s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 37s | | the patch passed | | +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 3m 38s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 25s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 0m 49s | | hadoop-yarn-api in the patch passed. | | +1 :green_heart: | unit | 4m 40s | | hadoop-yarn-common in the patch passed. | | +1 :green_heart: | unit | 89m 26s | | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 35s | | The patch does not generate ASF License warnings. 
| | | | 210m 46s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7049 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 67b9fa27fe59 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / ec01c95944ca730c822f92c5bd57b452b2addbf2 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/3/testReport/ | | Max. process+thread count | 944 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882379#comment-17882379 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 merged PR #7043: URL: https://github.com/apache/hadoop/pull/7043 > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster, as it is not capable of running any container and would just fail the jobs. > {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by
Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
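The fail-fast behaviour this issue requests can be illustrated with a minimal standalone probe. The class and helper names below are hypothetical; the merged change wires the equivalent check into the NodeManager lifecycle rather than a main method:

```java
import java.io.File;

public class ExecutorPreflight {
  // Hypothetical preflight probe: true only if the configured binary
  // exists and is executable by the NodeManager user.
  static boolean canRunExecutor(String path) {
    File exec = new File(path);
    return exec.isFile() && exec.canExecute();
  }

  public static void main(String[] args) {
    String path = "/var/lib/yarn-ce/bin/container-executor";
    if (!canRunExecutor(path)) {
      // The actual fix would trigger NM shutdown or self-blacklisting here,
      // instead of letting every localizer fail with "No such file or directory".
      System.out.println("container-executor unusable, refusing to serve containers: " + path);
    }
  }
}
```

The point of the check is timing: detecting the broken binary once, up front, is cheaper than discovering it via a -1 exit code on every startLocalizer call.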
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882380#comment-17882380 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 commented on PR #7043: URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2355587939 Thanks @slfan1989 for the review, merged to trunk. > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster, as it is not capable of running any container and would just fail the jobs.
> {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882366#comment-17882366 ] ASF GitHub Bot commented on YARN-11709: --- hadoop-yetus commented on PR #7043: URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2355397523 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 48s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 49m 19s | | trunk passed | | +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 40s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 46s | | trunk passed | | +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 41s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 29s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 5s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 35s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 29s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 101 unchanged - 1 fixed = 101 total (was 102) | | +1 :green_heart: | mvnsite | 0m 36s | | the patch passed | | +1 :green_heart: | javadoc | 0m 35s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 33s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 27s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 2s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 30s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 170m 14s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7043 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 6c4d375b47da 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / e26571b945692b692964e5c6be46f66bc43b2b60 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/4/testReport/ | | Max. process+thread count | 584 (vs. ulimit of 5500) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/4/cons
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882338#comment-17882338 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1762881987 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: ## @@ -387,6 +388,115 @@ private void handleExcludeNodeList(boolean graceful, int timeout) { updateInactiveNodes(); } + /** + * Marks the unregistered nodes as LOST + * if the feature is enabled via a configuration flag. + * + * This method finds nodes that are present in the include list but are not + * registered with the ResourceManager. Such nodes are then marked as LOST. + * + * The steps are as follows: + * 1. Retrieve all hostnames of registered nodes from RM. + * 2. Identify the nodes present in the include list but are not registered + * 3. Remove nodes from the exclude list + * 4. Dispatch LOST events for filtered nodes to mark them as LOST. + * + * @param yarnConf Configuration object that holds the YARN configurations. + */ + private void markUnregisteredNodesAsLost(Configuration yarnConf) { +// Check if tracking unregistered nodes is enabled in the configuration +if (!yarnConf.getBoolean(YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES, +YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) { + LOG.debug("Unregistered node tracking is disabled. 
" + + "Skipping marking unregistered nodes as LOST."); + return; +} + +// Set to store all registered hostnames from both active and inactive lists +Set registeredHostNames = gatherRegisteredHostNames(); +// Event handler to dispatch LOST events +EventHandler eventHandler = this.rmContext.getDispatcher().getEventHandler(); + +// Identify nodes that are in the include list but are not registered +// and are not in the exclude list +List nodesToMarkLost = new ArrayList<>(); +HostDetails hostDetails = hostsReader.getHostDetails(); +Set includes = hostDetails.getIncludedHosts(); +Set excludes = hostDetails.getExcludedHosts(); + +for (String includedNode : includes) { + if (!registeredHostNames.contains(includedNode) && !excludes.contains(includedNode)) { +LOG.info("Lost node: " + includedNode); +nodesToMarkLost.add(includedNode); + } +} + +// Dispatch LOST events for the identified lost nodes +for (String lostNode : nodesToMarkLost) { + dispatchLostEvent(eventHandler, lostNode); +} + +// Log successful completion of marking unregistered nodes as LOST +LOG.info("Successfully marked unregistered nodes as LOST"); + } + + /** + * Gathers all registered hostnames from both active and inactive RMNodes. + * + * @return A set of registered hostnames. + */ + private Set gatherRegisteredHostNames() { +Set registeredHostNames = new HashSet<>(); +LOG.info("Getting all the registered hostnames"); + +// Gather all registered nodes (active) from RM into the set +for (RMNode node : this.rmContext.getRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +// Gather all inactive nodes from RM into the set +for (RMNode node : this.rmContext.getInactiveRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +return registeredHostNames; + } + + /** + * Dispatches a LOST event for a specified lost node. + * + * @param eventHandler The EventHandler used to dispatch the LOST event. 
+ * @param lostNode The hostname of the lost node for which the event is + * being dispatched. + */ + private void dispatchLostEvent(EventHandler eventHandler, String lostNode) { +// Generate a NodeId for the lost node with a special port -2 +NodeId nodeId = createLostNodeId(lostNode); +RMNodeEvent lostEvent = new RMNodeEvent(nodeId, RMNodeEventType.EXPIRE); +RMNodeImpl rmNode = new RMNodeImpl(nodeId, this.rmContext, lostNode, -2, -2, +new UnknownNode(lostNode), Resource.newInstance(0, 0), "unknown"); + +try { + // Dispatch the LOST event to signal the node is no longer active + eventHandler.handle(lostEvent); + + // After successful dispatch, update the node status in RMContext + // Set the node's timestamp for when it became untracked + rmNode.setUntrackedTimeStamp(Time.monotonicNow()); + + // Add the node to the active and inactive node maps in RMContext + this.rmContext.getRMNodes().put(nodeId, rmNode); + this.rmContext.getInactiveRMNodes().put(nodeId, rmNode); + + LOG.info("Successfully dispatched LOST event and deactivated node: " Review Comme
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882307#comment-17882307 ] ASF GitHub Bot commented on YARN-11730: --- arjunmohnot commented on code in PR #7049: URL: https://github.com/apache/hadoop/pull/7049#discussion_r1762644412 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: ## @@ -387,6 +388,115 @@ private void handleExcludeNodeList(boolean graceful, int timeout) { updateInactiveNodes(); } + /** + * Marks the unregistered nodes as LOST + * if the feature is enabled via a configuration flag. + * + * This method finds nodes that are present in the include list but are not + * registered with the ResourceManager. Such nodes are then marked as LOST. + * + * The steps are as follows: + * 1. Retrieve all hostnames of registered nodes from RM. + * 2. Identify the nodes present in the include list but are not registered + * 3. Remove nodes from the exclude list + * 4. Dispatch LOST events for filtered nodes to mark them as LOST. + * + * @param yarnConf Configuration object that holds the YARN configurations. + */ + private void markUnregisteredNodesAsLost(Configuration yarnConf) { +// Check if tracking unregistered nodes is enabled in the configuration +if (!yarnConf.getBoolean(YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES, +YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) { + LOG.debug("Unregistered node tracking is disabled. 
" + + "Skipping marking unregistered nodes as LOST."); + return; +} + +// Set to store all registered hostnames from both active and inactive lists +Set registeredHostNames = gatherRegisteredHostNames(); +// Event handler to dispatch LOST events +EventHandler eventHandler = this.rmContext.getDispatcher().getEventHandler(); + +// Identify nodes that are in the include list but are not registered +// and are not in the exclude list +List nodesToMarkLost = new ArrayList<>(); +HostDetails hostDetails = hostsReader.getHostDetails(); +Set includes = hostDetails.getIncludedHosts(); +Set excludes = hostDetails.getExcludedHosts(); + +for (String includedNode : includes) { + if (!registeredHostNames.contains(includedNode) && !excludes.contains(includedNode)) { +LOG.info("Lost node: " + includedNode); +nodesToMarkLost.add(includedNode); + } +} + +// Dispatch LOST events for the identified lost nodes +for (String lostNode : nodesToMarkLost) { + dispatchLostEvent(eventHandler, lostNode); +} + +// Log successful completion of marking unregistered nodes as LOST +LOG.info("Successfully marked unregistered nodes as LOST"); + } + + /** + * Gathers all registered hostnames from both active and inactive RMNodes. + * + * @return A set of registered hostnames. + */ + private Set gatherRegisteredHostNames() { +Set registeredHostNames = new HashSet<>(); +LOG.info("Getting all the registered hostnames"); + +// Gather all registered nodes (active) from RM into the set +for (RMNode node : this.rmContext.getRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +// Gather all inactive nodes from RM into the set +for (RMNode node : this.rmContext.getInactiveRMNodes().values()) { + registeredHostNames.add(node.getHostName()); +} + +return registeredHostNames; + } + + /** + * Dispatches a LOST event for a specified lost node. + * + * @param eventHandler The EventHandler used to dispatch the LOST event. 
+ * @param lostNode The hostname of the lost node for which the event is + * being dispatched. + */ + private void dispatchLostEvent(EventHandler eventHandler, String lostNode) { +// Generate a NodeId for the lost node with a special port -2 +NodeId nodeId = createLostNodeId(lostNode); +RMNodeEvent lostEvent = new RMNodeEvent(nodeId, RMNodeEventType.EXPIRE); +RMNodeImpl rmNode = new RMNodeImpl(nodeId, this.rmContext, lostNode, -2, -2, +new UnknownNode(lostNode), Resource.newInstance(0, 0), "unknown"); + +try { + // Dispatch the LOST event to signal the node is no longer active + eventHandler.handle(lostEvent); + + // After successful dispatch, update the node status in RMContext + // Set the node's timestamp for when it became untracked + rmNode.setUntrackedTimeStamp(Time.monotonicNow()); + + // Add the node to the active and inactive node maps in RMContext + this.rmContext.getRMNodes().put(nodeId, rmNode); + this.rmContext.getInactiveRMNodes().put(nodeId, rmNode); + + LOG.info("Successfully dispatched LOST event and deactivated node: " Review Comme
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882306#comment-17882306 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2354879586 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 22m 55s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | -1 :x: | mvninstall | 2m 12s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. | | -1 :x: | compile | 0m 23s | [/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. 
| | -1 :x: | compile | 0m 24s | [/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. | | -0 :warning: | checkstyle | 0m 21s | [/buildtool-branch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/buildtool-branch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | The patch fails to run checkstyle in hadoop-yarn-server-resourcemanager | | -1 :x: | mvnsite | 0m 24s | [/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in trunk failed. | | -1 :x: | javadoc | 0m 23s | [/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. 
| | -1 :x: | javadoc | 0m 23s | [/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-resourcemanager in trunk failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. | | -1 :x: | spotbugs | 0m 23s | [/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/3/artifact/out/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in trunk failed. | | +1 :green_heart: | shadedclient | 2m 45s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | -1 :x: | mvninstall | 0m 22s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-se
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882305#comment-17882305 ]

ASF GitHub Bot commented on YARN-11709:
---------------------------------------

slfan1989 commented on PR #7043:
URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2354877834

   LGTM +1.

> NodeManager should be shut down or blacklisted when it cannot run program
> "/var/lib/yarn-ce/bin/container-executor"
> -------------------------------------------------------------------------
>
>                 Key: YARN-11709
>                 URL: https://issues.apache.org/jira/browse/YARN-11709
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: container-executor
>            Reporter: Ferenc Erdelyi
>            Assignee: Benjamin Teke
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> When NodeManager encounters the below "No such file or directory" error
> reported against the "container-executor", it should give up participating
> in the cluster, as it is not capable of running any container and will
> otherwise just fail the jobs.
> {code:java}
> 2023-01-18 10:08:10,600 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_e159_1673543180101_9407_02_14 startLocalizer is : -1
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1250)
> Caused by: java.io.IOException: Cannot run program "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
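[Editor's note] The fail-fast behavior argued for in YARN-11709 can be sketched outside the NodeManager codebase. The helper below is hypothetical (the class, method names, and checks are illustrative, not the actual LinuxContainerExecutor API): it shows how a "Cannot run program ... No such file or directory" failure could be classified as fatal so the NodeManager shuts down instead of failing every job's containers one by one.

```java
import java.io.File;

// Hypothetical sketch of the fail-fast policy discussed in YARN-11709.
public class ContainerExecutorCheck {

  /** Returns true when the container-executor binary exists and is executable. */
  static boolean canRunContainerExecutor(String path) {
    File exe = new File(path);
    return exe.isFile() && exe.canExecute();
  }

  /**
   * Classifies an executor failure message. "error=2, No such file or
   * directory" means the binary itself is missing, which no per-container
   * retry can fix, so the NM should shut down (or be blacklisted) rather
   * than keep accepting containers.
   */
  static boolean isFatalExecutorError(String message) {
    return message != null
        && message.contains("Cannot run program")
        && message.contains("No such file or directory");
  }

  public static void main(String[] args) {
    String msg = "Cannot run program \"/var/lib/yarn-ce/bin/container-executor\""
        + ": error=2, No such file or directory";
    System.out.println("fatal=" + isFatalExecutorError(msg));
  }
}
```

A transient container failure (non-zero exit code) would not match this classification, which is the distinction the issue relies on: only environment-level errors justify taking the whole node out of service.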
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882304#comment-17882304 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

slfan1989 commented on code in PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#discussion_r1762632209


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java:
##########

@@ -387,6 +388,115 @@ private void handleExcludeNodeList(boolean graceful, int timeout) {
     updateInactiveNodes();
   }
 
+  /**
+   * Marks the unregistered nodes as LOST
+   * if the feature is enabled via a configuration flag.
+   *
+   * This method finds nodes that are present in the include list but are not
+   * registered with the ResourceManager. Such nodes are then marked as LOST.
+   *
+   * The steps are as follows:
+   * 1. Retrieve all hostnames of registered nodes from RM.
+   * 2. Identify the nodes present in the include list but are not registered
+   * 3. Remove nodes from the exclude list
+   * 4. Dispatch LOST events for filtered nodes to mark them as LOST.
+   *
+   * @param yarnConf Configuration object that holds the YARN configurations.
+   */
+  private void markUnregisteredNodesAsLost(Configuration yarnConf) {
+    // Check if tracking unregistered nodes is enabled in the configuration
+    if (!yarnConf.getBoolean(YarnConfiguration.ENABLE_TRACKING_FOR_UNREGISTERED_NODES,
+        YarnConfiguration.DEFAULT_ENABLE_TRACKING_FOR_UNREGISTERED_NODES)) {
+      LOG.debug("Unregistered node tracking is disabled. " +
+          "Skipping marking unregistered nodes as LOST.");
+      return;
+    }
+
+    // Set to store all registered hostnames from both active and inactive lists
+    Set<String> registeredHostNames = gatherRegisteredHostNames();
+    // Event handler to dispatch LOST events
+    EventHandler<Event> eventHandler = this.rmContext.getDispatcher().getEventHandler();
+
+    // Identify nodes that are in the include list but are not registered
+    // and are not in the exclude list
+    List<String> nodesToMarkLost = new ArrayList<>();
+    HostDetails hostDetails = hostsReader.getHostDetails();
+    Set<String> includes = hostDetails.getIncludedHosts();
+    Set<String> excludes = hostDetails.getExcludedHosts();
+
+    for (String includedNode : includes) {
+      if (!registeredHostNames.contains(includedNode) && !excludes.contains(includedNode)) {
+        LOG.info("Lost node: " + includedNode);
+        nodesToMarkLost.add(includedNode);
+      }
+    }
+
+    // Dispatch LOST events for the identified lost nodes
+    for (String lostNode : nodesToMarkLost) {
+      dispatchLostEvent(eventHandler, lostNode);
+    }
+
+    // Log successful completion of marking unregistered nodes as LOST
+    LOG.info("Successfully marked unregistered nodes as LOST");
+  }
+
+  /**
+   * Gathers all registered hostnames from both active and inactive RMNodes.
+   *
+   * @return A set of registered hostnames.
+   */
+  private Set<String> gatherRegisteredHostNames() {
+    Set<String> registeredHostNames = new HashSet<>();
+    LOG.info("Getting all the registered hostnames");
+
+    // Gather all registered nodes (active) from RM into the set
+    for (RMNode node : this.rmContext.getRMNodes().values()) {
+      registeredHostNames.add(node.getHostName());
+    }
+
+    // Gather all inactive nodes from RM into the set
+    for (RMNode node : this.rmContext.getInactiveRMNodes().values()) {
+      registeredHostNames.add(node.getHostName());
+    }
+
+    return registeredHostNames;
+  }
+
+  /**
+   * Dispatches a LOST event for a specified lost node.
+   *
+   * @param eventHandler The EventHandler used to dispatch the LOST event.
+   * @param lostNode The hostname of the lost node for which the event is
+   *                 being dispatched.
+   */
+  private void dispatchLostEvent(EventHandler<Event> eventHandler, String lostNode) {
+    // Generate a NodeId for the lost node with a special port -2
+    NodeId nodeId = createLostNodeId(lostNode);
+    RMNodeEvent lostEvent = new RMNodeEvent(nodeId, RMNodeEventType.EXPIRE);
+    RMNodeImpl rmNode = new RMNodeImpl(nodeId, this.rmContext, lostNode, -2, -2,
+        new UnknownNode(lostNode), Resource.newInstance(0, 0), "unknown");
+
+    try {
+      // Dispatch the LOST event to signal the node is no longer active
+      eventHandler.handle(lostEvent);
+
+      // After successful dispatch, update the node status in RMContext
+      // Set the node's timestamp for when it became untracked
+      rmNode.setUntrackedTimeStamp(Time.monotonicNow());
+
+      // Add the node to the active and inactive node maps in RMContext
+      this.rmContext.getRMNodes().put(nodeId, rmNode);
+      this.rmContext.getInactiveRMNodes().put(nodeId, rmNode);
+
+      LOG.info("Successfully dispatched LOST event and deactivated node: "

Review Comment
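[Editor's note] The node-selection step in the diff above reduces to a set difference over the include, exclude, and registered host sets. A standalone sketch, with hypothetical class and method names rather than the actual NodesListManager code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch of the "which hosts should be marked LOST" selection:
// a host is a candidate when it is in the include list, not registered
// with the RM (neither active nor inactive), and not explicitly excluded.
public class LostNodeSelector {

  static List<String> computeNodesToMarkLost(Set<String> includes,
      Set<String> excludes, Set<String> registered) {
    List<String> lost = new ArrayList<>();
    for (String host : includes) {
      if (!registered.contains(host) && !excludes.contains(host)) {
        lost.add(host);
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    System.out.println(computeNodesToMarkLost(
        Set.of("nodeA", "nodeB", "nodeC"),  // include list
        Set.of("nodeC"),                    // exclude list
        Set.of("nodeA")));                  // registered with the RM
  }
}
```

Excluded hosts are skipped because they already surface as DECOMMISSIONED; only hosts that are expected (included) yet never heard from need the synthetic LOST marking.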
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882293#comment-17882293 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

hadoop-yetus commented on PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2354793253

   :confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 19s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 46s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 20m 2s | | trunk passed |
| +1 :green_heart: | compile | 3m 47s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 3m 29s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 1m 1s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 54s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 57s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 57s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 40s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 25s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 10s | | the patch passed |
| +1 :green_heart: | compile | 3m 23s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 3m 23s | | the patch passed |
| +1 :green_heart: | compile | 3m 25s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 3m 25s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 54s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 40s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 55s | | the patch passed |
| +1 :green_heart: | shadedclient | 21m 38s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 0m 46s | | hadoop-yarn-api in the patch passed. |
| +1 :green_heart: | unit | 4m 39s | | hadoop-yarn-common in the patch passed. |
| +1 :green_heart: | unit | 89m 44s | | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. |
| | | 212m 34s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7049 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux e05255413a5c 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 072bb20b1cbc1f06a5815aa338c4703cd390054c |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/2/testReport/ |
| Max. process+thread count | 926 (vs. ulimit of 5500) |
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882208#comment-17882208 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

hadoop-yetus commented on PR #7049:
URL: https://github.com/apache/hadoop/pull/7049#issuecomment-2354217074

   :broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 6m 55s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 14m 23s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 20m 14s | | trunk passed |
| +1 :green_heart: | compile | 3m 50s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 3m 24s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 1m 0s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 55s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 53s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 53s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 3m 47s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 16s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 22s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 1m 11s | | the patch passed |
| +1 :green_heart: | compile | 3m 25s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 3m 25s | | the patch passed |
| +1 :green_heart: | compile | 3m 24s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 3m 24s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 53s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn.txt) | hadoop-yarn-project/hadoop-yarn: The patch generated 6 new + 248 unchanged - 0 fixed = 254 total (was 248) |
| +1 :green_heart: | mvnsite | 1m 43s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 39s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 1m 49s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| -1 :x: | spotbugs | 1m 22s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/1/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | shadedclient | 21m 31s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 0m 44s | | hadoop-yarn-api in the patch passed. |
| +1 :green_heart: | unit | 4m 40s | | hadoop-yarn-common in the patch passed. |
| -1 :x: | unit | 89m 20s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7049/1/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. |
| | | 216m 55s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcema
[jira] [Updated] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-11730:
----------------------------------
    Labels: pull-request-available  (was: )

> Resourcemanager node reporting enhancement for unregistered hosts
> -----------------------------------------------------------------
>
>                 Key: YARN-11730
>                 URL: https://issues.apache.org/jira/browse/YARN-11730
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, yarn
>    Affects Versions: 3.4.0
>         Environment: Tested on multiple environments:
> A. Docker Environment{*}:{*}
> * Base OS: *Ubuntu 20.04*
> * *Java 8* installed from OpenJDK.
> * Docker image includes Hadoop binaries, user configurations, and ports for YARN services.
> * Verified behavior using a Hadoop snapshot in a containerized environment.
> * Performed Namenode formatting and validated service interactions through exposed ports.
> * Repo reference: [arjunmohnot/hadoop-yarn-docker|https://github.com/arjunmohnot/hadoop-yarn-docker/tree/main]
> B. Bare-metal Distributed Setup (RedHat Linux){*}:{*}
> * Running *Java 8* in a High-Availability (HA) configuration with *Zookeeper* for the locking mechanism.
> * Two ResourceManagers (RM) in HA: failover tested between the HA1 and HA2 RM nodes, including state retention and proper node state transitions.
> * Verified node state transitions during RM failover, ensuring nodes moved between LOST, ACTIVE, and other states as expected.
>            Reporter: Arjun Mohnot
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> h3. Issue Overview
> When the ResourceManager (RM) starts, nodes listed in the _"include"_ file are not immediately reported until their corresponding NodeManagers (NMs) send their first heartbeat. However, nodes in the _"exclude"_ file are instantly reflected in the _"Decommissioned Hosts"_ section with a port value of -1.
> This design creates several challenges:
> * {*}Untracked Nodemanagers{*}: During ResourceManager HA failover or RM standalone restart, some nodes may not report back, even though they are listed in the _"include"_ file. These nodes neither appear in the _LOST_ state nor are they represented in the RM's JMX metrics. This results in an untracked state, making it difficult to monitor their status. A similar behaviour exists in HDFS, where such nodes are marked as {_}"DEAD"{_}.
> * {*}Monitoring Gaps{*}: Nodes in the _"include"_ file are not visible until they send their first heartbeat. This delay impacts real-time cluster monitoring, leading to a lack of immediate visibility for these nodes in the ResourceManager's view of the total number of nodes.
> * {*}Operational Impact{*}: These unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, and others where validation depends on nodes being reflected in JMX as {_}LOST{_}, {_}UNHEALTHY{_}, {_}DECOMMISSIONED{_}, etc. Nodes that don't report, however, require hacky workarounds to determine their accurate status.
> h3. Proposed Solution
> To address these issues, we propose automatically assigning the _LOST_ state to any node listed in the _"include"_ file by default at RM startup or HA failover. This can be done by marking the node with a special port value {_}-2{_}, signaling that the node is considered LOST but has not yet reported. Whenever a heartbeat is received for that {color:#de350b}nodeID{color}, it will be transitioned from _LOST_ to {_}RUNNING{_}, {_}UNHEALTHY{_}, or any other desired state.
> h3. Key implementation points
> * Mark Unreported Nodes as LOST: Nodes in the _"include"_ file not part of the RM active node context should be automatically marked as {_}LOST{_}. This can be achieved by modifying the _NodesListManager_ under the {color:#de350b}refreshHostsReader{color} method, invoked during failover or manual node refresh operations. This logic should ensure that all unregistered nodes are moved to the _LOST_ state, with port _-2_ indicating the node is untracked.
> * For non-HA setups, this process can be triggered during RM service startup to mark nodes as _LOST_ initially, and they will gradually transition to their desired state when the heartbeat is received.
> * Handle Node Heartbeat and Transition: When a node sends its first heartbeat, the system should verify if the node is listed in {color:#de350b}getInactiveRMNodes(){color}. If the node exists in the _LOST_ state, the RM should remove it from the inactive list, decrement the _LOST_ node count, and handle the transition back to the active node set.
> * This logic can be placed in the state transition method within {color:#de35
[jira] [Commented] (YARN-11730) Resourcemanager node reporting enhancement for unregistered hosts
[ https://issues.apache.org/jira/browse/YARN-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882178#comment-17882178 ]

ASF GitHub Bot commented on YARN-11730:
---------------------------------------

arjunmohnot opened a new pull request, #7049:
URL: https://github.com/apache/hadoop/pull/7049

   ### Description of PR

   1. Overview

   When the ResourceManager starts, nodes listed in the "include" file are not immediately reported until their corresponding NodeManagers send their first heartbeat. However, nodes in the "exclude" file are instantly reflected in the "Decommissioned Hosts" section with a port value of -1.

   2. Challenges

   1. **Untracked NodeManagers**: During Resourcemanager HA failover or RM standalone restart, some nodes may not report back, even though they are listed in the _"include"_ file. These nodes neither appear in the _LOST_ state nor are they represented in the RM's JMX metrics. This results in an untracked state, making it difficult to monitor their status. A similar behaviour exists in HDFS, where such datanodes are marked as _"DEAD"_.
   2. **Monitoring Gaps**: Nodes in the "include" file are not visible until they send their first heartbeat, impacting real-time cluster monitoring when depending on the cluster metrics sink.
   3. **Operational Impact**: Unreported nodes cause operational difficulties, particularly in automated workflows such as OS Upgrade Automation (OSUA), node recovery automation, etc., requiring workarounds to determine accurate status for nodes that don't report.

   3. Proposed Solution

   To address these issues, the code automatically assigns the **_LOST_** state to nodes listed in the _"include"_ file that are not registered and not part of the exclude file at RM startup or during HA failover. This is indicated by a special port value of **-2**, marking the node as LOST but not yet reported. Once a heartbeat is received for that node, it will transition from LOST to RUNNING, UNHEALTHY, or any other desired state.

   4. Key Implementation Points

   1. **Mark Unreported Nodes as LOST**:
      - **Class Modified**: `NodesListManager`
      - **Method**: `refreshHostsReader`
      - **Functionality**:
        - Automatically marks nodes listed in the **"include"** file as **LOST** if they are not part of the RM active node context.
        - For non-HA setups, this process is triggered during **RM service startup**, ensuring unregistered nodes are initially set to **LOST**.
        - Port value **-2** indicates that the node is untracked.
   2. **Handle Node Heartbeat and Transition**:
      - **Class Modified**: `RMNodeImpl`
      - **Method**: State transition method
      - **Functionality**:
        - Upon receiving the first heartbeat from a node, the system checks if the node exists in the **LOST** state (if the nodeID has port -2 for that host) by verifying against `getInactiveRMNodes()`.
        - If the node is found in the **LOST** state:
          - Remove the node from the inactive node list.
          - Remove the node from the active node list to register it with a new nodeID having its required port.
          - Maintain the hostname in the RM context for proper host tracking.
          - Decrement the count of **LOST** nodes.
          - Re-register the node with the new nodeID and transition it back to the active node set, ensuring it recovers gracefully from the **LOST** state.
        - This logic ensures a smooth transition for nodes from **NEW** to **LOST** and back to active upon heartbeat reception.

   5. Flow Diagram

   ```yaml
   +--------------------------+
   | RM Startup / HA Failover |
   +--------------------------+
                |
                v
     Check Nodes in RM Context
                |
       +--------+---------------------------+
       |                                    |
   Not Registered & Not in         Registered or in
   Exclude File                    Exclude File
       |                                    |
       v                                    v
   Mark Node as LOST (port -2)     Node processed normally
       |
       v
   Wait for Heartbeat
       |
       v
   Receive Heartbeat
       |
       v
   Node State Check
       |
       +------------------+-----------------+
       |                                    |
   Previous NodeID Removed         Same Hostname With Port -2
       |                           Still Remains in the RM Context
       v                                    |
   [No Further Transition]                  |
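[Editor's note] The heartbeat-recovery step described in the PR above can be modeled with plain maps. This is a simplified, hypothetical sketch (real RM state lives in RMContext, RMNodeImpl, and NodeId objects), showing only the port -2 placeholder check and the LOST-counter decrement:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of "revive a placeholder LOST node on first heartbeat".
public class HeartbeatTransitionSketch {

  static final int UNTRACKED_PORT = -2; // placeholder port for unregistered hosts

  final Map<String, Integer> inactiveNodes = new HashMap<>(); // host -> port
  int numLostNodes;

  /** Returns true when the heartbeat revived a port -2 placeholder entry. */
  boolean onFirstHeartbeat(String host) {
    Integer port = inactiveNodes.get(host);
    if (port != null && port == UNTRACKED_PORT) {
      inactiveNodes.remove(host);  // drop the placeholder nodeID
      numLostNodes--;              // metrics stop counting it as LOST
      return true;                 // caller re-registers with the real port
    }
    return false;                  // unknown host, or a genuine LOST node
  }
}
```

A heartbeat from a host with no placeholder entry falls through unchanged, which matches the diagram's right-hand branch: already-registered or excluded hosts are processed normally.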
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882119#comment-17882119 ]

ASF GitHub Bot commented on YARN-11709:
---------------------------------------

hadoop-yetus commented on PR #7043:
URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2353412188

   :confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 48s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 49m 18s | | trunk passed |
| +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 40s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 46s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 28s | | trunk passed |
| +1 :green_heart: | shadedclient | 39m 48s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 1m 22s | | the patch passed |
| +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 1m 18s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 28s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 101 unchanged - 1 fixed = 101 total (was 102) |
| +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 27s | | the patch passed |
| +1 :green_heart: | shadedclient | 40m 7s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 24m 30s | | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 170m 0s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7043 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux dc7d10aaaf50 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / e2d0786c7c14222481f9995935fa8b6cb5bf5882 |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/2/testReport/ |
| Max. process+thread count | 533 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/2/cons
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882117#comment-17882117 ]

ASF GitHub Bot commented on YARN-11709:
---------------------------------------

hadoop-yetus commented on PR #7043:
URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2353393874

   :broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 11m 38s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 48m 31s | | trunk passed |
| +1 :green_heart: | compile | 1m 47s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 1m 44s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 44s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 54s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 57s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 45m 45s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| -1 :x: | mvninstall | 0m 35s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt) | hadoop-yarn-server-nodemanager in the patch failed. |
| -1 :x: | compile | 0m 24s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. |
| -1 :x: | javac | 0m 24s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04. |
| -1 :x: | compile | 0m 23s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. |
| -1 :x: | javac | 0m 23s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-server-nodemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05. |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 21s | [/buildtool-patch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/3/artifact/out/buildtool-patch-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt) | The patc
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882095#comment-17882095 ] ASF GitHub Bot commented on YARN-11702: --- zeekling commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2353173990 The scheduler will not discard resource application requests, so why do we need to apply multiple times? > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 Spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear from the logs that the 12 extra allocated containers are being > ignored on the Spark side. In order to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in the increment and decrement paths of container requests to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 De
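The over-allocation reported above can be illustrated with a minimal sketch of the window between scheduler-side decrements and the AM's next ask update. This is purely hypothetical code with illustrative names (`PendingAsk`, `tryAllocate`), not the actual `AppSchedulingInfo` logic: if the AM re-sends its stale total of 20 while 12 grants are already in flight, the pending count is reset upward and 32 containers end up allocated.

```java
// Hypothetical sketch of the over-allocation window. The scheduler thread
// decrements the pending count per grant, while the AM's next heartbeat
// overwrites it with a stale total computed before those grants were seen.
class PendingAsk {
    private int pending; // containers still owed to the AM

    synchronized void updateAsk(int total) { pending = total; } // AM re-sync
    synchronized boolean tryAllocate() {                        // scheduler grant
        if (pending <= 0) {
            return false;
        }
        pending--;
        return true;
    }
}

public class OverAllocationDemo {
    public static void main(String[] args) {
        PendingAsk ask = new PendingAsk();
        ask.updateAsk(20);                 // AM asks for 20 executors
        int allocated = 0;
        for (int i = 0; i < 12; i++) {     // scheduler grants 12 of them
            if (ask.tryAllocate()) {
                allocated++;
            }
        }
        // AM heartbeats before it has processed the 12 grants and re-sends the
        // stale total of 20 instead of the remaining 8:
        ask.updateAsk(20);
        while (ask.tryAllocate()) {
            allocated++;
        }
        System.out.println(allocated);     // 32 = 20 requested + 12 extra
    }
}
```

This matches the symptom in the report: 20 requested, 32 allocated, 12 released by Spark within seconds.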
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881674#comment-17881674 ] ASF GitHub Bot commented on YARN-11709: --- hadoop-yetus commented on PR #7043: URL: https://github.com/apache/hadoop/pull/7043#issuecomment-2350004088 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 17m 20s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 49m 42s | | trunk passed | | +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 41s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 45s | | trunk passed | | +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 40s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 28s | | trunk passed | | +1 :green_heart: | shadedclient | 39m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 35s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 18s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 28s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 2 new + 101 unchanged - 1 fixed = 103 total (was 102) | | +1 :green_heart: | mvnsite | 0m 35s | | the patch passed | | -1 :x: | javadoc | 0m 35s | [/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/1/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 generated 1 new + 195 unchanged - 0 fixed = 196 total (was 195) | | -1 :x: | javadoc | 0m 32s | 
[/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7043/1/artifact/out/results-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt) | hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05 with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 generated 1 new + 195 unchanged - 0 fixed = 196 total (was 195) | | +1 :green_heart: | spotbugs | 1m 29s | | the patch passed | | +1 :green_heart: | shadedclient | 40m 16s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 24m 31s | | hadoop-yarn-server-nodemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. | |
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881613#comment-17881613 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 opened a new pull request, #7043: URL: https://github.com/apache/hadoop/pull/7043 ### Description of PR The startLocalizer step didn't have the same error checks as the normal container launch. Updated the code to mark the NM unhealthy if the container-executor has config issues. ### How was this patch tested? ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster, as it is not capable of running any container and would otherwise just fail the jobs. 
> {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_ > 14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.j > ava:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
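The guard described in the PR can be sketched as a simple pre-flight check on the container-executor binary. The class and method names below are illustrative only, not the actual Hadoop NodeManager API; in the real patch, the result would feed into the NM's health status so the RM stops scheduling containers on the node.

```java
import java.io.File;

// Hedged sketch: detect a missing or non-executable container-executor and
// report it as a node-level fault instead of failing individual containers.
public class ExecutorCheck {
    // Returns null if the binary looks usable, otherwise a diagnostic string.
    static String checkBinary(String path) {
        File exec = new File(path);
        if (!exec.exists()) {
            return "container-executor not found at " + path;
        }
        if (!exec.canExecute()) {
            return "container-executor at " + path + " is not executable";
        }
        return null;
    }

    public static void main(String[] args) {
        String report = checkBinary("/var/lib/yarn-ce/bin/container-executor");
        if (report != null) {
            // In the real NodeManager this would mark the node unhealthy,
            // which is what the PR does for config issues in startLocalizer.
            System.out.println("UNHEALTHY: " + report);
        }
    }
}
```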
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881593#comment-17881593 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2349142064 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 56s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 32s | | trunk passed | | +1 :green_heart: | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 55s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 1s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 51s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 2s | | trunk passed | | +1 :green_heart: | shadedclient | 36m 49s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 49s | | the patch passed | | +1 :green_heart: | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 53s | | the patch passed | | +1 :green_heart: | compile | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 47s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 42s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/2/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 4 new + 96 unchanged - 0 fixed = 100 total (was 96) | | +1 :green_heart: | mvnsite | 0m 51s | | the patch passed | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 58s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 34s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 123m 15s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/2/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 257m 41s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.yarn.server.resourcemanager.TestApplicationACLs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAmbiguousLeafs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerApps | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spo
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881525#comment-17881525 ] ASF GitHub Bot commented on YARN-11708: --- hadoop-yetus commented on PR #7041: URL: https://github.com/apache/hadoop/pull/7041#issuecomment-2348693336 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 17m 25s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 46s | | trunk passed | | +1 :green_heart: | compile | 1m 2s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 0m 57s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 0m 56s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 1s | | trunk passed | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 34s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 0m 52s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 0m 52s | | the patch passed | | +1 :green_heart: | compile | 0m 49s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 0m 49s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 43s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 4 new + 96 unchanged - 0 fixed = 100 total (was 96) | | +1 :green_heart: | mvnsite | 0m 51s | | the patch passed | | +1 :green_heart: | javadoc | 0m 45s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 42s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 0s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 48s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 122m 0s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/1/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. | | +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. 
| | | | 271m 4s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.yarn.server.resourcemanager.TestApplicationACLs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAutoQueueCreation | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAmbiguousLeafs | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerApps | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7041/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7041 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spo
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881516#comment-17881516 ] ASF GitHub Bot commented on YARN-11708: --- susheel-gupta commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1758603557 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3364,12 +3364,29 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, } } - @Override + @Override public long checkAndGetApplicationLifetime(String queueName, long lifetimeRequestedByApp) { +CSQueue queue = getQueue(queueName); + +// This handles the case where the queue does not exist, +// addressing the issue related to YARN-11708. +if (queue == null) { + QueuePath queuePath = new QueuePath(queueName); + + writeLock.lock(); Review Comment: Thanks for reviewing, yes it should be. > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. 
> Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code}
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881508#comment-17881508 ] ASF GitHub Bot commented on YARN-11708: --- K0K0V0K commented on code in PR #7041: URL: https://github.com/apache/hadoop/pull/7041#discussion_r1758562684 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3364,12 +3364,29 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, } } - @Override + @Override public long checkAndGetApplicationLifetime(String queueName, long lifetimeRequestedByApp) { +CSQueue queue = getQueue(queueName); + +// This handles the case where the queue does not exist, +// addressing the issue related to YARN-11708. +if (queue == null) { + QueuePath queuePath = new QueuePath(queueName); + + writeLock.lock(); + try { +queue = queueManager.createQueue(queuePath); + } catch (YarnException | IOException e) { +LOG.error("Failed to create queue '{}': ", queueName, e); Review Comment: ```suggestion LOG.error("Failed to create queue " + queueName, e); ``` Because the current log line won't log the exception, as it will think that it is the 2nd param in the message ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java: ## @@ -3364,12 +3364,29 @@ public boolean moveReservedContainer(RMContainer toBeMovedContainer, } } - @Override + @Override public long checkAndGetApplicationLifetime(String queueName, long lifetimeRequestedByApp) { +CSQueue queue = getQueue(queueName); + +// This handles the case where the queue does not exist, +// addressing the issue related to YARN-11708. 
+if (queue == null) { + QueuePath queuePath = new QueuePath(queueName); + + writeLock.lock(); Review Comment: This write lock should be before the get queue call, to properly handle the race condition, right? > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. > Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code}
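The race the reviewer points at, looking up the queue outside the lock and then creating it under the lock, is usually handled with a check-lock-recheck pattern: re-reading under the write lock ensures that two threads that both saw null do not both create the queue. Below is a minimal standalone sketch of that pattern; `getOrCreate` and the plain `Object` queue are stand-ins, not the real `CSQueue`/`queueManager.createQueue` API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch of check-lock-recheck queue creation. Without the re-check
// under the write lock, two threads can both observe a missing queue and
// both attempt to create it.
public class QueueCreation {
    private final Map<String, Object> queues = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    Object getOrCreate(String name) {
        Object q = queues.get(name);
        if (q != null) {
            return q;                 // fast path: queue already exists
        }
        lock.writeLock().lock();
        try {
            q = queues.get(name);     // re-check: another thread may have won
            if (q == null) {
                q = new Object();     // stand-in for queueManager.createQueue(...)
                queues.put(name, q);
            }
            return q;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Every caller then observes a single queue instance per name, so template properties such as maximum-application-lifetime are applied consistently from the first lookup onward.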
[jira] [Updated] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11708: -- Labels: pull-request-available (was: ) > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. > Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code}
[jira] [Commented] (YARN-11708) Setting maximum-application-lifetime using AQCv2 templates doesn't apply on the first submitted app
[ https://issues.apache.org/jira/browse/YARN-11708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881478#comment-17881478 ] ASF GitHub Bot commented on YARN-11708: --- susheelgupta7 opened a new pull request, #7041: URL: https://github.com/apache/hadoop/pull/7041 …s doesn't apply on the first submitted app ### Description of PR ### How was this patch tested? ### For code changes: - [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Setting maximum-application-lifetime using AQCv2 templates doesn't apply on > the first submitted app > > > Key: YARN-11708 > URL: https://issues.apache.org/jira/browse/YARN-11708 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Benjamin Teke >Assignee: Susheel Gupta >Priority: Major > > Setting the _maximum-application-lifetime_ property using AQC v2 templates > (_yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime_) > doesn't apply to the first submitted application (through which the queue is > created), only to the subsequent ones. It should apply to the first > application as well. > Repro steps: > Create the queue root.test, enable AQCv2 on it. 
> Provide the following template properties: > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.maximum-application-lifetime=8 > yarn.scheduler.capacity.root.test.auto-queue-creation-v2.template.default-application-lifetime=8 > The first submitted application, which triggers the queue creation will have > unlimited lifetime: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : UNLIMITED > RemainingTime : -1seconds > Final-State : SUCCEEDED > {code} > The subsequent applications will be killed after the lifetime expires: > {code:java} > TimeoutType : LIFETIME > ExpiryTime : 2024-07-23T15:02:41.386+ > RemainingTime : 0seconds > Final-State : KILLED > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
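The repro above suggests the template lifetime is only read once the queue already exists, so the queue-creating application misses it. A stand-alone toy model of that ordering problem (nothing below is real CapacityScheduler code; the property map and method names are invented purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: the AQCv2 template's maximum-application-lifetime must be
// copied onto the queue *before* the first application's timeout is
// evaluated, otherwise that app sees the default (unlimited, -1) value.
public class TemplateLifetimeSketch {
  static final long UNLIMITED = -1L;

  // Buggy order: read the queue property before applying the template,
  // which is what the first (queue-creating) app effectively observes.
  static long resolveLifetime(Map<String, Long> queueProps, Map<String, Long> templateProps) {
    return queueProps.getOrDefault("maximum-application-lifetime", UNLIMITED);
  }

  // Fixed order: apply template values during auto-creation, then read.
  static long resolveLifetimeFixed(Map<String, Long> queueProps, Map<String, Long> templateProps) {
    templateProps.forEach(queueProps::putIfAbsent);
    return queueProps.getOrDefault("maximum-application-lifetime", UNLIMITED);
  }

  public static void main(String[] args) {
    Map<String, Long> template = new HashMap<>();
    template.put("maximum-application-lifetime", 8L);
    System.out.println(resolveLifetime(new HashMap<>(), template));      // -1 (first app unlimited)
    System.out.println(resolveLifetimeFixed(new HashMap<>(), template)); // 8
  }
}
```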
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880352#comment-17880352 ] ASF GitHub Bot commented on YARN-11664: --- steveloughran commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2338234175 removing the package-info class would be the simpler solution, but we need to understand how the regression got in. your PR seemed to take, but everything after it broke. > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. All test classes in Yarn module is not considered > > > A quick search in Yarn module revealed following HDFS dependencies > 1. Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. 
Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
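The last suggestion in the quoted description — catching a hadoop-common parent type rather than the HDFS-specific DSQuotaExceededException — can be demonstrated with stub classes that mirror the hierarchy (the classes below are stand-ins defined locally, not the real Hadoop classes):

```java
import java.io.IOException;

// Stub hierarchy: the HDFS-specific quota exception extends a generic
// capacity-exceeded exception that lives in hadoop-common.
class ClusterStorageCapacityExceededException extends IOException {
  ClusterStorageCapacityExceededException(String msg) { super(msg); }
}

class DSQuotaExceededException extends ClusterStorageCapacityExceededException {
  DSQuotaExceededException(String msg) { super(msg); }
}

public class CatchParentSketch {
  // By checking only the parent type, YARN code needs no HDFS classes on
  // its classpath, yet still recognizes the HDFS subclass when HDFS is
  // the backing filesystem.
  static String classify(IOException e) {
    if (e instanceof ClusterStorageCapacityExceededException) {
      return "capacity-exceeded";
    }
    return "other";
  }

  public static void main(String[] args) {
    System.out.println(classify(new DSQuotaExceededException("quota"))); // capacity-exceeded
    System.out.println(classify(new IOException("disk error")));         // other
  }
}
```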
[jira] [Commented] (YARN-11729) Broken 'AM Node Web UI' link on App details page
[ https://issues.apache.org/jira/browse/YARN-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880281#comment-17880281 ] ASF GitHub Bot commented on YARN-11729: --- K0K0V0K opened a new pull request, #7030: URL: https://github.com/apache/hadoop/pull/7030 ### Description of PR - the current link ends with a '/' - with this ending the RM won't open the link - to fix the issue we remove the last '/' char from the URL ### How was this patch tested? - manually ran the example job and clicked on the generated link ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? > Broken 'AM Node Web UI' link on App details page > > > Key: YARN-11729 > URL: https://issues.apache.org/jira/browse/YARN-11729 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > > h6. Description: > Generated 'AM Node Web UI' link cannot be interpreted by RM. > h6. Reproduction > - Run MapReduce pi example job > - Open the app details page > - Click on AM Node Web UI > - Page won't load > h6. Fix: > The problem is the URL finishes with a '/' so RM cannot open the node page.
> To fix this we should modify the UI code to generate the URL without the last > '/' char -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
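The described fix is a one-line URL normalization. The yarn-ui-v2 code is JavaScript, but the idea can be sketched in Java as follows (the helper name is illustrative, not the actual UI code):

```java
public class NodeLink {
  // Drop a single trailing '/' so the RM can resolve the node page;
  // URLs without a trailing slash are returned unchanged.
  static String stripTrailingSlash(String url) {
    if (url != null && url.endsWith("/")) {
      return url.substring(0, url.length() - 1);
    }
    return url;
  }

  public static void main(String[] args) {
    System.out.println(stripTrailingSlash("http://nm-host:8042/node/")); // http://nm-host:8042/node
  }
}
```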
[jira] [Updated] (YARN-11729) Broken 'AM Node Web UI' link on App details page
[ https://issues.apache.org/jira/browse/YARN-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-11729: -- Labels: pull-request-available (was: ) > Broken 'AM Node Web UI' link on App details page > > > Key: YARN-11729 > URL: https://issues.apache.org/jira/browse/YARN-11729 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.4.0 >Reporter: Bence Kosztolnik >Assignee: Bence Kosztolnik >Priority: Major > Labels: pull-request-available > > h6. Description: > Generated 'AM Node Web UI' link cannot be interpreted by RM. > h6. Reproduction > - Run MapReduce pi example job > - Open the app details page > - Click on AM Node Web UI > - Page won't load > h6. Fix: > The problem is the URL finishes with a '/' so RM cannot open the node page. > To fix this we should modify the UI code to generate the URL without the last > '/' char -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880194#comment-17880194 ] ASF GitHub Bot commented on YARN-11664: --- shameersss1 commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2337134702 > I think this is triggering a regression in enforcer > > ``` > [INFO] Adding ignore: * > [WARNING] Rule 1: org.apache.maven.plugins.enforcer.BanDuplicateClasses failed with message: > Duplicate classes found: > > Found in: > org.apache.hadoop:hadoop-client-minicluster:jar:3.5.0-SNAPSHOT:compile > org.apache.hadoop:hadoop-client-api:jar:3.5.0-SNAPSHOT:compile > Duplicate classes: > org/apache/hadoop/hdfs/protocol/datatransfer/package-info.class > ``` > > I'm going to revert the PR and we'll have to move that IOStreamPair class to a new package after all. pity Sure, @steveloughran - Instead of moving IOStreamPair to a new package , Can we ignore this specific `org/apache/hadoop/hdfs/protocol/datatransfer/package-info.class` class from BanDuplicateClasses enforcer? Anyhow package-info.class is not a critical class. > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. 
All test classes in the Yarn module are not considered > > > A quick search in the Yarn module revealed the following HDFS dependencies > 1. Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
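The alternative floated above — allowing this one duplicate instead of relocating the class — would be expressed in the enforcer rule configuration roughly as below. This is an illustrative fragment only: `BanDuplicateClasses` comes from the extra-enforcer-rules extension, and where exactly the rule is configured in the Hadoop build is not shown here.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <configuration>
    <rules>
      <banDuplicateClasses>
        <ignoreClasses>
          <!-- package-info carries no executable logic, so duplicating it
               across the client jars is harmless -->
          <ignoreClass>org.apache.hadoop.hdfs.protocol.datatransfer.package-info</ignoreClass>
        </ignoreClasses>
      </banDuplicateClasses>
    </rules>
  </configuration>
</plugin>
```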
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880158#comment-17880158 ] ASF GitHub Bot commented on YARN-11702: --- zeekling commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2336729314 Why are multiple requests for Containers sent? This is the key to the problem. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, I could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, the following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 SchedulerR
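The decrement trail above shows pending containers counted down one allocation at a time while the AM's ask arrives as a total. One way to picture the over-allocation is a pending-ask counter that has the total ask added to it again instead of being reconciled with what was already allocated. A purely illustrative toy model (this is not the real AppSchedulingInfo logic):

```java
public class PendingSketch {
  int pending = 0;
  int allocated = 0;

  // AM heartbeat carries the *total* ask; naively adding it again
  // after a partial allocation inflates pending.
  void naiveUpdate(int totalAsk) { pending += totalAsk; }

  // Reconciled update: pending is set to what is still outstanding.
  void fixedUpdate(int totalAsk) { pending = Math.max(0, totalAsk - allocated); }

  void allocate(int n) { allocated += n; pending -= n; }

  public static void main(String[] args) {
    PendingSketch s = new PendingSketch();
    s.naiveUpdate(20);  // AM wants 20 -> pending = 20
    s.allocate(8);      // 8 granted  -> pending = 12, allocated = 8
    s.naiveUpdate(20);  // same ask applied again -> pending = 32: over-allocation
    System.out.println(s.pending + " " + s.allocated); // 32 8
  }
}
```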
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880111#comment-17880111 ] ASF GitHub Bot commented on YARN-11709: --- zeekling commented on code in PR #6960: URL: https://github.com/apache/hadoop/pull/6960#discussion_r1749152353 ## hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java: ## @@ -451,8 +451,10 @@ public void startLocalizer(LocalizerStartContext ctx) } catch (PrivilegedOperationException e) { int exitCode = e.getExitCode(); - LOG.warn("Exit code from container {} startLocalizer is : {}", - locId, exitCode, e); + LOG.error("Unrecoverable issue occurred. Marking the node as unhealthy to prevent " + + "further containers to get scheduled on the node and cause application failures. " + + "Exit code from the container " + locId + "startLocalizer is : " + exitCode, e); + nmContext.getNodeStatusUpdater().reportException(e); Review Comment: ok > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster as it is not capable to run any container, but just fail the jobs. 
> {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_ > 14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.j > ava:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
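The change under review routes the unrecoverable error to the node health mechanism instead of only logging a WARN. A self-contained sketch of that pattern (the `NodeStatusUpdater` stub below is invented for illustration; the real interface belongs to the NodeManager):

```java
import java.io.IOException;

public class LocalizerFailureSketch {
  // Minimal stand-in for the NM's status updater: once an exception is
  // reported, the node advertises itself as unhealthy so the RM stops
  // scheduling containers on it.
  interface NodeStatusUpdater { void reportException(Exception e); }

  static class RecordingUpdater implements NodeStatusUpdater {
    boolean healthy = true;
    public void reportException(Exception e) { healthy = false; }
  }

  static void startLocalizer(NodeStatusUpdater updater, boolean executorMissing) {
    try {
      if (executorMissing) {
        throw new IOException(
            "Cannot run program \"/var/lib/yarn-ce/bin/container-executor\": error=2");
      }
      // ... normal localization would proceed here ...
    } catch (IOException e) {
      // Surface the failure instead of merely logging it, so the node
      // is marked unhealthy rather than failing job after job.
      updater.reportException(e);
    }
  }

  public static void main(String[] args) {
    RecordingUpdater u = new RecordingUpdater();
    startLocalizer(u, true);
    System.out.println(u.healthy); // false
  }
}
```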
[jira] [Commented] (YARN-11709) NodeManager should be shut down or blacklisted when it cannot run program "/var/lib/yarn-ce/bin/container-executor"
[ https://issues.apache.org/jira/browse/YARN-11709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880012#comment-17880012 ] ASF GitHub Bot commented on YARN-11709: --- brumi1024 merged PR #7028: URL: https://github.com/apache/hadoop/pull/7028 > NodeManager should be shut down or blacklisted when it cannot run program > "/var/lib/yarn-ce/bin/container-executor" > --- > > Key: YARN-11709 > URL: https://issues.apache.org/jira/browse/YARN-11709 > Project: Hadoop YARN > Issue Type: Improvement > Components: container-executor >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > When NodeManager encounters the below "No such file or directory" error > reported against the "container-executor", it should give up participating in > the cluster as it is not capable to run any container, but just fail the jobs. > {code:java} > 2023-01-18 10:08:10,600 WARN > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code > from container container_e159_1673543180101_9407_02_ > 14 startLocalizer is : -1 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:183) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:403) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.j > ava:1250) > Caused by: java.io.IOException: Cannot run program > "/var/lib/yarn-ce/bin/container-executor": error=2, No such file or directory > {code} -- This message was sent by 
Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879564#comment-17879564 ] ASF GitHub Bot commented on YARN-11664: --- steveloughran commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2331703195 I think this is triggering a regression in enforcer ``` [INFO] Adding ignore: * [WARNING] Rule 1: org.apache.maven.plugins.enforcer.BanDuplicateClasses failed with message: Duplicate classes found: Found in: org.apache.hadoop:hadoop-client-minicluster:jar:3.5.0-SNAPSHOT:compile org.apache.hadoop:hadoop-client-api:jar:3.5.0-SNAPSHOT:compile Duplicate classes: org/apache/hadoop/hdfs/protocol/datatransfer/package-info.class ``` I'm going to revert the PR and we'll have to move that IOStreamPair class to a new package after all. pity > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. All test classes in Yarn module is not considered > > > A quick search in Yarn module revealed following HDFS dependencies > 1. 
Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879211#comment-17879211 ] ASF GitHub Bot commented on YARN-11664: --- steveloughran commented on PR #6631: URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2328853748 merged to trunk; will take a PR to branch-3.4 > Remove HDFS Binaries/Jars Dependency From YARN > -- > > Key: YARN-11664 > URL: https://issues.apache.org/jira/browse/YARN-11664 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In principle Hadoop Yarn is independent of HDFS. It can work with any > filesystem. Currently there exists some code dependency for Yarn with HDFS. > This dependency requires Yarn to bring in some of the HDFS binaries/jars to > its class path. The idea behind this jira is to remove this dependency so > that Yarn can run without HDFS binaries/jars > *Scope* > 1. Non test classes are considered > 2. Some test classes which comes as transitive dependency are considered > *Out of scope* > 1. All test classes in Yarn module is not considered > > > A quick search in Yarn module revealed following HDFS dependencies > 1. Constants > {code:java} > import > org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; > import org.apache.hadoop.hdfs.DFSConfigKeys;{code} > > > 2. Exception > {code:java} > import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;{code} > > 3. 
Utility > {code:java} > import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} > > Both Yarn and HDFS depend on the *hadoop-common* module, > * Constant variables and utility classes can be moved to *hadoop-common* > * Instead of DSQuotaExceededException, use the parent exception > ClusterStorageCapacityExceeded -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879174#comment-17879174 ] ASF GitHub Bot commented on YARN-11702: --- slfan1989 commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2328506556 @aajisaka Sorry I missed some messages. I will review this PR. Please give me 1-2 days. cc: @shameersss1 > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContaine
[jira] [Commented] (YARN-11702) Fix Yarn over allocating containers
[ https://issues.apache.org/jira/browse/YARN-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879120#comment-17879120 ] ASF GitHub Bot commented on YARN-11702: --- aajisaka commented on PR #6990: URL: https://github.com/apache/hadoop/pull/6990#issuecomment-2328090305 Thank you @shameersss1. I'll merge this in this weekend if there's no objection. > Fix Yarn over allocating containers > --- > > Key: YARN-11702 > URL: https://issues.apache.org/jira/browse/YARN-11702 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, fairscheduler, scheduler, yarn >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > > *Replication Steps:* > Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) > > {code:java} > spark.executor.memory 1024M > spark.driver.memory 2048M > spark.executor.cores 1 > spark.executor.instances 20 > spark.dynamicAllocation.enabled false{code} > > Based on the setup, there should be 20 spark executors, but from the > ResourceManager (RM) UI, i could see that 32 executors were allocated and 12 > of them were released in seconds. On analyzing the Spark ApplicationMaster > (AM) logs, The following logs were observed. > > {code:java} > 4/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) > for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with > custom resources: vCores:2147483647> > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, > launching executors on 8 of them. > 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, > launching executors on 4 of them. > 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, > launching executors on 0 of them. 
> {code} > It was clear for the logs that extra allocated 12 containers are being > ignored from Spark side. Inorder to debug this further, additional log lines > were added to > [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] > class in increment and decrement of container request to expose additional > information about the request. > > {code:java} > 2024-06-24 14:10:14,075 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 > Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, > containerToUpdate=null} for: appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,077 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,111 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > 
(SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,112 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, > allocationRequestId=0, containerToUpdate=null} for: > appattempt_1719234929152_0004_01 > 2024-06-24 14:10:14,113 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo > (SchedulerEventDispatcher:Event Processor): Allocate Updates > PendingContainers: 15 Decremented by: 1 Schedule
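The increment/decrement bookkeeping visible in those AppSchedulingInfo log lines (pending incremented by 20 on the ask, decremented by 1 per allocation) can be modeled with a toy counter. This is a sketch only — all names are invented, it is not the Hadoop implementation, and it is not a claim about the actual root cause of YARN-11702 — but it illustrates how 32 allocations against a 20-container ask shows up in this kind of accounting:

```java
// Toy model of the pending-container bookkeeping seen in the
// AppSchedulingInfo logs above. All names are hypothetical; this is NOT
// Hadoop code and not an assertion about the actual YARN-11702 root cause.
public class PendingModel {

    private int pending;

    // "Updates PendingContainers: 0 Incremented by: 20"
    void request(int n) {
        pending += n;
    }

    // "Allocate Updates PendingContainers: N Decremented by: 1"
    // Guarded variant: refuses to allocate once nothing is pending.
    boolean tryAllocate() {
        if (pending <= 0) {
            return false;
        }
        pending--;
        return true;
    }

    public static void main(String[] args) {
        PendingModel app = new PendingModel();
        app.request(20);                    // the 20-executor ask
        int granted = 0;
        for (int i = 0; i < 32; i++) {      // 32 scheduling opportunities
            if (app.tryAllocate()) {
                granted++;
            }
        }
        // With the guard, only the 20 requested containers are granted;
        // skipping the pending check would hand out all 32.
        System.out.println("granted=" + granted);
    }
}
```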
[jira] [Updated] (YARN-6261) YARN queue mapping fails for users with no group
[ https://issues.apache.org/jira/browse/YARN-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-6261: - Labels: pull-request-available (was: ) > YARN queue mapping fails for users with no group > > > Key: YARN-6261 > URL: https://issues.apache.org/jira/browse/YARN-6261 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Pierre Villard >Assignee: Pierre Villard >Priority: Major > Labels: pull-request-available > > *Issue:* > Since Hadoop group mapping can be overridden (to get groups from an AD for > example), it is possible to be in a situation where a user does not have any > group (because the user is not in the AD but only defined locally): > {noformat} > $ hdfs groups zeppelin > zeppelin: > {noformat} > In this case, if the YARN Queue Mapping is configured and contains at least > one mapping of {{MappingType.GROUP}}, it won't be possible to get a queue for > the job submitted by such a user and the job won't be submitted at all. > *Expected result:* > In case a user does not have any group and no mapping is defined for this > user, the default queue should be assigned whatever the queue mapping > definition is. > *Workaround:* > A workaround is to define a group mapping of {{MappingType.USER}} for the > given user before defining any mapping of {{MappingType.GROUP}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
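The workaround described above — declaring a `u:` (user) mapping ahead of any `g:` (group) mapping — can be sketched as a capacity-scheduler.xml fragment. This is an illustration only: the `analysts` group and `etl` queue are hypothetical, and it relies on CapacityScheduler evaluating queue mappings in order with the first match winning, so the user entry is matched before a group lookup is ever attempted for that user:

```xml
<!-- Hypothetical capacity-scheduler.xml fragment. Mappings are evaluated
     left to right, first match wins: the u: entry routes the group-less
     user before the g: entry triggers a group lookup. -->
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <value>u:zeppelin:default,g:analysts:etl</value>
</property>
```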
[jira] [Commented] (YARN-6261) YARN queue mapping fails for users with no group
[ https://issues.apache.org/jira/browse/YARN-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878579#comment-17878579 ] ASF GitHub Bot commented on YARN-6261: -- pvillard31 closed pull request #198: YARN-6261 - Catch user with no group when getting queue from mapping … URL: https://github.com/apache/hadoop/pull/198 > YARN queue mapping fails for users with no group > > > Key: YARN-6261 > URL: https://issues.apache.org/jira/browse/YARN-6261 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Pierre Villard >Assignee: Pierre Villard >Priority: Major > > *Issue:* > Since Hadoop group mapping can be overridden (to get groups from an AD for > example), it is possible to be in a situation where a user does not have any > group (because the user is not in the AD but only defined locally): > {noformat} > $ hdfs groups zeppelin > zeppelin: > {noformat} > In this case, if the YARN Queue Mapping is configured and contains at least > one mapping of {{MappingType.GROUP}}, it won't be possible to get a queue for > the job submitted by such a user and the job won't be submitted at all. > *Expected result:* > In case a user does not have any group and no mapping is defined for this > user, the default queue should be assigned whatever the queue mapping > definition is. > *Workaround:* > A workaround is to define a group mapping of {{MappingType.USER}} for the > given user before defining any mapping of {{MappingType.GROUP}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877077#comment-17877077 ] ASF GitHub Bot commented on YARN-10345: --- brumi1024 commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312945414 Thanks @K0K0V0K for the update, merged to trunk. > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877076#comment-17877076 ] ASF GitHub Bot commented on YARN-10345: --- brumi1024 merged PR #7013: URL: https://github.com/apache/hadoop/pull/7013 > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877073#comment-17877073 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312931479 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 32s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 24s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 49s | | trunk passed | | +1 :green_heart: | compile | 1m 45s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 10s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 12s | | trunk passed | | +1 :green_heart: | javadoc | 1m 13s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 3s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 53s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 41s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 32s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 48s | | the patch passed | | +1 :green_heart: | compile | 1m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 36s | | the patch passed | | +1 :green_heart: | compile | 1m 31s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 31s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 0s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 55s | | the patch passed | | +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 48s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 33m 54s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 43s | | hadoop-mapreduce-client-app in the patch passed. | | +1 :green_heart: | unit | 4m 32s | | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. 
| | | | 152m 22s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/4/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 6fce02db40db 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 5406efead5107d238dec4a05cae63f7f1b38ca62 | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/4/testReport/ | | Max. process+thread count | 742 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-clien
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877017#comment-17877017 ] ASF GitHub Bot commented on YARN-10345: --- brumi1024 commented on code in PR #7013: URL: https://github.com/apache/hadoop/pull/7013#discussion_r1732826984 ## hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/AMWebServices.java: ## @@ -113,9 +113,17 @@ private void init() { response.setContentType(null); } - /** - * convert a job id string to an actual job and handle all the error checking. - */ + public static Job getJobFromContainerIdString(String cid, AppContext appCtx) + throws NotFoundException { +//example container_e06_1724414851587_0004_01_01 +String[] parts = cid.split("_"); +return getJobFromJobIdString("job_" + parts[2] + "_" + parts[3], appCtx); Review Comment: Nit: the string "job" and the separators could be replaced with the public static constants from the JobID class: https://github.com/apache/hadoop/blob/f3c3d9e0c6eae02dd21f875097ef76d85025ffe4/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobID.java#L51 > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. 
> *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
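The container-id-to-job-id derivation under review can be sketched as a standalone snippet. The `JOB`/`SEPARATOR` constants here merely mirror the ones the reviewer points at in `org.apache.hadoop.mapreduce.JobID`; they are redeclared locally so the sketch compiles without Hadoop on the classpath, and it assumes the epoch-prefixed container-id form (`container_eNN_...`) shown in the patch's example:

```java
// Standalone sketch of the parsing in getJobFromContainerIdString.
// JOB and SEPARATOR are local stand-ins for the JobID/ID constants the
// reviewer suggests using; this is not the actual patch code.
public class ContainerIdToJobId {

    static final String JOB = "job";   // mirrors JobID.JOB
    static final char SEPARATOR = '_'; // mirrors ID.SEPARATOR

    // e.g. container_e06_1724414851587_0004_01_000001
    //      parts[2] = cluster timestamp, parts[3] = job sequence number
    static String jobIdOf(String containerId) {
        String[] parts = containerId.split(String.valueOf(SEPARATOR));
        return JOB + SEPARATOR + parts[2] + SEPARATOR + parts[3];
    }

    public static void main(String[] args) {
        // -> job_1724414851587_0004
        System.out.println(jobIdOf("container_e06_1724414851587_0004_01_000001"));
    }
}
```

Note that the split indices assume the epoch segment (`e06`) is present; container ids are illustrative, not taken from a real cluster.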
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876995#comment-17876995 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312427765 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 35s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 36s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 38m 4s | | trunk passed | | +1 :green_heart: | compile | 1m 55s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 46s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 17s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 8s | | trunk passed | | +1 :green_heart: | javadoc | 1m 6s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 2s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 8s | | trunk passed | | +1 :green_heart: | shadedclient | 40m 53s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 32s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 1m 43s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 43s | | the patch passed | | +1 :green_heart: | compile | 1m 36s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 36s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 7s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 54s | | the patch passed | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 46s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 2m 5s | | the patch passed | | +1 :green_heart: | shadedclient | 36m 6s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 43s | | hadoop-mapreduce-client-app in the patch passed. | | +1 :green_heart: | unit | 4m 30s | | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. 
| | | | 167m 7s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 000dd57f74d3 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2846e029c86045f586f67a581d8c1fadcae7093d | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/3/testReport/ | | Max. process+thread count | 745 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-clien
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876990#comment-17876990 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2312389988 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 32s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 45s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 48s | | trunk passed | | +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 11s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 12s | | trunk passed | | +1 :green_heart: | javadoc | 1m 11s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 1s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 53s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 40s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 1m 34s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 34s | | the patch passed | | +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 32s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 54s | | the patch passed | | +1 :green_heart: | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 57s | | the patch passed | | +1 :green_heart: | shadedclient | 33m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 43s | | hadoop-mapreduce-client-app in the patch passed. | | +1 :green_heart: | unit | 4m 30s | | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. 
| | | | 151m 31s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 0366c7518436 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2846e029c86045f586f67a581d8c1fadcae7093d | | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/2/testReport/ | | Max. process+thread count | 712 (vs. ulimit of 5500) | | modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-clien
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876806#comment-17876806 ] ASF GitHub Bot commented on YARN-10345: --- hadoop-yetus commented on PR #7013: URL: https://github.com/apache/hadoop/pull/7013#issuecomment-2310735488 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 11m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 42s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 32m 29s | | trunk passed | | +1 :green_heart: | compile | 1m 42s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | checkstyle | 1m 12s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 10s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 1m 4s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 54s | | trunk passed | | +1 :green_heart: | shadedclient | 33m 59s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 51s | | the patch passed | | +1 :green_heart: | compile | 1m 36s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javac | 1m 36s | | the patch passed | | +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | javac | 1m 28s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 0s | [/results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/1/artifact/out/results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client.txt) | hadoop-mapreduce-project/hadoop-mapreduce-client: The patch generated 2 new + 8 unchanged - 0 fixed = 10 total (was 8) | | +1 :green_heart: | mvnsite | 0m 54s | | the patch passed | | +1 :green_heart: | javadoc | 0m 48s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 | | +1 :green_heart: | javadoc | 0m 47s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 | | +1 :green_heart: | spotbugs | 1m 59s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 1s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 8m 42s | | hadoop-mapreduce-client-app in the patch passed. | | -1 :x: | unit | 4m 40s | [/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-hs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/1/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-hs.txt) | hadoop-mapreduce-client-hs in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 164m 9s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.mapreduce.v2.hs.webapp.TestHsWebServicesLogs | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7013/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/7013 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux ffd8f5dea4e1 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool |
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-10345: -- Labels: pull-request-available (was: ) > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bence Kosztolnik >Priority: Critical > Labels: pull-request-available > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876749#comment-17876749 ] ASF GitHub Bot commented on YARN-10345: --- K0K0V0K opened a new pull request, #7013: URL: https://github.com/apache/hadoop/pull/7013 ### Description of PR - the following REST APIs did not have authorization - - /ws/v1/history/containerlogs/{containerid}/(unknown) - - /ws/v1/history/containers/{containerid}/logs - after this fix they have ACL authorization ### How was this patch tested? Setup: - mapreduce.cluster.acls.enabled = true on history server - submit example pi job with user1 (called job1) - - pi -Dmapreduce.job.queuename=root.default -Dmapreduce.job.acl-view-job=* 10 100 - submit example pi job with user1 (called job2) - - pi -Dmapreduce.job.queuename=root.default -Dmapreduce.job.acl-view-job=' ' 10 100 Before this commit - another user (aka user2) could see job1 and job2 logs via the problematic APIs After this commit - user2 cannot see job2 logs ### For code changes: - [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation? - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files? 
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876168#comment-17876168 ]

ASF GitHub Bot commented on YARN-11664:
---------------------------------------

shameersss1 commented on PR #6631:
URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2306276633

@steveloughran - Javadoc is fixed. The UT failures, Spotbugs and ASF license warnings seem unrelated.

> Remove HDFS Binaries/Jars Dependency From YARN
> ----------------------------------------------
>
>                 Key: YARN-11664
>                 URL: https://issues.apache.org/jira/browse/YARN-11664
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> In principle Hadoop YARN is independent of HDFS; it can work with any
> filesystem. Currently there is some code dependency on HDFS in YARN. This
> dependency requires YARN to bring some HDFS binaries/jars onto its class
> path. The idea behind this jira is to remove this dependency so that YARN
> can run without HDFS binaries/jars.
>
> *Scope*
> 1. Non-test classes are considered.
> 2. Some test classes that come in as transitive dependencies are considered.
>
> *Out of scope*
> 1. Other test classes in the YARN module are not considered.
>
> A quick search in the YARN module revealed the following HDFS dependencies:
>
> 1. Constants
> {code:java}
> import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier;
> import org.apache.hadoop.hdfs.DFSConfigKeys;
> {code}
>
> 2. Exception
> {code:java}
> import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;
> {code}
>
> 3. Utility
> {code:java}
> import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;
> {code}
>
> Both YARN and HDFS depend on the *hadoop-common* module, so:
> * Constants and utility classes can be moved to *hadoop-common*.
> * Instead of DSQuotaExceededException, use the parent exception
> ClusterStorageCapacityExceededException.
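The exception swap proposed in the last bullet works because a Java catch clause also matches subclasses: code that catches only the parent type never needs the subclass's module on its class path. Below is a toy model of that decoupling; the two exception classes are local stand-ins that merely mirror the names of hadoop-common's ClusterStorageCapacityExceededException and HDFS's DSQuotaExceededException, and `QuotaAwareUploader` is a hypothetical caller.

```java
// Toy stand-in for the parent exception that lives in hadoop-common.
class ClusterStorageCapacityExceededException extends RuntimeException {
  ClusterStorageCapacityExceededException(String msg) { super(msg); }
}

// Toy stand-in for the HDFS-specific subclass; with the proposed change,
// YARN code no longer references this type directly.
class DSQuotaExceededException extends ClusterStorageCapacityExceededException {
  DSQuotaExceededException(String msg) { super(msg); }
}

class QuotaAwareUploader {
  // "YARN-side" code: it catches only the parent type, so it compiles and
  // runs without any reference to the HDFS-specific subclass.
  static String upload(Runnable write) {
    try {
      write.run();
      return "ok";
    } catch (ClusterStorageCapacityExceededException e) {
      // A DSQuotaExceededException thrown by an HDFS-backed FileSystem
      // still lands here, because the catch matches subclasses too.
      return "quota-exceeded: " + e.getMessage();
    }
  }
}
```

The same pattern applies to the constants and utility classes: once they move to hadoop-common, YARN's imports resolve against hadoop-common alone and the HDFS jars become unnecessary at runtime.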
[jira] [Commented] (YARN-11664) Remove HDFS Binaries/Jars Dependency From YARN
[ https://issues.apache.org/jira/browse/YARN-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876134#comment-17876134 ]

ASF GitHub Bot commented on YARN-11664:
---------------------------------------

hadoop-yetus commented on PR #6631:
URL: https://github.com/apache/hadoop/pull/6631#issuecomment-2305875091

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 31s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 30s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 32m 39s | | trunk passed |
| +1 :green_heart: | compile | 17m 40s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 16m 9s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 4m 26s | | trunk passed |
| +1 :green_heart: | mvnsite | 6m 24s | | trunk passed |
| +1 :green_heart: | javadoc | 5m 14s | | trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 5m 39s | | trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| -1 :x: | spotbugs | 1m 13s | [/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core-warnings.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core in trunk has 1 extant spotbugs warnings. |
| +1 :green_heart: | shadedclient | 34m 23s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 33s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 4m 5s | | the patch passed |
| +1 :green_heart: | compile | 16m 56s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 16m 56s | | the patch passed |
| +1 :green_heart: | compile | 16m 12s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 16m 12s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 4m 23s | | the patch passed |
| +1 :green_heart: | mvnsite | 6m 24s | | the patch passed |
| +1 :green_heart: | javadoc | 5m 15s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 5m 31s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 12m 49s | | the patch passed |
| -1 :x: | shadedclient | 34m 44s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 19m 38s | | hadoop-common in the patch passed. |
| +1 :green_heart: | unit | 2m 48s | | hadoop-hdfs-client in the patch passed. |
| -1 :x: | unit | 120m 29s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | unit | 3m 31s | | hadoop-yarn-common in the patch passed. |
| -1 :x: | unit | 0m 50s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt) | hadoop-yarn-services-core in the patch failed. |
| -1 :x: | asflicense | 1m 8s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6631/20/artifact/out/results-asflicense.txt) | The patch generated 149 ASF License warnings. |