[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth updated YARN-11573:
----------------------------------
    Fix Version/s: 3.4.0

> Add config option to make container allocation prefer nodes without reserved containers
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-11573
>                 URL: https://issues.apache.org/jira/browse/YARN-11573
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
> Applications can get stuck when the container allocation logic does not consider additional nodes, only nodes that already have reserved containers. This behavior can even block new AMs from being allocated, so the applications never reach the RUNNING state.
> YARN-9598 mentions the same problem:
> {quote}Nodes which have been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate, otherwise scheduler may generate allocation or reservation proposal on these node which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
> {quote}
> Since that jira implements two other points, I created this one to implement the third point separately.
> h2. Notes:
> 1. FiCaSchedulerApp#commonCheckContainerAllocation logs the following:
> {code:java}
> Trying to allocate from reserved container in async scheduling mode
> {code}
> when RegularContainerAllocator creates a reservation proposal for a node that already has a reserved container.
> 2. A better approach is to avoid generating an AM container (or even a normal container) allocation proposal for a node that already has a reservation on it while there are still more nodes to check in the preferred node set. Completely preventing task containers from being allocated to such worker nodes could limit the downscaling ability we currently have. (A conceptual sketch of the intended selection follows this message.)
> h2. 3. CALL HIERARCHY
> 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean)
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet, boolean)
> 3.1. This is where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes.
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
> 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
> 7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
> 8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
> 9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
> 10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
> 11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
> 12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
> 13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
> 14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
> 15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer
> Example log line:
> {code:java}
> 2023-08-23 17:44:08,129 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: assignContainers: node= application=application_1692304118418_3151 priority=0 pendingAsk= vCores:1>,repeat=1> type=OFF_SWITCH
> {code}
> h2. 4. DETAILS OF RegularContainerAllocator#allocate
> [Method definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896]
> 4.1. Defining ordered
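To make note 2 above concrete, here is a minimal, self-contained sketch of the intended selection behavior. This is illustrative only, not the real RegularContainerAllocator code: all class and method names (PreferUnreservedNodePicker, Node, pick) are invented for the example.

{code:java}
import java.util.List;
import java.util.Optional;

// Illustrative sketch only -- not Hadoop code. It shows the intent of the
// proposed option: while other candidates remain, skip nodes that already
// hold a reservation, and only fall back to a reserved node as a last resort.
final class PreferUnreservedNodePicker {

  record Node(String id, boolean hasReservedContainer) {}

  static Optional<Node> pick(List<Node> candidates, boolean preferUnreserved) {
    if (preferUnreserved) {
      // First pass: consider only nodes without a reserved container.
      Optional<Node> unreserved = candidates.stream()
          .filter(n -> !n.hasReservedContainer())
          .findFirst();
      if (unreserved.isPresent()) {
        return unreserved;
      }
    }
    // Fallback: behave as before and take the first candidate, even if it
    // carries a reservation (such a proposal may later be rejected in
    // FiCaSchedulerApp#commonCheckContainerAllocation).
    return candidates.stream().findFirst();
  }

  public static void main(String[] args) {
    List<Node> candidates = List.of(
        new Node("node-1", true),   // holds a reserved container
        new Node("node-2", false)); // free of reservations
    // With the option enabled, node-2 is chosen instead of node-1.
    System.out.println(pick(candidates, true).orElseThrow().id());
  }
}
{code}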
[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768112#comment-17768112 ]

ASF GitHub Bot commented on YARN-11573:
---------------------------------------

brumi1024 commented on PR #6098:
URL: https://github.com/apache/hadoop/pull/6098#issuecomment-1731835705

   Thanks @szilard-nemeth for the update, LGTM. Merging to trunk.
[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768113#comment-17768113 ]

ASF GitHub Bot commented on YARN-11573:
---------------------------------------

brumi1024 merged PR #6098:
URL: https://github.com/apache/hadoop/pull/6098
[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768105#comment-17768105 ]

ASF GitHub Bot commented on YARN-11573:
---------------------------------------

hadoop-yetus commented on PR #6098:
URL: https://github.com/apache/hadoop/pull/6098#issuecomment-1731798334

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 50s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 48m 38s | | trunk passed |
   | +1 :green_heart: | compile | 1m 1s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | compile | 0m 52s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | checkstyle | 0m 53s | | trunk passed |
   | +1 :green_heart: | mvnsite | 0m 57s | | trunk passed |
   | +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 0m 45s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 1m 58s | | trunk passed |
   | +1 :green_heart: | shadedclient | 39m 29s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 48s | | the patch passed |
   | +1 :green_heart: | compile | 0m 54s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javac | 0m 54s | | the patch passed |
   | +1 :green_heart: | compile | 0m 46s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | javac | 0m 46s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 0m 44s | | the patch passed |
   | +1 :green_heart: | mvnsite | 0m 48s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 43s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 0m 39s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 1m 56s | | the patch passed |
   | +1 :green_heart: | shadedclient | 39m 27s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 103m 2s | | hadoop-yarn-server-resourcemanager in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
   | | | 248m 46s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6098/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6098 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux c6c5ace082c1 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / ffc0cc9cb155be99f075f8125376ee475debee7b |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6098/3/testReport/ |
   | Max. process+thread count | 898 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6098/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-11468:
----------------------------------
    Labels: pull-request-available  (was: )

> Zookeeper SSL/TLS support
> -------------------------
>
>                 Key: YARN-11468
>                 URL: https://issues.apache.org/jira/browse/YARN-11468
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Critical
>              Labels: pull-request-available
>
> The ZooKeeper 3.5.5 server can communicate with its clients over secure SSL/TLS connections:
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]
> SSL communication should be possible in every part of YARN that talks to ZooKeeper servers. ZooKeeper clients are used in the following places:
> * ResourceManager
> * ZKConfigurationStore
> * ZKRMStateStore
> The yarn.resourcemanager.zk-client-ssl.enabled flag that enables SSL communication should be provided in yarn-default.xml, and the required keystore and truststore parameters should be picked up from core-default.xml (HADOOP-18709).
> yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true in yarn-site.xml to make sure Curator is used; otherwise SSL cannot be enabled.
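A minimal sketch of the configuration described above, using only the property names given in the issue description (keystore/truststore settings are deliberately omitted because, per HADOOP-18709, they come from the core-default/core-site layer; the class name RmZkSslSketch is invented for this example):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch: enabling TLS for the ResourceManager's ZooKeeper clients.
public class RmZkSslSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Curator-based leader election is a prerequisite; without it,
    // SSL cannot be enabled for the RM's ZooKeeper communication.
    conf.setBoolean("yarn.resourcemanager.ha.curator-leader-elector.enabled", true);
    // The flag proposed by this issue:
    conf.setBoolean("yarn.resourcemanager.zk-client-ssl.enabled", true);
    // Keystore/truststore parameters are picked up from core-site.xml
    // (HADOOP-18709), so they are not repeated here.
    System.out.println(conf.get("yarn.resourcemanager.zk-client-ssl.enabled"));
  }
}
{code}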
[jira] [Commented] (YARN-11468) Zookeeper SSL/TLS support
[ https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768055#comment-17768055 ]

ASF GitHub Bot commented on YARN-11468:
---------------------------------------

ferdelyi commented on code in PR #6027:
URL: https://github.com/apache/hadoop/pull/6027#discussion_r1334511785

##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMStoreCommands.java:
##########

@@ -101,6 +102,16 @@ public void testFormatConfStoreCmdForZK() throws Exception {
     }
   }

+  @Test
+  public void testSSLEnabledConfiguration() {
+    // Test if we can enable SSL/TLS for the ZK Curator Client in YARN.
+    Configuration conf = new Configuration();
+    conf.set(YarnConfiguration.RM_ZK_CLIENT_SSL_ENABLED, Boolean.TRUE.toString());
+
+    assertEquals("The " + YarnConfiguration.RM_ZK_CLIENT_SSL_ENABLED + " value should be true.",
+        conf.get(YarnConfiguration.RM_ZK_CLIENT_SSL_ENABLED), Boolean.TRUE.toString());
+  }

Review Comment:
   Thank you Szilard for the review!

   The "The ZKCuratorManager is started with SSL disabled by default." case is implicitly covered by the already existing TestLeaderElectorService.java, as it uses Curator.

   Testing the SSL-enabled case will be trickier because CURATOR-658 ("Add Support for TLS-enabled TestingZooKeeperMain") won't be fixed, but there seems to be a way using ZooKeeperServerEmbeddedAdapter, which I still need to explore how to implement.
[jira] [Updated] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
[ https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Teke updated YARN-11567:
---------------------------------
    Fix Version/s: 3.4.0

> Aggregate container launch debug artifacts automatically in case of error
> --------------------------------------------------------------------------
>
>                 Key: YARN-11567
>                 URL: https://issues.apache.org/jira/browse/YARN-11567
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>            Reporter: Bence Kosztolnik
>            Assignee: Bence Kosztolnik
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
> When a container fails to launch without writing to a log file, we often want to see the artifacts captured by {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better understand the cause of the exit code. Enabling this feature for every container may be overkill, so we need a feature flag that captures these artifacts only in case of errors.
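A sketch of how the two flags could interact. The always-on key below exists today; the error-only key is hypothetical and merely stands in for whatever property PR #6053 actually introduces (DebugArtifactsSketch is likewise an invented name):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only. "yarn.nodemanager.log-container-debug-info.enabled" is the
// existing always-on switch; the "-on-error" key is HYPOTHETICAL -- the real
// property name is defined by the patch.
public class DebugArtifactsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Do not capture launch scripts / directory listings for every container...
    conf.setBoolean("yarn.nodemanager.log-container-debug-info.enabled", false);
    // ...but do capture them when a container exits with an error (hypothetical key).
    conf.setBoolean("yarn.nodemanager.log-container-debug-info-on-error.enabled", true);
  }
}
{code}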
[jira] [Commented] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
[ https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767999#comment-17767999 ]

ASF GitHub Bot commented on YARN-11567:
---------------------------------------

brumi1024 merged PR #6053:
URL: https://github.com/apache/hadoop/pull/6053
[jira] [Commented] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error
[ https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767998#comment-17767998 ]

ASF GitHub Bot commented on YARN-11567:
---------------------------------------

brumi1024 commented on PR #6053:
URL: https://github.com/apache/hadoop/pull/6053#issuecomment-1731389567

   Thanks @K0K0V0K for the patch, @p-szucs @slfan1989 for the review. Merging to trunk.
[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767995#comment-17767995 ]

ASF GitHub Bot commented on YARN-11573:
---------------------------------------

brumi1024 commented on code in PR #6098:
URL: https://github.com/apache/hadoop/pull/6098#discussion_r1334354428

##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java:
##########

@@ -154,6 +154,12 @@ public class CapacitySchedulerConfiguration extends ReservationSchedulerConfiguration
   @Private
   public static final boolean DEFAULT_RESERVE_CONT_LOOK_ALL_NODES = true;

+  public static final String PREFER_ALLOCATE_ON_NODES_WITHOUT_RESERVED_CONTAINERS = PREFIX

Review Comment:
   Nit: isn't this essentially a skip action on nodes that have reserved containers?
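A short usage sketch for the constant shown in the diff above, assuming it keeps this name after the review nit (the full key string is truncated in the quoted diff, so only the constant reference is used; PreferUnreservedToggleSketch is an invented name):

{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration;

// Sketch: toggling the new option programmatically, e.g. in a test,
// via the constant introduced by the patch.
public class PreferUnreservedToggleSketch {
  public static void main(String[] args) {
    CapacitySchedulerConfiguration csConf = new CapacitySchedulerConfiguration();
    csConf.setBoolean(
        CapacitySchedulerConfiguration.PREFER_ALLOCATE_ON_NODES_WITHOUT_RESERVED_CONTAINERS,
        true);
  }
}
{code}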
[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers
[ https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated YARN-11573:
----------------------------------
    Labels: pull-request-available  (was: )