[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers

2023-09-22 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-11573:
--
Fix Version/s: 3.4.0

> Add config option to make container allocation prefer nodes without reserved 
> containers
> ---
>
> Key: YARN-11573
> URL: https://issues.apache.org/jira/browse/YARN-11573
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Applications can get stuck when the container allocation logic does not 
> consider additional nodes, but only nodes that already have reserved containers.
> This behavior can even block new AMs from being allocated on nodes, so they 
> never reach the RUNNING state.
> A jira that mentions the same thing is YARN-9598:
> {quote}Nodes which have been reserved should be skipped when iterating 
> candidates in RegularContainerAllocator#allocate, otherwise scheduler may 
> generate allocation or reservation proposal on these node which will always 
> be rejected in FiCaScheduler#commonCheckContainerAllocation.
> {quote}
> Since that jira implements two other points, I decided to create this one and 
> implement the third point separately.
> h2. Notes:
> 1. FiCaSchedulerApp#commonCheckContainerAllocation will log this:
> {code:java}
> Trying to allocate from reserved container in async scheduling mode
> {code}
> when RegularContainerAllocator creates a reservation proposal for a node that 
> already has a reserved container.
> 2. A better approach is to avoid generating an AM container (or even a regular 
> container) allocation proposal on a node if it already has a reservation and 
> there are still more nodes to check in the preferred node set (a minimal sketch 
> of this check follows this excerpt). Completely preventing task containers from 
> being allocated to worker nodes could limit the downscaling ability we currently 
> have.
> h2. 3. CALL HIERARCHY
> 1. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
> 2. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId,
>  boolean)
> 3. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet,
>  boolean)
> 3.1. This is the place where it is decided whether to call 
> allocateContainerOnSingleNode or allocateContainersOnMultiNodes
> 4. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
> 5. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
> 6. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
> 7. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
> 8. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
> 9. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
> 10. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
> 11. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
> 12. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
> 13. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
> 14. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
> 15. 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer
> As an example, this logs lines like the following:
> {code:java}
> 2023-08-23 17:44:08,129 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator:
>  assignContainers: node= application=application_1692304118418_3151 
> priority=0 pendingAsk= vCores:1>,repeat=1> type=OFF_SWITCH
> {code}
> h2. 4. DETAILS OF RegularContainerAllocator#allocate
> [Method 
> definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896]
> 4.1. Defining ordered 
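
Note 2 above proposes skipping a node that already holds a reservation as long as other candidates remain in the preferred node set. Below is a minimal, hypothetical sketch of that check, not the actual YARN-11573 patch; the helper class and the boolean flag parameter are illustrative only, while FiCaSchedulerNode#getReservedContainer is the existing scheduler API used to detect a reservation.

{code:java}
import java.util.Collection;

import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode;

/** Illustrative sketch only; not part of the actual YARN-11573 change. */
final class ReservedNodeSkipSketch {

  /**
   * Return the first candidate without a reserved container when the
   * (hypothetical) preference flag is enabled; fall back to a reserved node
   * only if every candidate already holds a reservation.
   */
  static FiCaSchedulerNode pickNode(Collection<FiCaSchedulerNode> candidates,
      boolean preferNodesWithoutReservedContainers) {
    FiCaSchedulerNode fallback = null;
    for (FiCaSchedulerNode node : candidates) {
      boolean hasReservation = node.getReservedContainer() != null;
      if (preferNodesWithoutReservedContainers && hasReservation) {
        if (fallback == null) {
          fallback = node; // remember it, but keep scanning the preferred set
        }
        continue;
      }
      return node; // first node without a reservation
    }
    return fallback; // every candidate had a reservation
  }

  private ReservedNodeSkipSketch() {
  }
}
{code}

This mirrors the intent of the description: reserved nodes are only used when no reservation-free candidate is left, so AM containers are not starved by proposals that FiCaSchedulerApp#commonCheckContainerAllocation would reject anyway.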

[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768112#comment-17768112
 ] 

ASF GitHub Bot commented on YARN-11573:
---

brumi1024 commented on PR #6098:
URL: https://github.com/apache/hadoop/pull/6098#issuecomment-1731835705

   Thanks @szilard-nemeth for the update, LGTM. Merging to trunk.





[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768113#comment-17768113
 ] 

ASF GitHub Bot commented on YARN-11573:
---

brumi1024 merged PR #6098:
URL: https://github.com/apache/hadoop/pull/6098





[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768105#comment-17768105
 ] 

ASF GitHub Bot commented on YARN-11573:
---

hadoop-yetus commented on PR #6098:
URL: https://github.com/apache/hadoop/pull/6098#issuecomment-1731798334

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  48m 38s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m  1s |  |  trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 52s |  |  trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  checkstyle  |   0m 53s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 57s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 58s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 29s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 48s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 54s |  |  the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 46s |  |  the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  javac  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 44s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 48s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  |  the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 39s |  |  the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 56s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 27s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 103m  2s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 34s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 248m 46s |  |  |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6098/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6098 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux c6c5ace082c1 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / ffc0cc9cb155be99f075f8125376ee475debee7b |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6098/3/testReport/ |
   | Max. process+thread count | 898 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6098/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   

[jira] [Updated] (YARN-11468) Zookeeper SSL/TLS support

2023-09-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11468:
--
Labels: pull-request-available  (was: )

> Zookeeper SSL/TLS support
> -
>
> Key: YARN-11468
> URL: https://issues.apache.org/jira/browse/YARN-11468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Critical
>  Labels: pull-request-available
>
> The ZooKeeper 3.5.5 server can operate with secure SSL/TLS connections to its 
> clients.
> [https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide]
> SSL communication should be possible in the different parts of YARN that 
> communicate with ZooKeeper servers. ZooKeeper clients are used in the following 
> places:
>  * ResourceManager
>  * ZKConfigurationStore
>  * ZKRMStateStore
> The yarn.resourcemanager.zk-client-ssl.enabled flag to enable SSL 
> communication should be provided in yarn-default.xml, and the required 
> keystore and truststore parameters should be picked up from core-default.xml 
> (HADOOP-18709).
> yarn.resourcemanager.ha.curator-leader-elector.enabled has to be set to true via 
> yarn-site.xml to make sure Curator is used; otherwise SSL cannot be enabled.
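
As a rough illustration of the two flags described above (a sketch only; in practice they would normally be set in yarn-site.xml rather than programmatically, and the keystore/truststore settings come from core-site.xml per HADOOP-18709):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class ZkSslConfigSketch {

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();

    // Curator-based leader election must be enabled, otherwise SSL cannot be used.
    conf.setBoolean("yarn.resourcemanager.ha.curator-leader-elector.enabled", true);

    // Enable SSL/TLS for the RM's ZooKeeper clients (ZKRMStateStore, ZKConfigurationStore).
    conf.setBoolean("yarn.resourcemanager.zk-client-ssl.enabled", true);

    System.out.println(conf.get("yarn.resourcemanager.zk-client-ssl.enabled"));
  }

  private ZkSslConfigSketch() {
  }
}
{code}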






[jira] [Commented] (YARN-11468) Zookeeper SSL/TLS support

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768055#comment-17768055
 ] 

ASF GitHub Bot commented on YARN-11468:
---

ferdelyi commented on code in PR #6027:
URL: https://github.com/apache/hadoop/pull/6027#discussion_r1334511785


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMStoreCommands.java:
##
@@ -101,6 +102,16 @@ public void testFormatConfStoreCmdForZK() throws Exception {
 }
   }
 
+  @Test
+  public void testSSLEnabledConfiguration() {
+    // Test if we can enable SSL/TLS for the ZK Curator Client in YARN.
+    Configuration conf = new Configuration();
+    conf.set(YarnConfiguration.RM_ZK_CLIENT_SSL_ENABLED, Boolean.TRUE.toString());
+
+    assertEquals("The " + YarnConfiguration.RM_ZK_CLIENT_SSL_ENABLED + " value should be true.",
+        conf.get(YarnConfiguration.RM_ZK_CLIENT_SSL_ENABLED), Boolean.TRUE.toString());
+  }

Review Comment:
   Thank you Szilard for the review!
   
   "The ZKCuratorManager is started with SSL disabled by default. " case is 
implicitly covered in the already existing TestLeaderElectorService.java, as it 
uses Curator.
   
   Testing the SSL case will be more tricky due to CURATOR-658 "Add Support for 
TLS-enabled TestingZooKeeperMain" won't be fixed, but it seems there is a way 
by using ZooKeeperServerEmbeddedAdapter, which I need to explore how to 
implement. 











[jira] [Updated] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error

2023-09-22 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-11567:
-
Fix Version/s: 3.4.0

> Aggregate container launch debug artifacts automatically in case of error
> -
>
> Key: YARN-11567
> URL: https://issues.apache.org/jira/browse/YARN-11567
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In cases where a container fails to launch without writing to a log file, we 
> often want to see the artifacts captured by 
> {{yarn.nodemanager.log-container-debug-info.enabled}} in order to better 
> understand the cause of the exit code. Enabling this feature for every 
> container may be overkill, so we need a feature flag that captures these 
> artifacts only in case of errors.
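
A minimal sketch of how such a flag pair could be consulted; the error-only property name below is purely hypothetical (the jira does not name it), and the default values are illustrative only:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class DebugArtifactFlagSketch {

  // Hypothetical key for the proposed "only on error" flag; YARN-11567 does not name it.
  static final String LOG_DEBUG_INFO_ON_ERROR =
      "yarn.nodemanager.log-container-debug-info-on-error.enabled";

  /** Decide whether launch debug artifacts should be kept for a finished container. */
  static boolean shouldKeepDebugInfo(Configuration conf, int exitCode) {
    // Existing behavior: capture artifacts for every container when enabled
    // (the default used here is illustrative).
    boolean always = conf.getBoolean(
        "yarn.nodemanager.log-container-debug-info.enabled", false);
    // Proposed behavior: capture them only when the container exited with an error.
    boolean onErrorOnly = conf.getBoolean(LOG_DEBUG_INFO_ON_ERROR, false);
    return always || (onErrorOnly && exitCode != 0);
  }

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean(LOG_DEBUG_INFO_ON_ERROR, true);
    System.out.println(shouldKeepDebugInfo(conf, 1)); // true: non-zero exit code
  }

  private DebugArtifactFlagSketch() {
  }
}
{code}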






[jira] [Commented] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767999#comment-17767999
 ] 

ASF GitHub Bot commented on YARN-11567:
---

brumi1024 merged PR #6053:
URL: https://github.com/apache/hadoop/pull/6053










[jira] [Commented] (YARN-11567) Aggregate container launch debug artifacts automatically in case of error

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767998#comment-17767998
 ] 

ASF GitHub Bot commented on YARN-11567:
---

brumi1024 commented on PR #6053:
URL: https://github.com/apache/hadoop/pull/6053#issuecomment-1731389567

   Thanks @K0K0V0K for the patch, @p-szucs @slfan1989 for the review. Merging to trunk.










[jira] [Commented] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers

2023-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767995#comment-17767995
 ] 

ASF GitHub Bot commented on YARN-11573:
---

brumi1024 commented on code in PR #6098:
URL: https://github.com/apache/hadoop/pull/6098#discussion_r1334354428


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java:
##
@@ -154,6 +154,12 @@ public class CapacitySchedulerConfiguration extends ReservationSchedulerConfigur
   @Private
   public static final boolean DEFAULT_RESERVE_CONT_LOOK_ALL_NODES = true;
 
+  public static final String PREFER_ALLOCATE_ON_NODES_WITHOUT_RESERVED_CONTAINERS = PREFIX

Review Comment:
   Nit: isn't this essentially a skip action on nodes that have a reserved container?
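
For context, a minimal sketch of how a boolean CapacityScheduler flag like the one in the diff is typically declared and read. Only the constant name and the PREFIX-based declaration come from the diff above; the full key, default value, and class name below are assumptions for illustration, not necessarily what the merged patch uses.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class PreferUnreservedNodesConfigSketch {

  // Constant name taken from the diff above; the full key (prefix + suffix)
  // is an assumed example, not necessarily the one used by the merged patch.
  static final String PREFER_ALLOCATE_ON_NODES_WITHOUT_RESERVED_CONTAINERS =
      "yarn.scheduler.capacity.prefer-allocate-on-nodes-without-reserved-containers";

  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean(PREFER_ALLOCATE_ON_NODES_WITHOUT_RESERVED_CONTAINERS, true);

    // Such scheduler flags are usually read with an explicit default value:
    boolean skipReservedNodes =
        conf.getBoolean(PREFER_ALLOCATE_ON_NODES_WITHOUT_RESERVED_CONTAINERS, false);
    System.out.println("Prefer nodes without reserved containers: " + skipReservedNodes);
  }

  private PreferUnreservedNodesConfigSketch() {
  }
}
{code}

This only illustrates the read pattern; the actual property name, default, and wiring are defined by the merged PR.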






[jira] [Updated] (YARN-11573) Add config option to make container allocation prefer nodes without reserved containers

2023-09-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-11573:
--
Labels: pull-request-available  (was: )
