[jira] [Created] (YARN-11704) Avoid nested 'AND' placement constraint for non tags in scheduling request
Junfan Zhang created YARN-11704: --- Summary: Avoid nested 'AND' placement constraint for non tags in scheduling request Key: YARN-11704 URL: https://issues.apache.org/jira/browse/YARN-11704 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11703) Validate accessibility of Node Manager working directories
Bence Kosztolnik created YARN-11703: --- Summary: Validate accessibility of Node Manager working directories Key: YARN-11703 URL: https://issues.apache.org/jira/browse/YARN-11703 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.5.0 Reporter: Bence Kosztolnik Assignee: Bence Kosztolnik h3. Problem: If a subdirectory or file under *yarn.nodemanager.local-dirs* or *yarn.nodemanager.log-dirs* changes permissions and is no longer accessible to the Node Manager, the node will not reach an unhealthy state, but container runs will fail. h3. Testing: - run an example PI job in a cluster - make the user cache directory of the user unreadable by the Node Manager, for example: {noformat} chmod 222 ./usercache/{user} {noformat} - the cluster state will stay healthy - re-run the PI job - containers will fail on the affected node, with {noformat} ... Not able to initialize app-cache directories in any of the configured local directories for user ...{noformat} h3. Solution: Add an extra validation to DirectoryCollection#testDirs to ensure the contents of the local-dirs and log-dirs are accessible to the Node Manager, and turn the node unhealthy if they are not. A new flag will be introduced to enable this validation: *yarn.nodemanager.working-dir-content-accessibility-validation.enabled* (default true)
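The proposed validation could be sketched roughly as follows. This is a minimal sketch only: the class and method names are hypothetical, not the actual DirectoryCollection API.

```java
import java.io.File;

/**
 * Sketch: recursively verify that the NM process can read every entry and
 * traverse every directory under a working dir. Hypothetical names; the
 * real change would live in DirectoryCollection's health checks.
 */
public class DirAccessCheck {
  static boolean isContentAccessible(File root) {
    // A directory must be readable (to list) and executable (to traverse).
    if (!root.canRead() || (root.isDirectory() && !root.canExecute())) {
      return false;
    }
    File[] children = root.listFiles(); // null for plain files
    if (children != null) {
      for (File child : children) {
        if (!isContentAccessible(child)) {
          return false;
        }
      }
    }
    return true;
  }
}
```

If such a check returned false for any configured dir, the node would be reported unhealthy, matching the semantics of the proposed *yarn.nodemanager.working-dir-content-accessibility-validation.enabled* flag.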
[jira] [Created] (YARN-11702) Fix Yarn over allocating containers
Syed Shameerur Rahman created YARN-11702: Summary: Fix Yarn over allocating containers Key: YARN-11702 URL: https://issues.apache.org/jira/browse/YARN-11702 Project: Hadoop YARN Issue Type: Bug Reporter: Syed Shameerur Rahman Assignee: Syed Shameerur Rahman *Replication Steps:* Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler) {code:java} spark.executor.memory 1024M spark.driver.memory 2048M spark.executor.cores 1 spark.executor.instances 20 spark.dynamicAllocation.enabled false{code} Based on this setup there should be 20 Spark executors, but from the ResourceManager (RM) UI I could see that 32 executors were allocated and 12 of them were released within seconds. On analyzing the Spark ApplicationMaster (AM) logs, the following entries were observed. {code:java} 24/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom resources: 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them. 24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, launching executors on 8 of them. 24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, launching executors on 4 of them. 24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, launching executors on 0 of them. {code} It was clear from the logs that the 12 extra allocated containers were being ignored on the Spark side. In order to debug this further, additional log lines were added to the [AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427] class in the increment and decrement of container requests to expose additional information about the request. 
{code:java} 2024-06-24 14:10:14,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC Server handler 42 on default port 8030): Updates PendingContainers: 0 Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,077 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,111 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,112 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, containerToUpdate=null} for: appattempt_1719234929152_0004_01 2024-06-24 14:10:14,113 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0
[jira] [Created] (YARN-11700) some applications on the query page display queues, usernames, and applications that appear as null
zhangzhanchang created YARN-11700: - Summary: some applications on the query page display queues, usernames, and applications that appear as null Key: YARN-11700 URL: https://issues.apache.org/jira/browse/YARN-11700 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: zhangzhanchang
[jira] [Resolved] (YARN-10379) Refactor ContainerExecutor exit code Exception handling
[ https://issues.apache.org/jira/browse/YARN-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi resolved YARN-10379. --- Resolution: Won't Fix > Refactor ContainerExecutor exit code Exception handling > --- > > Key: YARN-10379 > URL: https://issues.apache.org/jira/browse/YARN-10379 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Benjamin Teke >Assignee: Ferenc Erdelyi >Priority: Minor > > Currently, every time a shell command is executed and returns with a > non-zero exit code, an exception gets thrown. But along the call tree this > exception gets caught and, after some info/warn logging and other processing > steps, rethrown, possibly wrapped in another exception. For example: > * from PrivilegedOperationExecutor.executePrivilegedOperation - > ExitCodeException catch (as IOException), PrivilegedOperationException thrown > * then in LinuxContainerExecutor.startLocalizer - > PrivilegedOperationException catch, exit code collection, logging, IOException > rethrown > * then in ResourceLocalizationService.run - generic Exception catch, but > there is a TODO for separate ExitCodeException handling; however, that > information is only present here in an error message string > This flow could be simplified and unified in the different executors. For > example, use one specific exception until the last possible step, catch it > only where necessary, and keep the exit code, as it could be used later in the > process. This change could help with maintainability and readability.
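The "one specific exception until the last possible step" idea could be sketched as a dedicated exception type that carries the exit code end-to-end instead of being re-wrapped into IOException along the way. The name below is illustrative, not an actual Hadoop class:

```java
/**
 * Sketch: a single exception type that carries the shell exit code to the
 * last catch site, so intermediate layers can rethrow it unchanged instead
 * of wrapping it and losing the code. Illustrative name only.
 */
class ContainerExecutionException extends Exception {
  private final int exitCode;

  ContainerExecutionException(String message, int exitCode) {
    super(message);
    this.exitCode = exitCode; // preserved across the whole call tree
  }

  int getExitCode() {
    return exitCode;
  }
}
```

Callers at the top of the tree could then branch on `getExitCode()` directly, rather than parsing the code back out of an error message string.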
[jira] [Resolved] (YARN-11699) Diagnostics lacks userlimit info when user capacity has reached its maximum limit
[ https://issues.apache.org/jira/browse/YARN-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11699. --- Fix Version/s: 3.4.1 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.4.1, 3.5.0 Assignee: Jiandan Yang Resolution: Fixed > Diagnostics lacks userlimit info when user capacity has reached its maximum > limit > - > > Key: YARN-11699 > URL: https://issues.apache.org/jira/browse/YARN-11699 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1, 3.5.0 > > Attachments: image-2024-05-29-15-47-53-217.png > > > Capacity scheduler supports user limits to prevent a single user from using > the whole queue's resources, but when the resources used by a user reach the > user limit, the diagnostics on the web page lack the related info, as shown in > the figure below. We may need to add the user limit and the resources used by > the user to help debugging. > !image-2024-05-29-15-47-53-217.png|width=831,height=145!
[jira] [Resolved] (YARN-11471) FederationStateStoreFacade Cache Support Caffeine
[ https://issues.apache.org/jira/browse/YARN-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11471. --- Fix Version/s: 3.4.1 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > FederationStateStoreFacade Cache Support Caffeine > - > > Key: YARN-11471 > URL: https://issues.apache.org/jira/browse/YARN-11471 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1, 3.5.0 > > > FederationStateStoreFacade Cache Support Caffeine
[jira] [Created] (YARN-11699) Diagnostics
Jiandan Yang created YARN-11699: Summary: Diagnostics Key: YARN-11699 URL: https://issues.apache.org/jira/browse/YARN-11699 Project: Hadoop YARN Issue Type: Improvement Reporter: Jiandan Yang
[jira] [Created] (YARN-11698) Finished containers shouldn't be stored indefinitely in the NM state store
Adam Binford created YARN-11698: --- Summary: Finished containers shouldn't be stored indefinitely in the NM state store Key: YARN-11698 URL: https://issues.apache.org/jira/browse/YARN-11698 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.4.0 Reporter: Adam Binford https://issues.apache.org/jira/browse/YARN-4771 updated the container tracking in the state store to only remove containers when their application ends, in order to make sure all container logs get aggregated even during NM restarts. This can lead to a significant number of containers building up in the state store and a lot of things to recover. Since this was purely for making sure logs get aggregated, it could be done in a smarter way that takes into account both rolling log aggregation and the case where log aggregation is not enabled at all.
[jira] [Created] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication
Syed Shameerur Rahman created YARN-11697: Summary: Fix fair scheduler race condition in removeApplicationAttempt and moveApplication Key: YARN-11697 URL: https://issues.apache.org/jira/browse/YARN-11697 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.2.1 Reporter: Syed Shameerur Rahman Assignee: Syed Shameerur Rahman For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the following exception {code:java} 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type APP_ATTEMPT_REMOVED to the Event Dispatcher java.lang.IllegalStateException: Given app to remove appattempt_1706879498319_86660_01 Alloc: does not exist in queue [root.tier2.livy, demand=, running=, share=, w=1.0] at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:750) {code} The exception seems similar to the one mentioned in YARN-5136, but it looks like there are still some edge cases not covered by YARN-5136. 1. On a deeper look, I could see that, as mentioned in the comment here, if a moveApplication and a removeApplicationAttempt call for the same attempt are processed in short succession, the application attempt will still contain a queue reference but is already removed from the list of applications for the queue. 2. 
This can happen when [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908] removes the appAttempt from the queue and [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707] also tries to remove the same appAttempt from the queue. 3. On further checking, I could see that before doing [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779] a writeLock on the appAttempt is taken, whereas for [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665] I don't see any write lock being taken, which can result in a race condition if the same appAttempt is being processed. 4. Additionally, as mentioned in the comment here, when such a scenario occurs we ideally should not take down the RM.
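A minimal sketch of the locking idea in points 3 and 4: guard removeApplicationAttempt with the same write lock that moveApplication uses, and tolerate an already-removed attempt instead of throwing. All names here are illustrative, not the actual FairScheduler code:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of the proposed fix; class/method names are illustrative. */
class SchedulerSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Set<String> queueApps = new HashSet<>();

  void addApplicationAttempt(String attemptId) {
    lock.writeLock().lock();
    try {
      queueApps.add(attemptId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  void moveApplication(String attemptId) {
    lock.writeLock().lock(); // move already holds the write lock
    try {
      queueApps.remove(attemptId); // attempt leaves the source queue
    } finally {
      lock.writeLock().unlock();
    }
  }

  void removeApplicationAttempt(String attemptId) {
    lock.writeLock().lock(); // proposed: same lock, so the two can't interleave
    try {
      // Tolerate a concurrent move instead of throwing
      // IllegalStateException, which previously took down the RM.
      if (!queueApps.remove(attemptId)) {
        System.out.println("Attempt " + attemptId + " already removed; ignoring");
      }
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

With both paths serialized on the same write lock, the remove either sees the attempt still in the queue or observes the completed move, and neither interleaving crashes the dispatcher.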
[jira] [Resolved] (YARN-11692) Support mixed cgroup v1/v2 controller structure
[ https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11692. -- Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Support mixed cgroup v1/v2 controller structure > --- > > Key: YARN-11692 > URL: https://issues.apache.org/jira/browse/YARN-11692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Benjamin Teke >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > > There were heavy changes on the device side in cgroup v2. To keep supporting > FPGAs and GPUs short term, mixed structures where some of the cgroup > controllers are from v1 while others are from v2 should be supported. More info: > https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/
[jira] [Resolved] (YARN-11669) [Umbrella] cgroup v2 support
[ https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11669. -- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > [Umbrella] cgroup v2 support > > > Key: YARN-11669 > URL: https://issues.apache.org/jira/browse/YARN-11669 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Ferenc Erdelyi >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.5.0 > > > cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 > already moved to cgroup v2 as a default, hence YARN should support it. This > umbrella tracks the required work. > [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html] > A way to test the newly added features: > # Turn on cgroup v1 based on the current > [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]. > # System prerequisites: > ## the file {{/etc/mtab}} should contain a mount path with the file system > type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OS's > ## the {{cgroup.subtree_control}} file should contain the necessary > controllers (update it with: {{echo "+cpu +io +memory" > > cgroup.subtree_control}}) > ## either create the YARN hierarchy and give recursive access to the user > running the NM on the node. The hierarchy is {{hadoop-yarn}} by default > (controlled by > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and > recursive mode is required, because as soon as the directory is created it > will be filled with the controller files which YARN will try to edit. > ### Alternatively if the NM process user has access rights on the > {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the > {{cgroup.subtree_control}} file. 
> # YARN configuration > ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should > point to the directory where the cgroup2 structure is mounted and the > {{hadoop-yarn}} hierarchy was created > ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be > set to {{true}} > ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}} > # Launch the NM and monitor the cgroup files on container launches (i.e.: > {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}})
[jira] [Created] (YARN-11696) Add debug-level logs in RMAppImpl#aggregateLogReport and RMAppImpl#getLogAggregationStatusForAppReport
Susheel Gupta created YARN-11696: Summary: Add debug-level logs in RMAppImpl#aggregateLogReport and RMAppImpl#getLogAggregationStatusForAppReport Key: YARN-11696 URL: https://issues.apache.org/jira/browse/YARN-11696 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Susheel Gupta Assignee: Susheel Gupta Events keep accumulating in the event queue and many event threads are blocked. To discover the deadlocked threads, add a few debug-level logs to RMAppImpl#aggregateLogReport and RMAppImpl#getLogAggregationStatusForAppReport. {code:java} "RM Event dispatcher" #93 prio=5 os_prio=0 tid=0x7fcb67120800 nid=0x13e62 waiting on condition [0x7fbef632a000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x7fc44cada248> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.aggregateLogReport(RMAppImpl.java:1799) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handleLogAggregationStatus(RMNodeImpl.java:1478) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.access$500(RMNodeImpl.java:104) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1239) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1195) at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487) - locked <0x7fc04c0b6970> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:667) at org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1124) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1108) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:219) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:133) at java.lang.Thread.run(Thread.java:748) {code}
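The kind of debug log being proposed can be sketched like this, using java.util.logging in place of YARN's logger; the method body is a hypothetical stand-in for RMAppImpl#aggregateLogReport, not the actual code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Sketch: debug (FINE) logs before and after taking the app write lock, so
 * a thread dump can be correlated with who was waiting on which node/app.
 */
class RMAppImplSketch {
  private static final Logger LOG = Logger.getLogger(RMAppImplSketch.class.getName());
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

  void aggregateLogReport(String nodeId) {
    if (LOG.isLoggable(Level.FINE)) { // guard: no string building unless enabled
      LOG.fine("aggregateLogReport waiting for write lock, node=" + nodeId);
    }
    rwLock.writeLock().lock();
    try {
      if (LOG.isLoggable(Level.FINE)) {
        LOG.fine("aggregateLogReport acquired write lock, node=" + nodeId);
      }
      // ... update log aggregation status here ...
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
```

The "waiting"/"acquired" pair is what makes the blocked dispatcher threads visible: a thread that logged the first line but not the second is stuck on the lock.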
[jira] [Created] (YARN-11695) Fixed non-idempotent tests in `TestTaskRunner`
Kaiyao Ke created YARN-11695: Summary: Fixed non-idempotent tests in `TestTaskRunner` Key: YARN-11695 URL: https://issues.apache.org/jira/browse/YARN-11695 Project: Hadoop YARN Issue Type: Bug Reporter: Kaiyao Ke All tests in `org.apache.hadoop.yarn.sls.scheduler.TestTaskRunner` are non-idempotent and fail upon repeated execution within the same JVM instance due to self-induced state pollution. Specifically, the test runs modify static fields (e.g. `PreStartTask.first`) in the task classes without restoring them. Therefore, repeated runs throw assertion errors. Sample error message of `TestTaskRunner#testPreStartQueueing` in a repeated test run: ``` java.lang.AssertionError: at org.junit.Assert.fail(Assert.java:87) at org.junit.Assert.assertTrue(Assert.java:42) at org.junit.Assert.assertTrue(Assert.java:53) at org.apache.hadoop.yarn.sls.scheduler.TestTaskRunner.testPreStartQueueing(TestTaskRunner.java:244) at java.base/java.lang.reflect.Method.invoke(Method.java:568) ``` The fix is done by explicitly resetting the static variables (countdown latches and booleans) at the start of each test.
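The fix pattern, explicitly resetting static fixtures before each test rather than relying on class-initialization defaults, can be sketched as follows. Field and class names mirror the description but are illustrative, not the actual SLS test code:

```java
import java.util.concurrent.CountDownLatch;

/** Static state that a previous test run can pollute (illustrative). */
class PreStartTask {
  static volatile boolean first = true;
  static CountDownLatch latch = new CountDownLatch(1);
}

/** Sketch of the fix: a per-test reset of the shared static fixtures. */
class TaskRunnerTestSketch {
  // Would be annotated @Before in JUnit: runs before every test method.
  void setUp() {
    PreStartTask.first = true;                  // restore expected initial state
    PreStartTask.latch = new CountDownLatch(1); // fresh latch per test
  }
}
```

With the reset in place, running the same test twice in one JVM sees identical starting state both times.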
[jira] [Created] (YARN-11694) 2 tests are non-idempotent (passes in the first run but fails in repeated runs in the same JVM)
Kaiyao Ke created YARN-11694: Summary: 2 tests are non-idempotent (passes in the first run but fails in repeated runs in the same JVM) Key: YARN-11694 URL: https://issues.apache.org/jira/browse/YARN-11694 Project: Hadoop YARN Issue Type: Bug Reporter: Kaiyao Ke ## TestTimelineReaderMetrics#testTimelineReaderMetrics `org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderMetrics#testTimelineReaderMetrics` does not perform a source unregistration after test execution, so the `TimelineReaderMetrics.getInstance()` call in repeated runs will throw an error since the metrics source `TimelineReaderMetrics` already exists. Error message in the 2nd run: ``` org.apache.hadoop.metrics2.MetricsException: Metrics source TimelineReaderMetrics already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.yarn.server.timelineservice.metrics.TimelineReaderMetrics.getInstance(TimelineReaderMetrics.java:61) at org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderMetrics.setup(TestTimelineReaderMetrics.java:52) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) ``` ## TestFederationStateStoreClientMetrics#testSuccessfulCalls `org.apache.hadoop.yarn.server.federation.store.metrics.TestFederationStateStoreClientMetrics#testSuccessfulCalls` retrieves the historical number of successful calls, but does not retrieve the historical average latency of those calls. For example, it asserts `FederationStateStoreClientMetrics.getLatencySucceededCalls()` is 100 after the `goodStateStore.registerSubCluster(100);` call. 
However, in the second execution of the test, 2 historical calls from the first execution (with latencies 100 and 200 respectively) have already been recorded, so `FederationStateStoreClientMetrics.getLatencySucceededCalls()` will be about 133 (the mean of 100, 200 and 100). Error message in the 2nd run: ``` java.lang.AssertionError: expected:<100.0> but was:<133.34> at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.failNotEquals(Assert.java:835) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:685) at org.apache.hadoop.yarn.server.federation.store.metrics.TestFederationStateStoreClientMetrics.testSuccessfulCalls(TestFederationStateStoreClientMetrics.java:63) at java.base/java.lang.reflect.Method.invoke(Method.java:568) ```
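The first problem (the duplicate metrics source) can be modeled with a stdlib analog: a static source registry that throws on a duplicate name unless the previous test unregisters it in teardown. The real fix would unregister the source from (or shut down) the Hadoop metrics system in an @After method; the names below are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Stdlib analog of the duplicate-source failure: register() throws for an
 * existing name, so each test must unregister() in teardown to stay
 * idempotent. Illustrative names, not the DefaultMetricsSystem API.
 */
class MetricsRegistrySketch {
  private static final Map<String, Object> SOURCES = new HashMap<>();

  static void register(String name, Object source) {
    if (SOURCES.putIfAbsent(name, source) != null) {
      throw new IllegalStateException("Metrics source " + name + " already exists!");
    }
  }

  static void unregister(String name) { // call this in @After
    SOURCES.remove(name);
  }
}
```

The second problem is analogous but with accumulated averages: either reset the metrics between runs or assert against deltas rather than absolute historical values.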
[jira] [Created] (YARN-11693) Refactor Container scheduler
Mohit Gaggar created YARN-11693: --- Summary: Refactor Container scheduler Key: YARN-11693 URL: https://issues.apache.org/jira/browse/YARN-11693 Project: Hadoop YARN Issue Type: Task Components: scheduler, scheduler preemption Reporter: Mohit Gaggar The ContainerScheduler class, responsible for scheduling containers on nodes, handles multiple smaller responsibilities, making it hard to extend its functionality. This PR breaks down the class responsibilities into: * ContainerQueueManager: handles all queuing-related functions, like adding to/removing from the queue * ContainerStarter: maintains the running queue of containers and starts new containers * ContainerPolicyManager: handles the container termination/pausing policy when enough resources are not available * ContainerScheduler: main class which works with the other helper classes to maintain container queues !https://msdata.visualstudio.com/25bee5cc-1a60-44a1-904d-a734363b40d4/_apis/git/repositories/719ef898-e962-4b70-a49b-03c67abb2b07/pullRequests/1249358/attachments/Refactoring%20Container%20Scheduler%20%281%29.png|width=710,height=441!
[jira] [Created] (YARN-11692) Support mixed cgroup v1/v2 controller structure
Benjamin Teke created YARN-11692: Summary: Support mixed cgroup v1/v2 controller structure Key: YARN-11692 URL: https://issues.apache.org/jira/browse/YARN-11692 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke There were heavy changes on the device side in cgroup v2. To keep supporting FPGAs and GPUs short term, mixed structures where some of the cgroup controllers are from v1 while others are from v2 should be supported.
[jira] [Created] (YARN-11691) The yarn web proxy doesn't support HTTP POST method
ude created YARN-11691: -- Summary: The yarn web proxy doesn't support HTTP POST method Key: YARN-11691 URL: https://issues.apache.org/jira/browse/YARN-11691 Project: Hadoop YARN Issue Type: Bug Components: webproxy Affects Versions: 3.3.6, 3.2.2 Reporter: ude Fix For: 3.3.6, 3.2.2 When a Flink job runs in the YARN environment and the client calls a POST endpoint of its REST API through the web proxy, the client encounters HTTP ERROR 405.
[jira] [Created] (YARN-11690) Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios
Benjamin Teke created YARN-11690: Summary: Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios Key: YARN-11690 URL: https://issues.apache.org/jira/browse/YARN-11690 Project: Hadoop YARN Issue Type: Sub-task Components: container-executor Reporter: Benjamin Teke Assignee: Benjamin Teke The container executor function {{write_pid_to_cgroup_as_root}} writes the PID of the newly launched container to the correct cgroup.procs file. However, it checks whether the file is mounted on a cgroup filesystem using the filesystem magic number, which differs between v1 and v2. The check should handle both v1 and v2 filesystems. {code:java}
/**
 * Write the pid of the current process to the cgroup file.
 * cgroup_file: Path to cgroup file where pid needs to be written to.
 */
static int write_pid_to_cgroup_as_root(const char* cgroup_file, pid_t pid) {
  int rc = 0;
  uid_t user = geteuid();
  gid_t group = getegid();
  if (change_effective_user(0, 0) != 0) {
    rc = -1;
    goto cleanup;
  }

  // statfs
  struct statfs buf;
  if (statfs(cgroup_file, &buf) == -1) {
    fprintf(LOGFILE, "Can't statfs file %s as node manager - %s\n",
            cgroup_file, strerror(errno));
    rc = -1;
    goto cleanup;
  } else if (buf.f_type != CGROUP_SUPER_MAGIC) {
    fprintf(LOGFILE, "Pid file %s is not located on cgroup filesystem\n", cgroup_file);
    rc = -1;
    goto cleanup;
  }
{code}
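The direction of the fix, accepting either filesystem magic number, is sketched here in Java for illustration only; the real change would be in the C container-executor, and the constants mirror the values in linux/magic.h:

```java
/**
 * Sketch: a cgroup mount check that accepts both the cgroup v1 and
 * cgroup v2 filesystem magic numbers (values from linux/magic.h).
 */
class CgroupMagicSketch {
  static final long CGROUP_SUPER_MAGIC = 0x27e0ebL;    // cgroup v1
  static final long CGROUP2_SUPER_MAGIC = 0x63677270L; // cgroup v2 ("cgrp")

  /** fsType is what statfs() reports in f_type for the path. */
  static boolean isOnCgroupFs(long fsType) {
    return fsType == CGROUP_SUPER_MAGIC || fsType == CGROUP2_SUPER_MAGIC;
  }
}
```

In the C function above, the equivalent change would be turning the single `buf.f_type != CGROUP_SUPER_MAGIC` comparison into a check against both constants.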
[jira] [Created] (YARN-11689) Update getErrorWithDetails method to provide more meaningful error messages
Benjamin Teke created YARN-11689: Summary: Update getErrorWithDetails method to provide more meaningful error messages Key: YARN-11689 URL: https://issues.apache.org/jira/browse/YARN-11689 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of information. It would be useful to show the underlying exception and its message as well, by default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
[ https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferenc Erdelyi resolved YARN-11662. --- Resolution: Duplicate Duplicate of YARN-11538 > RM Web API endpoint queue reference differs from JMX endpoint for CS > > > Key: YARN-11662 > URL: https://issues.apache.org/jira/browse/YARN-11662 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.4.0 >Reporter: Ferenc Erdelyi >Assignee: Ferenc Erdelyi >Priority: Major > > When a placement is not successful (because of the lack of a placement rule > or an unsuccessful placement), the application is placed in the default queue > instead of the root.default. The parent queue won't be defined when there is > no placement rule. This causes an inconsistency between the JMX endpoint > (reporting the app. runs under the root.default) and the RM Web API endpoint > (reporting the app runs under the default queue). > Similarly, when we submit an application with an unambiguous leaf queue > specified, the RM Web API endpoint will report the queue as the leaf queue > name instead of the full queue path. However, the full queue path is the > expected value to be consistent with the JMX endpoint. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt
[ https://issues.apache.org/jira/browse/YARN-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihui resolved YARN-11688. --- Resolution: Resolved > FS-CS converter: call System.exit replaced with ExitUtil.halt > - > > Key: YARN-11688 > URL: https://issues.apache.org/jira/browse/YARN-11688 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: wangzhihui >Assignee: wangzhihui >Priority: Blocker > Fix For: 3.3.0 > > Attachments: image-2024-04-20-22-17-49-522.png > > > System.exit logic was added in YARN-10191 to avoid the tool > never terminating. > This causes the TestFSConfigToCSConfigConverterMain VM to terminate during > test runs. > The ExitUtil tool in Hadoop Common facilitates process termination for tests > and debugging. > Replacing the System.exit call with ExitUtil.halt would be more suitable for > this purpose. > {code:java} > // code placeholder > Crashed tests: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain > org.apache.maven.surefire.booter.SurefireBooterForkException: > ExecutionException The forked VM terminated without properly saying goodbye. > VM crash or System.exit called? 
> Command was /bin/sh -c cd > /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager > && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m > -XX:+HeapDumpOnOutOfMemoryError -jar > /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar > > /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire > 2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp > surefire_1524181064953128391099tmp > Process Exit Code: 0 > Crashed tests: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain > at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511) > at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458) > at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299) > at > org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247) > at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149) > at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991) > at > org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837) > at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156) > at > 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81) > at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56) > at > org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305) > at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192) > at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105) > at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956) > at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288) > at org.apache.maven.cli.MavenCli.main(MavenCli.java:192) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
[jira] [Created] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt
wangzhihui created YARN-11688: - Summary: FS-CS converter: call System.exit replaced with ExitUtil.halt Key: YARN-11688 URL: https://issues.apache.org/jira/browse/YARN-11688 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: wangzhihui Assignee: wangzhihui Fix For: 3.3.0 Attachments: image-2024-04-20-22-17-49-522.png System.exit logic was added in YARN-10191 to avoid the tool never terminating. This causes the TestFSConfigToCSConfigConverterMain VM to terminate during test runs. The ExitUtil tool in Hadoop Common facilitates process termination for tests and debugging. Replacing the System.exit call with ExitUtil.halt would be more suitable for this purpose. {code:java} // code placeholder Crashed tests: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called? 
Command was /bin/sh -c cd /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -jar /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire 2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp surefire_1524181064953128391099tmp Process Exit Code: 0 Crashed tests: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511) at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458) at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299) at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247) at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149) at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991) at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148) at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81) at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288) at org.apache.maven.cli.MavenCli.main(MavenCli.java:192) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
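The ExitUtil approach can be sketched as a small exit guard: in production it halts the JVM, while tests disable halting so the "exit" surfaces as a catchable exception instead of killing the forked surefire VM. The following is a minimal, illustrative sketch of that pattern; the class and method names here are hypothetical, and the real org.apache.hadoop.util.ExitUtil provides richer terminate/halt variants and test switches.

```java
// Minimal sketch of an ExitUtil-style exit guard (illustrative names only).
class ExitGuard {
    /** Thrown instead of halting when halts are disabled (e.g. in tests). */
    static class HaltException extends RuntimeException {
        final int status;
        HaltException(int status) {
            super("halt(" + status + ")");
            this.status = status;
        }
    }

    private static volatile boolean haltDisabled = false;

    /** Called from test setup so the tool under test cannot kill the test JVM. */
    static void disableHalt() { haltDisabled = true; }

    /** Production path halts the JVM; with halts disabled, the "exit" surfaces
     *  as a catchable exception the test can assert on. */
    static void halt(int status) {
        if (haltDisabled) {
            throw new HaltException(status);
        }
        Runtime.getRuntime().halt(status);
    }
}
```

A converter main method would then call `ExitGuard.halt(status)` where it previously called `System.exit(status)`; a test first calls `disableHalt()` and then asserts on the caught exception's status instead of losing the whole forked VM.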
[jira] [Created] (YARN-11687) Update CGroupsResourceCalculator to track usages using cgroupv2
Benjamin Teke created YARN-11687: Summary: Update CGroupsResourceCalculator to track usages using cgroupv2 Key: YARN-11687 URL: https://issues.apache.org/jira/browse/YARN-11687 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke [CGroupsResourceCalculator|https://github.com/apache/hadoop/blob/f609460bda0c2bd87dd3580158e549e2f34f14d5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsResourceCalculator.java] should also be updated to handle the cgroup v2 changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11686) Correct traversing indices when scheduling asynchronously using Capacity Scheduler
Yihe Li created YARN-11686: -- Summary: Correct traversing indices when scheduling asynchronously using Capacity Scheduler Key: YARN-11686 URL: https://issues.apache.org/jira/browse/YARN-11686 Project: Hadoop YARN Issue Type: Improvement Reporter: Yihe Li When scheduling asynchronously using Capacity Scheduler, the traversing indices in `CapacityScheduler#schedule` will always contain the `start` index twice. This may not be in line with the original intention and needs to be corrected. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
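The off-by-one can be illustrated with a ring traversal over node indices: iterating from `start` through `start + n` inclusive takes n + 1 steps, so the start index is visited twice, while a loop of exactly n steps visits each index once. This is an illustrative sketch, not the actual `CapacityScheduler#schedule` code.

```java
import java.util.ArrayList;
import java.util.List;

class RingTraversal {
    /** Buggy: i = start .. start + n (inclusive) is n + 1 steps, so the
     *  start index is visited twice. */
    static List<Integer> buggy(int n, int start) {
        List<Integer> order = new ArrayList<>();
        for (int i = start; i <= start + n; i++) {
            order.add(i % n);
        }
        return order;
    }

    /** Fixed: exactly n steps, so each index appears exactly once. */
    static List<Integer> fixed(int n, int start) {
        List<Integer> order = new ArrayList<>();
        for (int i = start; i < start + n; i++) {
            order.add(i % n);
        }
        return order;
    }
}
```

For n = 4 and start = 1, the buggy traversal yields [1, 2, 3, 0, 1] while the fixed one yields [1, 2, 3, 0].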
[jira] [Created] (YARN-11685) Create a config to enable/disable cgroup v2 functionality
Benjamin Teke created YARN-11685: Summary: Create a config to enable/disable cgroup v2 functionality Key: YARN-11685 URL: https://issues.apache.org/jira/browse/YARN-11685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Various OSes mount cgroup v2 differently: some mount both the v1 and v2 structures, others mount a hybrid structure. To avoid initialization issues, the cgroup v1/v2 functionality should be selected by a config property. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11684) PriorityQueueComparator violates general contract
Tamas Domok created YARN-11684: -- Summary: PriorityQueueComparator violates general contract Key: YARN-11684 URL: https://issues.apache.org/jira/browse/YARN-11684 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.5.0 Reporter: Tamas Domok Assignee: Tamas Domok YARN-10178 tried to fix the issue, but there are still two properties that might change during sorting, which causes an exception. {code} 2024-04-10 12:36:56,420 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-28,5,main] threw an Exception. java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeHi(TimSort.java:899) at java.util.TimSort.mergeAt(TimSort.java:516) at java.util.TimSort.mergeCollapse(TimSort.java:441) at java.util.TimSort.sort(TimSort.java:245) at java.util.Arrays.sort(Arrays.java:1512) at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:348) at java.util.stream.Sink$ChainedReference.end(Sink.java:258) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:483) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1719) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1654) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1811) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1557) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:539) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:591) {code} The `queue.getAccessibleNodeLabels()` and `queue.getPriority()` could change in another thread while the `queues` are being sorted. Those should be saved when constructing the PriorityQueueResourcesForSorting helper object. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
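The proposed fix can be sketched as follows: rather than letting the comparator read live, mutable queue fields (which another thread can change mid-sort, making comparisons mutually inconsistent and tripping TimSort's contract check), each queue's sort keys are copied into an immutable holder before sorting. The names below are illustrative, not the actual PriorityQueueResourcesForSorting class.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SnapshotSort {
    /** Live queue object whose sort key another thread may mutate at any time. */
    static class Queue {
        final String name;
        volatile int priority;
        Queue(String name, int priority) { this.name = name; this.priority = priority; }
    }

    /** Immutable holder: the sort key is frozen once, before sorting starts,
     *  so the comparator sees a consistent value for the whole sort. */
    static class Snapshot {
        final Queue queue;
        final int priority;
        Snapshot(Queue q) { this.queue = q; this.priority = q.priority; }
    }

    static List<Queue> sortByPriority(List<Queue> queues) {
        return queues.stream()
            .map(Snapshot::new)                                    // freeze keys first
            .sorted(Comparator.comparingInt((Snapshot s) -> s.priority))
            .map(s -> s.queue)
            .collect(Collectors.toList());
    }
}
```

Because the comparator only ever touches the frozen `Snapshot` fields, a concurrent change to `Queue.priority` can no longer make the ordering inconsistent during the sort.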
[jira] [Created] (YARN-11683) RM crash due to RELEASE_CONTAINER NPE
Yuan Luo created YARN-11683: --- Summary: RM crash due to RELEASE_CONTAINER NPE Key: YARN-11683 URL: https://issues.apache.org/jira/browse/YARN-11683 Project: Hadoop YARN Issue Type: Bug Components: yarn Reporter: Yuan Luo -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11681) Update the cgroup documentation with v2 support
Benjamin Teke created YARN-11681: Summary: Update the cgroup documentation with v2 support Key: YARN-11681 URL: https://issues.apache.org/jira/browse/YARN-11681 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Update the related [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html] with v2 support. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11680) Update FpgaResourceHandler for cgroup v2 support
Benjamin Teke created YARN-11680: Summary: Update FpgaResourceHandler for cgroup v2 support Key: YARN-11680 URL: https://issues.apache.org/jira/browse/YARN-11680 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if FpgaResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/fpga/FpgaResourceHandlerImpl.java#L55] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11679) Update GpuResourceHandler for cgroup v2 support
Benjamin Teke created YARN-11679: Summary: Update GpuResourceHandler for cgroup v2 support Key: YARN-11679 URL: https://issues.apache.org/jira/browse/YARN-11679 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if GpuResourceHandler's [implementation|https://github.com/apache/hadoop/blob/e8fa192f07b6f2e7a0b03813edca03c505a8ac1b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java#L45] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11678) Update CGroupElasticMemoryController for cgroup v2 support
Benjamin Teke created YARN-11678: Summary: Update CGroupElasticMemoryController for cgroup v2 support Key: YARN-11678 URL: https://issues.apache.org/jira/browse/YARN-11678 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CGroupElasticMemoryController's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupElasticMemoryController.java#L58] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11677) Update OutboundBandwidthResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11677: Summary: Update OutboundBandwidthResourceHandler implementation for cgroup v2 support Key: YARN-11677 URL: https://issues.apache.org/jira/browse/YARN-11677 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if OutboundBandwidthResourceHandler's [implementation|https://github.com/apache/hadoop/blob/2064ca015d1584263aac0cc20c60b925a3aff612/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficControlBandwidthHandlerImpl.java#L43] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11676) Update CGroupsBlkioResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11676: Summary: Update CGroupsBlkioResourceHandler implementation for cgroup v2 support Key: YARN-11676 URL: https://issues.apache.org/jira/browse/YARN-11676 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CGroupsBlkioResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsBlkioResourceHandlerImpl.java#L46] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11675) Update MemoryResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11675: Summary: Update MemoryResourceHandler implementation for cgroup v2 support Key: YARN-11675 URL: https://issues.apache.org/jira/browse/YARN-11675 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if MemoryResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11674) Update CpuResourceHandler implementation for cgroup v2 support
Benjamin Teke created YARN-11674: Summary: Update CpuResourceHandler implementation for cgroup v2 support Key: YARN-11674 URL: https://issues.apache.org/jira/browse/YARN-11674 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke cgroup v2 has some changes in various controllers (some changed their functionality, some were removed). This task is about checking if CpuResourceHandler's [implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsCpuResourceHandlerImpl.java#L60] needs any updates. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11673) Extend the cgroup mount functionality to mount the v2 structure
Benjamin Teke created YARN-11673: Summary: Extend the cgroup mount functionality to mount the v2 structure Key: YARN-11673 URL: https://issues.apache.org/jira/browse/YARN-11673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke YARN has a --mount-cgroup operation in the [container-executor|https://github.com/apache/hadoop/blob/9c7b8cf54ea88833d54fc71a9612c448dc0eb78d/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L2929] which mounts each controller's cgroup folder to a specified path. In cgroup v2 the controller structure changed: it is now flat, so there are no longer separate controller paths. To remain compatible with v1, a new mount method should be added, though its functionality could be simplified quite a bit for v2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
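The structural difference can be sketched as path construction: under v1 each controller has its own hierarchy, while under v2 a single unified hierarchy carries all controllers, which are enabled through the parent's cgroup.subtree_control file rather than via separate per-controller mounts. The paths below are conventional defaults and purely illustrative; actual mount points depend on the distribution and configuration.

```java
// Illustrative sketch of v1 vs v2 cgroup path layout (not YARN's actual code).
class CgroupPaths {
    // Conventional mount root; the real location depends on distro and config.
    static final String ROOT = "/sys/fs/cgroup";

    /** v1: one hierarchy per controller, e.g. /sys/fs/cgroup/cpu/yarn/<container>. */
    static String v1Path(String controller, String yarnHierarchy, String container) {
        return ROOT + "/" + controller + "/" + yarnHierarchy + "/" + container;
    }

    /** v2: one flat, unified hierarchy, e.g. /sys/fs/cgroup/yarn/<container>;
     *  controllers are enabled in the parent's cgroup.subtree_control file. */
    static String v2Path(String yarnHierarchy, String container) {
        return ROOT + "/" + yarnHierarchy + "/" + container;
    }
}
```

This is why a v2 mount helper can be much simpler: there is exactly one directory tree to mount and delegate, instead of one mount per controller.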
[jira] [Created] (YARN-11672) Create a CgroupHandler implementation for cgroup v2
Benjamin Teke created YARN-11672: Summary: Create a CgroupHandler implementation for cgroup v2 Key: YARN-11672 URL: https://issues.apache.org/jira/browse/YARN-11672 Project: Hadoop YARN Issue Type: Sub-task Reporter: Benjamin Teke Assignee: Benjamin Teke The current [CGroupsHandler|https://github.com/apache/hadoop/blob/69b328943edf2f61c8fc139934420e3f10bf3813/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsHandler.java#L36] implementation holds the functionality to mount and set up the YARN-specific cgroup v1 structure. A similar v2 implementation should be created that allows initialising the v2 structure. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11670) Add CallerContext in NodeManager
[ https://issues.apache.org/jira/browse/YARN-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Chitlangia resolved YARN-11670. -- Fix Version/s: 3.5.0 Resolution: Fixed Thanks [~yangjiandan] for contribution and [~whbing] and [~slfan1989] for reviews. > Add CallerContext in NodeManager > > > Key: YARN-11670 > URL: https://issues.apache.org/jira/browse/YARN-11670 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Currently, MR and Spark have added caller context, enabling tracing of > HDFS/ResourceManager operators from Spark apps and MapReduce apps. However, > operators from NodeManagers cannot be identified in the audit log. For > example, HDFS operations issued from NodeManagers during resource > localization cannot be identified. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11444) Improve YARN md documentation format
[ https://issues.apache.org/jira/browse/YARN-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11444. --- Fix Version/s: 3.4.1 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.4.1, 3.5.0 (was: 3.5.0) Resolution: Fixed > Improve YARN md documentation format > > > Key: YARN-11444 > URL: https://issues.apache.org/jira/browse/YARN-11444 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1, 3.5.0 > > > 1. Modify some typo errors -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11671) Tests in hadoop-yarn-server-router are not running
Ayush Saxena created YARN-11671: --- Summary: Tests in hadoop-yarn-server-router are not running Key: YARN-11671 URL: https://issues.apache.org/jira/browse/YARN-11671 Project: Hadoop YARN Issue Type: Bug Reporter: Ayush Saxena [https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/1549/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt] {noformat} [INFO] --- maven-surefire-plugin:3.0.0-M1:test (default-test) @ hadoop-yarn-server-router --- [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] [INFO] Results: [INFO] [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] BUILD SUCCESS [INFO] {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11663) [Federation] Add Cache Entity Nums Limit.
[ https://issues.apache.org/jira/browse/YARN-11663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11663. --- Fix Version/s: 3.4.1 3.5.0 Target Version/s: 3.4.0 Assignee: Shilun Fan Resolution: Fixed > [Federation] Add Cache Entity Nums Limit. > - > > Key: YARN-11663 > URL: https://issues.apache.org/jira/browse/YARN-11663 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation, yarn >Affects Versions: 3.4.0 >Reporter: Yuan Luo >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1, 3.5.0 > > Attachments: image-2024-03-14-18-12-28-426.png, > image-2024-03-14-18-12-49-950.png, image-2024-03-15-10-50-32-860.png > > > !image-2024-03-14-18-12-28-426.png! > !image-2024-03-14-18-12-49-950.png! > hi [~slfan1989] After applying this feature to our prod env, I found the memory > of the router keeps growing over time. This is because after jobs finish, > we won't access the expired key to trigger the cleanup mechanism. Would it be > better to add a cache maximum number limit? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11668. --- Fix Version/s: 3.4.1 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.4.1 Assignee: Junfan Zhang Resolution: Fixed > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1, 3.5.0 > > Attachments: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > > > The RM crashes when encountering the stacktrace shown in the attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11670) Add CallerContext in NodeManager
Jiandan Yang created YARN-11670: Summary: Add CallerContext in NodeManager Key: YARN-11670 URL: https://issues.apache.org/jira/browse/YARN-11670 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Jiandan Yang Currently, MR and Spark have added caller context, enabling tracing of HDFS/ResourceManager operations from Spark apps and MapReduce apps. However, operations from NodeManagers cannot be identified in the audit log. For example, HDFS operations issued from NodeManagers during resource localization cannot be identified.
[jira] [Created] (YARN-11669) cgroups v2 support for YARN
Ferenc Erdelyi created YARN-11669: - Summary: cgroups v2 support for YARN Key: YARN-11669 URL: https://issues.apache.org/jira/browse/YARN-11669 Project: Hadoop YARN Issue Type: New Feature Components: yarn Reporter: Ferenc Erdelyi cgroups v2 is becoming the default on operating systems such as RHEL 9, so support for it has to be implemented in YARN.
[jira] [Created] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
Junfan Zhang created YARN-11668: --- Summary: Potential concurrent modification exception for node attributes of node manager Key: YARN-11668 URL: https://issues.apache.org/jira/browse/YARN-11668 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang
[jira] [Resolved] (YARN-11667) Federation: ResourceRequestComparator occurs NPE when using low version of hadoop submit application
[ https://issues.apache.org/jira/browse/YARN-11667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang resolved YARN-11667. - Resolution: Won't Do > Federation: ResourceRequestComparator occurs NPE when using low version of > hadoop submit application > > > Key: YARN-11667 > URL: https://issues.apache.org/jira/browse/YARN-11667 > Project: Hadoop YARN > Issue Type: Bug > Components: amrmproxy >Affects Versions: 3.4.0 >Reporter: qiuliang >Priority: Major > Labels: pull-request-available > > When an application is submitted using a lower version of Hadoop, the > ResourceRequest built by the AM has no ExecutionTypeRequest. After the > ResourceRequest is submitted to AMRMProxy, an NPE occurs when AMRMProxy > reconstructs the AllocateRequest to add the ResourceRequest to its ask.
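The failure mode described above — a comparator dereferencing a field that an older client simply never set — can be sketched with a null-tolerant ordering. The `Req` class and its fields below are illustrative stand-ins, not Hadoop's actual ResourceRequest API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NullSafeComparatorSketch {
    public static void main(String[] args) {
        List<Req> asks = new ArrayList<>();
        asks.add(new Req(1, "GUARANTEED"));
        asks.add(new Req(0, null)); // old-client request with no execution type

        // nullsFirst() tolerates the missing field instead of throwing an NPE
        asks.sort(Comparator.comparing((Req r) -> r.priority)
                .thenComparing(r -> r.execType,
                        Comparator.nullsFirst(Comparator.naturalOrder())));

        System.out.println(asks.get(0).priority);
    }
}

// Stand-in for a resource request whose execution-type field may be absent.
class Req {
    final Integer priority;
    final String execType; // may be null when an old client omits ExecutionTypeRequest

    Req(Integer priority, String execType) {
        this.priority = priority;
        this.execType = execType;
    }
}
```

Sorting succeeds and prints `0` (the lower priority sorts first) even though one request carries no execution type.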
[jira] [Resolved] (YARN-11626) Optimization of the safeDelete operation in ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Chitlangia resolved YARN-11626. -- Fix Version/s: 3.5.0 Resolution: Fixed > Optimization of the safeDelete operation in ZKRMStateStore > -- > > Key: YARN-11626 > URL: https://issues.apache.org/jira/browse/YARN-11626 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0 >Reporter: wangzhihui >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > > h1. Description > * It can be observed that removing the app info started at 06:17:20, but the > NoNodeException was received at 06:17:35. > * During the 15s interval, Curator was retrying the metadata operation. Due > to the non-idempotent nature of the ZooKeeper deletion operation, in one of > the retry attempts the metadata operation succeeded but no response was > received. The next retry then resulted in a NoNodeException, triggering the > STATE_STORE_FENCED event and ultimately causing the current ResourceManager > to switch to standby. 
> {code:java} > 2023-10-28 06:17:20,359 INFO recovery.RMStateStore > (RMStateStore.java:transition(333)) - Removing info for app: > application_1697410508608_140368 > 2023-10-28 06:17:20,359 INFO resourcemanager.RMAppManager > (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be > expired, max number of completed apps kept in memory met: > maxCompletedAppsInMemory = 1000, removing app > application_1697410508608_140368 from memory: > 2023-10-28 06:17:35,665 ERROR recovery.RMStateStore > (RMStateStore.java:transition(337)) - Error removing app: > application_1697410508608_140368 > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:111) > 2023-10-28 06:17:35,666 INFO recovery.RMStateStore > (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from > ACTIVE to FENCED > 2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager > (ResourceManager.java:handle(898)) - Received RMFatalEvent of type > STATE_STORE_FENCED, caused by > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode > 2023-10-28 06:17:35,666 INFO resourcemanager.ResourceManager > (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby > state > {code} > h1. Solution > The NoNodeException clearly indicates that the Znode no longer exists, so we > can safely ignore this exception to avoid triggering a larger impact on the > cluster caused by ResourceManager failover. > h1. Other > We also need to discuss and optimize the same issues in safeCreate.
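The solution described above — treating "node already gone" as success, because a retried delete may have succeeded on an earlier attempt whose response was lost — can be sketched as follows. `FakeZk` is a stand-in for the ZooKeeper client; this is not the actual ZKRMStateStore code:

```java
import java.util.HashSet;
import java.util.Set;

class SafeDeleteSketch {
    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.create("/rmstore/app_1");
        safeDelete(zk, "/rmstore/app_1"); // first delete succeeds
        safeDelete(zk, "/rmstore/app_1"); // a "retry" finds no node; ignored
        System.out.println("safe-delete ok");
    }

    static void safeDelete(FakeZk zk, String path) {
        try {
            zk.delete(path);
        } catch (NoNodeException e) {
            // The znode is already gone, which is the desired end state,
            // so swallow the exception instead of fencing the RM.
        }
    }
}

// Checked exception mirroring KeeperException.NoNodeException.
class NoNodeException extends Exception {}

// Minimal in-memory stand-in for a ZooKeeper namespace.
class FakeZk {
    private final Set<String> nodes = new HashSet<>();
    void create(String path) { nodes.add(path); }
    void delete(String path) throws NoNodeException {
        if (!nodes.remove(path)) throw new NoNodeException();
    }
}
```

With this guard, a lost-response retry no longer escalates into a STATE_STORE_FENCED event and an RM failover.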
[jira] [Created] (YARN-11667) Federation: ResourceRequestComparator occurs NPE when using low version of hadoop submit application
qiuliang created YARN-11667: --- Summary: Federation: ResourceRequestComparator occurs NPE when using low version of hadoop submit application Key: YARN-11667 URL: https://issues.apache.org/jira/browse/YARN-11667 Project: Hadoop YARN Issue Type: Bug Components: amrmproxy Affects Versions: 3.4.0 Reporter: qiuliang When an application is submitted using a lower version of Hadoop, the ResourceRequest built by the AM has no ExecutionTypeRequest. After the ResourceRequest is submitted to AMRMProxy, an NPE occurs when AMRMProxy reconstructs the AllocateRequest to add the ResourceRequest to its ask.
[jira] [Resolved] (YARN-5305) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III
[ https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-5305. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Yarn Application Log Aggregation fails due to NM can not get correct HDFS > delegation token III > -- > > Key: YARN-5305 > URL: https://issues.apache.org/jira/browse/YARN-5305 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Xianyin Xin >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Different from YARN-5098 and YARN-5302, this problem happens when the AM submits > a startContainer request with a new HDFS token (say, tokenB) which is not > managed by YARN, so two tokens exist in the credentials of the user on the NM: > one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is > selected when connecting to HDFS and tokenB expires, an exception happens. > Supplementary: this problem happens because the AM didn't use the service name > as the token alias in the credentials, so two tokens for the same service can > co-exist in one credentials object. TokenSelector can only select the first matched > token; it doesn't care whether the token is valid or not.
[jira] [Created] (YARN-11666) NullPointerException in TestSLSRunner.testSimulatorRunning
Elen Chatikyan created YARN-11666: - Summary: NullPointerException in TestSLSRunner.testSimulatorRunning Key: YARN-11666 URL: https://issues.apache.org/jira/browse/YARN-11666 Project: Hadoop YARN Issue Type: Bug Environment: {*}Operating System{*}: macOS (Sonoma 14.2.1 (23C71)) {*}Hardware{*}: MacBook Air 2023 {*}IDE{*}: IntelliJ IDEA (2023.3.2 (Ultimate Edition)) {*}Java Version{*}: OpenJDK version "1.8.0_292" Reporter: Elen Chatikyan *What happened:* In the *TestSLSRunner* class of the Apache Hadoop YARN SLS (Scheduler Load Simulator) framework, a *NullPointerException* is thrown during the teardown of parameterized tests, when the stop method is called on the ResourceManager (rm) object in {_}RMRunner.java{_}. This issue occurs under test conditions that involve mismatches between trace types (RUMEN, SLS, SYNTH) and their corresponding trace files, leading to scenarios where the rm object may not be properly initialized before the stop method is invoked. *Buggy code:* The issue is located in the *{{RMRunner.java}}* file within the *{{stop}}* method: {code:java} public void stop() { rm.stop(); } {code} The root cause of the *{{NullPointerException}}* is the lack of a null check for the {{rm}} object before calling its {{stop}} method. Under any condition where the *{{ResourceManager}}* fails to initialize correctly, attempting to stop it leads to a null pointer dereference. After fixing {*}RMRunner.java{*}, TaskRunner should also be fixed. 
+TaskRunner.java+ {code:java} public void stop() throws InterruptedException { executor.shutdownNow(); executor.awaitTermination(20, TimeUnit.SECONDS); } {code} {*}How to trigger this bug:{*} * Change the parameterized unit test's (TestSLSRunner.java) data method to include one/both of the following test cases: * {capScheduler, "SYNTH", rumenTraceFile, nodeFile } * {capScheduler, "SYNTH", slsTraceFile, nodeFile } * Execute the *TestSLSRunner* test suite, particularly the *testSimulatorRunning* method. * Observe the resulting *NullPointerException* in the test output (triggered in RMRunner.java). {panel:title=Example stack trace from the test output:} [ERROR] testSimulatorRunning[Testing with: SYNTH, org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler, (nodeFile null)](org.apache.hadoop.yarn.sls.TestSLSRunner) Time elapsed: 3.027 s <<< ERROR! java.lang.NullPointerException at org.apache.hadoop.yarn.sls.RMRunner.stop(RMRunner.java:127) at org.apache.hadoop.yarn.sls.SLSRunner.stop(SLSRunner.java:320) at org.apache.hadoop.yarn.sls.BaseSLSRunnerTest.tearDown(BaseSLSRunnerTest.java:68) ... {panel} ___ _{color:#172b4d}The bug can be fixed by implementing a null check for the {{rm}} object within the *{{RMRunner.java}}* {{stop}} method before calling any methods on it (same for the executor object in TaskRunner.java).{color}_
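The proposed fix — guarding the stop() call with a null check so a component that failed to initialize does not blow up during teardown — can be sketched as below. `Stoppable` is a stand-in for the real rm/executor objects, not the actual SLS code:

```java
class NullGuardedStopSketch {
    interface Stoppable { void stop(); }

    private Stoppable rm; // stays null when initialization failed

    void stop() {
        if (rm != null) { // the fix: skip the call instead of dereferencing null
            rm.stop();
        }
    }

    public static void main(String[] args) {
        // rm was never set (as when a trace-type mismatch aborts startup),
        // yet teardown completes without a NullPointerException.
        new NullGuardedStopSketch().stop();
        System.out.println("teardown ok");
    }
}
```

The same guard applies to the executor field in TaskRunner's stop().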
[jira] [Created] (YARN-11665) hive jobs support aggregating logs according to real users
zeekling created YARN-11665: --- Summary: hive jobs support aggregating logs according to real users Key: YARN-11665 URL: https://issues.apache.org/jira/browse/YARN-11665 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Reporter: zeekling Currently, Hive job logs are stored in /tmp/logs/hive/bucket/appId. Can we aggregate logs according to the real user running the Hive job, e.g. /tmp/logs/hive/\{real user}/bucket/appId?
[jira] [Created] (YARN-11664) Remove HDFS Binaries/Jars Dependency From Yarn
Syed Shameerur Rahman created YARN-11664: Summary: Remove HDFS Binaries/Jars Dependency From Yarn Key: YARN-11664 URL: https://issues.apache.org/jira/browse/YARN-11664 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Syed Shameerur Rahman In principle, Hadoop YARN is independent of HDFS: it can work with any filesystem. Currently, however, some YARN code depends on HDFS, which requires YARN to bring some of the HDFS binaries/jars onto its classpath. The idea behind this jira is to remove this dependency so that YARN can run without HDFS binaries/jars. *Scope* 1. Non-test classes are considered 2. Some test classes which come in as a transitive dependency are considered *Out of scope* 1. All test classes in the YARN module are not considered A quick search in the YARN module revealed the following HDFS dependencies 1. Constants {code:java} import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier; import org.apache.hadoop.hdfs.DFSConfigKeys;{code} 2. Exception {code:java} import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException; import org.apache.hadoop.hdfs.protocol.QuotaExceededException; (Comes as a transitive dependency from DSQuotaExceededException){code} 3. Utility {code:java} import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code} Both YARN and HDFS depend on the hadoop-common module; one straightforward approach is to move all these dependencies to hadoop-common so that both HDFS and YARN can pick up the imports.
[jira] [Resolved] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression
[ https://issues.apache.org/jira/browse/YARN-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11660. --- Resolution: Fixed > SingleConstraintAppPlacementAllocator performance regression > > > Key: YARN-11660 > URL: https://issues.apache.org/jira/browse/YARN-11660 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 3.4.1 >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Fix For: 3.4.1 > >
[jira] [Created] (YARN-11663) Router cache expansion issue
Yuan Luo created YARN-11663: --- Summary: Router cache expansion issue Key: YARN-11663 URL: https://issues.apache.org/jira/browse/YARN-11663 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.3.6 Reporter: Yuan Luo
[jira] [Resolved] (YARN-11661) Adding new property to configure the "SameSite" cookie attribute on YARN UI
[ https://issues.apache.org/jira/browse/YARN-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Susheel Gupta resolved YARN-11661. -- Hadoop Flags: Reviewed Resolution: Workaround Closing this as workaround exists. > Adding new property to configure the "SameSite" cookie attribute on YARN UI > > > Key: YARN-11661 > URL: https://issues.apache.org/jira/browse/YARN-11661 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Susheel Gupta >Assignee: Susheel Gupta >Priority: Major > > If we use 'SameSite=Strict,' the browser would only send the cookie for > same-site requests, rendering cross-site sessions ineffective. > However, it’s worth noting that while using SameSite=None with TLS does > enhance the security of your cookies compared to using it without TLS, it > doesn’t provide complete security. Nevertheless, considering the necessity > for cross-site sessions, utilizing SameSite=None along with TLS can provide a > reasonable level of security.
[jira] [Created] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS
Ferenc Erdelyi created YARN-11662: - Summary: RM Web API endpoint queue reference differs from JMX endpoint for CS Key: YARN-11662 URL: https://issues.apache.org/jira/browse/YARN-11662 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Ferenc Erdelyi When a placement is not successful (because of a missing placement rule or an unsuccessful placement), the application is placed in the default queue instead of root.default. The parent queue won't be defined when there is no placement rule. This causes an inconsistency between the JMX endpoint (reporting that the app runs under root.default) and the RM Web API endpoint (reporting that the app runs under the default queue). Similarly, when we submit an application with an unambiguous leaf queue specified, the RM Web API endpoint will report the queue as the leaf queue name instead of the full queue path. However, the full queue path is the expected value, to be consistent with the JMX endpoint. I propose using the scheduler's getQueueInfo in the RMAppManager to parse the queue name and get the full queue path for the placementQueueName, which fixes the above issue.
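The normalization step proposed above — reporting a queue by its full path rather than its bare leaf name — amounts to a lookup against the scheduler's queue hierarchy. The map below is a hypothetical stand-in for that hierarchy; this is a sketch of the idea, not the RMAppManager change itself:

```java
import java.util.Map;

class QueuePathSketch {
    // Resolve a (possibly bare) leaf queue name to its full path; fall back
    // to the given name when the scheduler does not know the queue.
    static String fullPath(Map<String, String> leafToPath, String name) {
        return leafToPath.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        // Hypothetical queue hierarchy for illustration.
        Map<String, String> queues = Map.of(
                "default", "root.default",
                "analytics", "root.teams.analytics");
        System.out.println(fullPath(queues, "default"));
        System.out.println(fullPath(queues, "analytics"));
    }
}
```

With this lookup in place, both the Web API and JMX would report `root.default` for a failed placement, removing the inconsistency.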
[jira] [Created] (YARN-11661) Adding new property to configure the "SameSite" cookie attribute on YARN UI
Susheel Gupta created YARN-11661: Summary: Adding new property to configure the "SameSite" cookie attribute on YARN UI Key: YARN-11661 URL: https://issues.apache.org/jira/browse/YARN-11661 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Susheel Gupta If we use 'SameSite=Strict,' the browser would only send the cookie for same-site requests, rendering cross-site sessions ineffective. However, it’s worth noting that while using SameSite=None with TLS does enhance the security of your cookies compared to using it without TLS, it doesn’t provide complete security. Nevertheless, considering the necessity for cross-site sessions, utilizing SameSite=None along with TLS can provide a reasonable level of security.
[jira] [Created] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression
Junfan Zhang created YARN-11660: --- Summary: SingleConstraintAppPlacementAllocator performance regression Key: YARN-11660 URL: https://issues.apache.org/jira/browse/YARN-11660 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang
[jira] [Created] (YARN-11659) app submission fast fail with node label when node label is disable
Junfan Zhang created YARN-11659: --- Summary: app submission fast fail with node label when node label is disable Key: YARN-11659 URL: https://issues.apache.org/jira/browse/YARN-11659 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang
[jira] [Created] (YARN-11658) ATS to make minimum HBase version 2.x
Steve Loughran created YARN-11658: - Summary: ATS to make minimum HBase version 2.x Key: YARN-11658 URL: https://issues.apache.org/jira/browse/YARN-11658 Project: Hadoop YARN Issue Type: Improvement Components: timelineserver Affects Versions: 3.4.0 Reporter: Steve Loughran following on from YARN-11657, what if we cut hbase 1.x support from ATS *entirely*? YARN-3 implies that the 2.x version might need to be bumped up
[jira] [Resolved] (YARN-11657) Remove protobuf-2.5 as dependency of hadoop-yarn-api
[ https://issues.apache.org/jira/browse/YARN-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran resolved YARN-11657. --- Fix Version/s: 3.3.9 3.4.1 3.5.0 Resolution: Fixed > Remove protobuf-2.5 as dependency of hadoop-yarn-api > > > Key: YARN-11657 > URL: https://issues.apache.org/jira/browse/YARN-11657 > Project: Hadoop YARN > Issue Type: Improvement > Components: api >Affects Versions: 3.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Labels: pull-request-available > Fix For: 3.3.9, 3.4.1, 3.5.0 > > > hadoop-yarn-api is still exporting protobuf-2.5. > if we can cut this, we should > {code} > [echo] [INFO] +- > org.apache.hadoop:hadoop-yarn-server-common:jar:3.4.0:compile > [echo] [INFO] | +- org.apache.hadoop:hadoop-yarn-api:jar:3.4.0:compile > [echo] [INFO] | | +- > (org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.2.0:compile - omitted > for duplicate) > [echo] [INFO] | | +- (javax.xml.bind:jaxb-api:jar:2.2.11:compile - > omitted for duplicate) > [echo] [INFO] | | +- > (org.apache.hadoop:hadoop-annotations:jar:3.4.0:compile - omitted for > duplicate) > [echo] [INFO] | | +- > com.google.protobuf:protobuf-java:jar:2.5.0:compile > [echo] [INFO] | | +- > (org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_21:jar:1.2.0:compile - > omitted for duplicate) > [echo] [INFO] | | \- > (com.fasterxml.jackson.core:jackson-annotations:jar:2.12.7:compile - omitted > for duplicate) > {code}
[jira] [Created] (YARN-11657) Remove protobuf-2.5 as dependency of hadoop-yarn-api
Steve Loughran created YARN-11657: - Summary: Remove protobuf-2.5 as dependency of hadoop-yarn-api Key: YARN-11657 URL: https://issues.apache.org/jira/browse/YARN-11657 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 3.4.0 Reporter: Steve Loughran Assignee: Steve Loughran hadoop-yarn-api is still exporting protobuf-2.5. if we can cut this, we should {code} [echo] [INFO] +- org.apache.hadoop:hadoop-yarn-server-common:jar:3.4.0:compile [echo] [INFO] | +- org.apache.hadoop:hadoop-yarn-api:jar:3.4.0:compile [echo] [INFO] | | +- (org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.2.0:compile - omitted for duplicate) [echo] [INFO] | | +- (javax.xml.bind:jaxb-api:jar:2.2.11:compile - omitted for duplicate) [echo] [INFO] | | +- (org.apache.hadoop:hadoop-annotations:jar:3.4.0:compile - omitted for duplicate) [echo] [INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile [echo] [INFO] | | +- (org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_21:jar:1.2.0:compile - omitted for duplicate) [echo] [INFO] | | \- (com.fasterxml.jackson.core:jackson-annotations:jar:2.12.7:compile - omitted for duplicate) {code}
[jira] [Created] (YARN-11656) RMStateStore event queue blocked
Bence Kosztolnik created YARN-11656: --- Summary: RMStateStore event queue blocked Key: YARN-11656 URL: https://issues.apache.org/jira/browse/YARN-11656 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.1 Reporter: Bence Kosztolnik Attachments: issue.png I observed that a YARN cluster had both pending and available resources, yet cluster utilization was usually only around ~50%. The cluster was loaded with 200 parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where each node had 8 cores and a lot of memory (there was a CPU bottleneck). Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to persist an RMStateStoreEvent (using FileSystemRMStateStore). To reduce the impact of the issue: - create a dispatcher where events can be persisted in parallel threads - create metric data for the RMStateStore event queue to make it easy to identify the problem if it occurs on a cluster {panel:title=Issue visible on UI2} {panel}
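The two mitigations listed above — persisting events on parallel threads and exposing the pending-queue depth as a metric — can be sketched together as below. All names are illustrative; this is not the actual FileSystemRMStateStore or dispatcher code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelStateStoreSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger pending = new AtomicInteger(); // metric source

    // Hand the persist work to the pool so one slow IO operation
    // (1~20s in the report) no longer blocks the dispatcher queue.
    void handleStoreEvent(Runnable persistAction) {
        pending.incrementAndGet();
        pool.submit(() -> {
            try {
                persistAction.run();
            } finally {
                pending.decrementAndGet();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        ParallelStateStoreSketch store = new ParallelStateStoreSketch();
        for (int i = 0; i < 10; i++) {
            store.handleStoreEvent(() -> { /* pretend to write to the store */ });
        }
        store.pool.shutdown();
        store.pool.awaitTermination(10, TimeUnit.SECONDS);
        // Once all events have been persisted, the queue-depth metric is zero.
        System.out.println("pending=" + store.pending.get());
    }
}
```

The `pending` counter is the value one would publish as the RMStateStore event-queue metric; a persistently high reading would flag exactly the IO bottleneck described here.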
[jira] [Created] (YARN-11655) modify default value of Allocated GPUs and Reserved GPUs in yarn scheduler webui from -1 to 0
wangzhongwei created YARN-11655: --- Summary: modify default value of Allocated GPUs and Reserved GPUs in yarn scheduler webui from -1 to 0 Key: YARN-11655 URL: https://issues.apache.org/jira/browse/YARN-11655 Project: Hadoop YARN Issue Type: Improvement Components: yarn-common Affects Versions: 3.3.3 Reporter: wangzhongwei Assignee: wangzhongwei Attachments: image-2024-02-20-15-15-34-996.png In the YARN scheduler web UI, it may be better to set the default value of Allocated GPUs and Reserved GPUs to 0: when no GPUs are used, these values should be 0. !image-2024-02-20-15-15-34-996.png|width=486,height=235!
[jira] [Created] (YARN-11654) [JDK17] TestLinuxContainerExecutorWithMocks.testStartLocalizer fails
Bilwa S T created YARN-11654: Summary: [JDK17] TestLinuxContainerExecutorWithMocks.testStartLocalizer fails Key: YARN-11654 URL: https://issues.apache.org/jira/browse/YARN-11654 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.4.0 Reporter: Bilwa S T Assignee: Bilwa S T Expected size:<26> but was:<28> in: <["nobody", "test", "0", "application_0", "12345", "/bin/nmPrivateCTokensPath", "/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/tmp/nm-local-dir", "src/test/resources", "/usr/lib/jvm/jdk-17.0.9/bin/java", "-classpath", "/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-classes:/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/classes:/home/mwapp/.m2/repository/org/apache/hadoop/hadoop-common/3.3.6-13/hadoop-common-3.3.6-13.jar:/home/mwapp/.m2/repository/org/apache/hadoop/thirdparty/hadoop-shaded-protobuf_3_7/1.1.1/hadoop-shaded-protobuf_3_7-1.1.1.jar:/home/mwapp/.m2/repository/com/google/guava/guava/32.0.1-jre/guava-32.0.1-jre.jar:/home/mwapp/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar:/home/mwapp/.m2/repository/com/google/guava/listenablefuture/.0-empty-to-avoid-conflict-with-guava/listenablefuture-.0-empty-to-avoid-conflict-with-guava.jar:/home/mwapp/.m2/repository/org/checkerframework/checker-qual/3.33.0/checker-qual-3.33.0.jar:/home/mwapp/.m2/repository/com/google/j2objc/j2objc-annotations/2.8/j2objc-annotations-2.8.jar:/home/mwapp/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar:/home/mwapp/.m2/repository/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar:/home/mwapp/.m2/repository/org/apache/httpcomponents/httpcore/4.4.13/httpcore-4.4.13.jar:/home/mwapp/.m2/repository/commons-io/commons-io/2.8.0/commons-io-2.8.0.jar:/home/mwapp
/.m2/repository/commons-net/commons-net/3.9.0/commons-net-3.9.0.jar:/home/mwapp/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.jar:/home/mwapp/.m2/repository/jakarta/activation/jakarta.activation-api/1.2.1/jakarta.activation-api-1.2.1.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-server/9.4.53.v20231009/jetty-server-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-http/9.4.53.v20231009/jetty-http-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-io/9.4.53.v20231009/jetty-io-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-servlet/9.4.53.v20231009/jetty-servlet-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-security/9.4.53.v20231009/jetty-security-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-util-ajax/9.4.53.v20231009/jetty-util-ajax-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-webapp/9.4.53.v20231009/jetty-webapp-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-xml/9.4.53.v20231009/jetty-xml-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar:/home/mwapp/.m2/repository/com/sun/jersey/jersey-servlet/1.19.4/jersey-servlet-1.19.4.jar:/home/mwapp/.m2/repository/com/sun/jersey/jersey-server/1.19.4/jersey-server-1.19.4.jar:/home/mwapp/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar:/home/mwapp/.m2/repository/ch/qos/reload4j/reload4j/1.2.22/reload4j-1.2.22.jar:/home/mwapp/.m2/repository/commons-beanutils/commons-beanutils/1.9.4/commons-beanutils-1.9.4.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-configuration2/2.8.0/commons-configuration2-2.8.0.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-lang3/3.12.0/commons-lang3-3.12.0.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar:/home/mwapp/.m2/repository/org/apache/avro/avro
/1.7.7/avro-1.7.7.jar:/home/mwapp/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/home/mwapp/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/home/mwapp/.m2/repository/com/thoughtworks/paranamer/paranamer/2.3/paranamer-2.3.jar:/home/mwapp/.m2/repository/com/google/re2j/re2j/1.1/re2j-1.1.jar:/home/mwapp/.m2/repository/com/google/code/gson/gson/2.9.0/gson-2.9.0.jar:/home/mwapp/.m2/repository/org/apache/hadoop/hadoop-auth/3.3.6-13/hadoop-auth-3.3.6-13.jar:/home/mwapp/.m2/repository/com/nimbusds/nimbus-jose-jwt/9.8.1/nimbus-jose-jwt-9.8.1.jar:/home/mwapp/.m2/repository/com/github/stephenc/jcip/jcip-anno
[jira] [Resolved] (YARN-11362) Fix several typos in YARN codebase of misspelled resource
[ https://issues.apache.org/jira/browse/YARN-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11362. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Fix several typos in YARN codebase of misspelled resource > - > > Key: YARN-11362 > URL: https://issues.apache.org/jira/browse/YARN-11362 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Labels: newbie, newbie++, pull-request-available > Fix For: 3.5.0 > > > I noticed that in YARN's codebase, there are several occurrences of "resource" > misspelled as "Resoure".
[jira] [Resolved] (YARN-11653) Add Total_Memory and Total_Vcores columns in Nodes page
[ https://issues.apache.org/jira/browse/YARN-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11653. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Add Total_Memory and Total_Vcores columns in Nodes page > > > Key: YARN-11653 > URL: https://issues.apache.org/jira/browse/YARN-11653 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Currently, the RM nodes page includes used and available memory/vcore, but > it lacks a total column, which is not intuitive enough for users. When the > resource capacities of nodes in the cluster vary widely, we may need to sort > the nodes to facilitate the comparison of metrics among the same types of nodes. > Therefore, it is necessary to add columns for total CPU/memory on the nodes > page.
[jira] [Resolved] (YARN-11650) Refactoring variable names related to multiNodePolicy in MultiNodePolicySpec, FiCaSchedulerApp and AbstractCSQueue
[ https://issues.apache.org/jira/browse/YARN-11650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11650. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > Refactoring variable names related multiNodePolicy in MultiNodePolicySpec, > FiCaSchedulerApp and AbstractCSQueue > --- > > Key: YARN-11650 > URL: https://issues.apache.org/jira/browse/YARN-11650 > Project: Hadoop YARN > Issue Type: Improvement > Components: RM >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In classes related to MultiNodePolicy support, some variable names do not > accurately reflect their true meanings. For instance, a variable named > *queue* is actually representing the class name of a policy, and a variable > named *policyName* denotes the class name of the policy. This may cause > confusion for the readers of the code.
[jira] [Resolved] (YARN-11138) TestRouterWebServicesREST Junit Test Error Fix
[ https://issues.apache.org/jira/browse/YARN-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11138. --- Hadoop Flags: (was: Reviewed) Resolution: Duplicate > TestRouterWebServicesREST Junit Test Error Fix > -- > > Key: YARN-11138 > URL: https://issues.apache.org/jira/browse/YARN-11138 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, test >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > > [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 28.818 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST > [ERROR] org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST > Time elapsed: 28.817 s <<< FAILURE! > java.lang.AssertionError: Web app not running > at org.junit.Assert.fail(Assert.java:89) > at > org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.waitWebAppRunning(TestRouterWebServicesREST.java:199) > at > org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.setUp(TestRouterWebServicesREST.java:217) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
[jira] [Resolved] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
[ https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10889. -- Fix Version/s: 3.4.0 Target Version/s: 3.4.0 Resolution: Fixed > [Umbrella] Queue Creation in Capacity Scheduler - Tech debts > > > Key: YARN-10889 > URL: https://issues.apache.org/jira/browse/YARN-10889 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > > Follow-up of YARN-10496
[jira] [Resolved] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11041. -- Resolution: Fixed > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of the queue suggests a CSQueue object.| > |
[jira] [Created] (YARN-11653) Add Total_Memory and Total_Vcores columns in Nodes page
Jiandan Yang created YARN-11653: Summary: Add Total_Memory and Total_Vcores columns in Nodes page Key: YARN-11653 URL: https://issues.apache.org/jira/browse/YARN-11653 Project: Hadoop YARN Issue Type: New Feature Reporter: Jiandan Yang Assignee: Jiandan Yang Currently, the RM nodes page includes used and available memory/vcore, but it lacks a total column, which is not intuitive enough for users. When the resource capacities of nodes in the cluster vary widely, we may need to sort the nodes to facilitate the comparison of metrics among the same types of nodes. Therefore, it is necessary to add columns for total CPU/memory on the nodes page.
[jira] [Resolved] (YARN-10888) [Umbrella] New capacity modes for CS
[ https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-10888. -- Resolution: Fixed > [Umbrella] New capacity modes for CS > > > Key: YARN-10888 > URL: https://issues.apache.org/jira/browse/YARN-10888 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Benjamin Teke >Priority: Major > Fix For: 3.4.0 > > Attachments: capacity_scheduler_queue_capacity.pdf > > > *Investigate how resource allocation configuration could be more consistent > in CapacityScheduler* > It would be nice if everywhere where a capacity can be defined could be > defined the same way: > * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU) > * With percentages > ** Percentage of all resources (eg 10% of all memory, vcore, GPU) > ** Percentage per resource type (eg 10% memory, 25% vcore, 50% GPU) > * Allow mixing different modes under one hierarchy but not under the same > parent queues. > We need to determine all configuration options where capacities can be > defined, and see if it is possible to extend the configuration, or if it > makes sense in that case.
[jira] [Created] (YARN-11652) [Umbrella] Follow-up after YARN-10888/YARN-10889
Benjamin Teke created YARN-11652: Summary: [Umbrella] Follow-up after YARN-10888/YARN-10889 Key: YARN-11652 URL: https://issues.apache.org/jira/browse/YARN-11652 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.5.0 Reporter: Benjamin Teke Assignee: Benjamin Teke Follow-up improvements after the changes in YARN-10888/YARN-10889.
[jira] [Resolved] (YARN-11651) Fix UT TestQueueCapacityConfigParser compile error
[ https://issues.apache.org/jira/browse/YARN-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang resolved YARN-11651. -- Resolution: Invalid > Fix UT TestQueueCapacityConfigParser compile error > -- > > Key: YARN-11651 > URL: https://issues.apache.org/jira/browse/YARN-11651 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Labels: pull-request-available > > The following error is reported during compilation: > {code:java} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.10.1:testCompile > (default-testCompile) on project hadoop-yarn-server-resourcemanager: > Compilation failure > [ERROR] > /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6490/ubuntu-focal/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestQueueCapacityConfigParser.java:[224,80] > incompatible types: java.lang.String cannot be converted to > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueuePath > [ERROR] -> [Help 1] > {code} > This is caused by YARN-11041
[jira] [Created] (YARN-11651) Fix UT TestQueueCapacityConfigParser
Jiandan Yang created YARN-11651: Summary: Fix UT TestQueueCapacityConfigParser Key: YARN-11651 URL: https://issues.apache.org/jira/browse/YARN-11651 Project: Hadoop YARN Issue Type: Bug Reporter: Jiandan Yang
[jira] [Resolved] (YARN-11041) Replace all occurrences of queuePath with the new QueuePath class - followup
[ https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11041. -- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Replace all occurences of queuePath with the new QueuePath class - followup > --- > > Key: YARN-11041 > URL: https://issues.apache.org/jira/browse/YARN-11041 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Tibor Kovács >Assignee: Peter Szucs >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The QueuePath class was introduced in YARN-10897, however, its current > adoption happened only for code changes after this JIRA. We need to adopt it > retrospectively. > > A lot of changes are introduced via ticket YARN-10982. The replacing should > be continued by touching the next comments: > > [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937] > I think this could be also refactored in a follow-up jira so the string magic > could probably be replaced with some more elegant solution. 
Though, I think > this would be too much in this patch, hence I do suggest the follow-up jira.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096] > [~bteke] [ |https://github.com/9uapaw] [~gandras] [ > \|https://github.com/9uapaw] Thoughts?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750] > +1, even the QueuePath object could have some kind of support for this.| > |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244] > Agreed, let's handle it in a followup!| > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717] > There are many string operations in this class: > E.g. * getQueuePrefix that works with the full queue path > * getNodeLabelPrefix that also works with the full queue path| > I suggest to create a static class, called "QueuePrefixes" or something like > that and add some static methods there to convert the QueuePath object to > those various queue prefix strings that are ultimately keys in the > Configuration object. > > > > [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a] > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119] > This seems hacky, just based on the constructor parameter names of QueuePath: > parent, leaf. > The AQC Template prefix is not the leaf, obviously. 
> Could we somehow circumvent this?| > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207] > Maybe a factory method could be created, which returns a new QueuePath with > the parent set as the original queuePath. I.e > rootQueuePath.createChild(String childName) -> this could return a new > QueuePath object with root.childName path, and rootQueuePath as parent.| > |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033] > Looking at this getQueues method, I realized almost all the callers are using > some kind of string magic that should be addressed with this patch. > For example, take a look at: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue > I think getQueues should also receive the QueuePath object instead of > Strings.| > > > > [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b] > |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967] > Nit: Gets the queue path object. > The object of
[jira] [Resolved] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
[ https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke resolved YARN-11645. -- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix flaky json assert tests in TestRMWebServices > > > Key: YARN-11645 > URL: https://issues.apache.org/jira/browse/YARN-11645 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.5.0 >Reporter: Tamas Domok >Assignee: Tamas Domok >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > TestRMWebServicesCapacitySchedDynamicConfig and > TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the > queue order.
[jira] [Created] (YARN-11650) refactor policyName to policyClassName
Jiandan Yang created YARN-11650: Summary: refactor policyName to policyClassName Key: YARN-11650 URL: https://issues.apache.org/jira/browse/YARN-11650 Project: Hadoop YARN Issue Type: Improvement Components: RM Reporter: Jiandan Yang In classes related to MultiNodePolicy support, some variable names do not accurately reflect their true meanings. For instance, a variable named *queue* is actually representing the class name of a policy, and a variable named *policyName* denotes the class name of the policy.
[jira] [Created] (YARN-11649) YARN Federation getNewApplication returns different maxresourcecapability
Jeffrey Chang created YARN-11649: Summary: YARN Federation getNewApplication returns different maxresourcecapability Key: YARN-11649 URL: https://issues.apache.org/jira/browse/YARN-11649 Project: Hadoop YARN Issue Type: Bug Reporter: Jeffrey Chang Assignee: Jeffrey Chang When getNewApplication is called against YARN Router with Federation on, it's possible we get different maxResourceCapabilities on different calls. This is because getNewApplication is called against a random cluster on each call, which may return a different maxResourceCapability based on the cluster that the call is executed on.
[jira] [Resolved] (YARN-11607) TestTimelineAuthFilterForV2 fails intermittently
[ https://issues.apache.org/jira/browse/YARN-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11607. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > TestTimelineAuthFilterForV2 fails intermittently > - > > Key: YARN-11607 > URL: https://issues.apache.org/jira/browse/YARN-11607 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Ayush Saxena >Assignee: Susheel Gupta >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Ref: > https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1398/testReport/junit/org.apache.hadoop.yarn.server.timelineservice.security/TestTimelineAuthFilterForV2/testPutTimelineEntities_boolean__boolean__3_/ > {noformat} > org.opentest4j.AssertionFailedError: expected: <2> but was: <1> > at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55) > at > org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62) > at > org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150) > at > org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145) > at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527) > at > org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishAndVerifyEntity(TestTimelineAuthFilterForV2.java:324) > at > org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishWithRetries(TestTimelineAuthFilterForV2.java:337) > at > org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:383) > {noformat}
[jira] [Created] (YARN-11648) CapacityScheduler does not activate applications when resources are released from another Leaf Queue
Brian Goerlitz created YARN-11648: - Summary: CapacityScheduler does not activate applications when resources are released from another Leaf Queue Key: YARN-11648 URL: https://issues.apache.org/jira/browse/YARN-11648 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Brian Goerlitz Create a queue with low minimum capacity and high maximum capacity. If multiple apps are submitted to the queue such that the Queue's Max AM resource limit is exceeded while other cluster resources are consumed by different queues, these apps will not be considered for activation when cluster resources from the other queues are freed. As the AM limit is calculated based on available resources for the queue, these apps should be activated.
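A back-of-the-envelope sketch of the arithmetic behind the report above (the numbers and the `amLimitMb` helper are invented for illustration, not taken from the YARN code): since the AM resource limit is a percentage of the resources the queue can currently use, the limit should grow when other queues release resources, and pending applications should then be re-evaluated for activation.

```java
// Illustrative only: with am-resource-percent = 0.1, a queue squeezed to
// 10 GB can activate ~1 GB of AMs, but once other queues free resources and
// the queue may grow to 100 GB, the same percentage allows ~10 GB of AMs.
public class AmLimitSketch {
    static long amLimitMb(long queueAvailableMb, double amResourcePercent) {
        return (long) (queueAvailableMb * amResourcePercent);
    }

    public static void main(String[] args) {
        System.out.println(amLimitMb(10_240, 0.1));  // while other queues hold the resources
        System.out.println(amLimitMb(102_400, 0.1)); // after the other queues release them
    }
}
```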
[jira] [Resolved] (YARN-11647) more places to use StandardCharsets
[ https://issues.apache.org/jira/browse/YARN-11647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11647. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Resolution: Fixed > more places to use StandardCharsets > --- > > Key: YARN-11647 > URL: https://issues.apache.org/jira/browse/YARN-11647 > Project: Hadoop YARN > Issue Type: Task > Components: yarn >Reporter: PJ Fanning >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > A few instances missed in HADOOP-18957
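For readers unfamiliar with this cleanup, a minimal sketch of the pattern HADOOP-18957/YARN-11647 replace (the snippet is illustrative, not code from the patch): charset-name strings force callers to handle a checked `UnsupportedEncodingException`, while the `StandardCharsets` constants cannot fail.

```java
import java.nio.charset.StandardCharsets;

public class CharsetsSketch {
    public static void main(String[] args) throws Exception {
        // Old style: charset named by string, declares UnsupportedEncodingException.
        byte[] bytes = "yarn".getBytes("UTF-8");
        // Preferred style: compile-time constant, no checked exception.
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(s);
    }
}
```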
[jira] [Resolved] (YARN-11638) [GPG] GPG Support CLI.
[ https://issues.apache.org/jira/browse/YARN-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11638. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > [GPG] GPG Support CLI. > -- > > Key: YARN-11638 > URL: https://issues.apache.org/jira/browse/YARN-11638 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > We will add a set of command lines to GPG so that GPG can better refresh the > policy and provide some other convenient functions.
[jira] [Created] (YARN-11646) QueueCapacityConfigParser shouldn't ignore capacity config with 0 memory
Tamas Domok created YARN-11646: -- Summary: QueueCapacityConfigParser shouldn't ignore capacity config with 0 memory Key: YARN-11646 URL: https://issues.apache.org/jira/browse/YARN-11646 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.4.0 Reporter: Tamas Domok Assignee: Tamas Domok There is no reason to ignore the configured capacity if the memory is 0 in the configuration. It makes it impossible to configure a zero absolute resource capacity. Example: {noformat} root.default.capacity=[memory=0, vcores=0] root.default.maximum-capacity=[memory=2048, vcores=2] {noformat}
[jira] [Created] (YARN-11645) Fix flaky json assert tests in TestRMWebServices
Tamas Domok created YARN-11645: -- Summary: Fix flaky json assert tests in TestRMWebServices Key: YARN-11645 URL: https://issues.apache.org/jira/browse/YARN-11645 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.4.0 Reporter: Tamas Domok Assignee: Tamas Domok TestRMWebServicesCapacitySchedDynamicConfig and TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the queue order.
[jira] [Created] (YARN-11644) LogAggregationService can't upload log in time when application finished
Xie YiFan created YARN-11644: Summary: LogAggregationService can't upload log in time when application finished Key: YARN-11644 URL: https://issues.apache.org/jira/browse/YARN-11644 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Reporter: Xie YiFan Assignee: Xie YiFan Attachments: image-2024-01-10-11-03-57-553.png LogAggregationService is responsible for uploading logs to HDFS. It uses a thread pool to execute upload tasks. The workflow of uploading logs is as follows: # NM constructs an Application object when the first container of a certain application launches, then notifies LogAggregationService to init AppLogAggregationImpl. # LogAggregationService submits AppLogAggregationImpl to the task queue. # An idle worker of the thread pool pulls AppLogAggregationImpl from the task queue. # AppLogAggregationImpl loops to check the application state and uploads the logs once the application has finished. Suppose the following scenario: * LogAggregationService initializes its thread pool with 4 threads. * 4 long-running applications start on this NM, so all threads are occupied by aggregators. * The next, short application starts on this NM and quickly finishes, but there is no idle thread to upload its logs. As a result, the following applications have to wait for the previous applications to finish before uploading their logs. !image-2024-01-10-11-03-57-553.png|width=599,height=195! h4. Solution Change the spin behavior of AppLogAggregationImpl. If the application has not finished, the task simply returns to yield the current thread and resubmits itself to the executor service. This lets LogAggregationService rotate through the task queue, so the logs of finished applications can be uploaded immediately.
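The proposed "resubmit instead of spin" behaviour can be sketched as follows. This is a self-contained illustration of the scheduling pattern only; `AppLogAggregator`, `appFinished`, and the printed messages are invented for the sketch and are not the actual YARN classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class LogAggregationSketch {

    // A task for an unfinished application re-queues itself instead of
    // spinning, so a pool thread is never pinned by a long-running app.
    static class AppLogAggregator implements Runnable {
        final ExecutorService pool;
        final String appId;
        final BooleanSupplier appFinished;
        final CountDownLatch uploaded;

        AppLogAggregator(ExecutorService pool, String appId,
                         BooleanSupplier appFinished, CountDownLatch uploaded) {
            this.pool = pool;
            this.appId = appId;
            this.appFinished = appFinished;
            this.uploaded = uploaded;
        }

        @Override
        public void run() {
            if (!appFinished.getAsBoolean()) {
                pool.submit(this); // yield the worker; try again later
                return;
            }
            System.out.println("uploaded logs for " + appId);
            uploaded.countDown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1); // deliberately tiny
        CountDownLatch uploaded = new CountDownLatch(2);
        AtomicInteger polls = new AtomicInteger();

        // The "long-running" app only finishes after a few polls;
        // the short app has already finished.
        pool.submit(new AppLogAggregator(pool, "app_long",
                () -> polls.incrementAndGet() >= 3, uploaded));
        pool.submit(new AppLogAggregator(pool, "app_short",
                () -> true, uploaded));

        // Both uploads complete even with a single pool thread:
        // app_short is not starved behind the still-running app_long.
        uploaded.await(5, TimeUnit.SECONDS);
        pool.shutdown();
    }
}
```

With the spin behaviour, the single worker would sit in `app_long`'s loop and `app_short`'s logs would wait; with resubmission, the short application's upload runs as soon as the worker cycles back to the queue.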
[jira] [Created] (YARN-11643) Skip unnecessary pre-check in Multi Node Placement
Xie YiFan created YARN-11643: Summary: Skip unnecessary pre-check in Multi Node Placement Key: YARN-11643 URL: https://issues.apache.org/jira/browse/YARN-11643 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Xie YiFan Assignee: Xie YiFan When Multi Node Placement is enabled, RegularContainerAllocator loops over the candidate node set to find one node to allocate for a given scheduler key. Before allocating, a pre-check is called to verify that the current node passes all checks. If the node does not, the allocator just continues to the next node. {code:java} if (reservedContainer == null) { result = preCheckForNodeCandidateSet(node, schedulingMode, resourceLimits, schedulerKey); if (null != result) { continue; } } {code} But some checks are related to the scheduler key or the application and return PRIORITY_SKIPPED or APP_SKIPPED. This means that if the first node fails such a check, the following nodes cannot pass it either. If the cluster has 5000 nodes in the default partition, the scheduler wastes 5000 loop iterations for just one scheduler key.
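The optimisation can be sketched as follows (illustrative only, not the real RegularContainerAllocator code; `preCheck` and the enum values are simplified stand-ins): results like PRIORITY_SKIPPED or APP_SKIPPED depend only on the scheduler key or application, not on the node, so once one is seen the loop can stop instead of re-checking every remaining candidate.

```java
import java.util.Arrays;
import java.util.List;

public class PreCheckSketch {
    enum CheckResult { NONE, NODE_SKIPPED, PRIORITY_SKIPPED, APP_SKIPPED }

    // Hypothetical node-independent failure: the application is over a limit,
    // which no choice of node can fix.
    static CheckResult preCheck(String node, boolean appOverLimit) {
        return appOverLimit ? CheckResult.APP_SKIPPED : CheckResult.NONE;
    }

    static int nodesVisited(List<String> candidates, boolean appOverLimit) {
        int visited = 0;
        for (String node : candidates) {
            visited++;
            CheckResult r = preCheck(node, appOverLimit);
            if (r == CheckResult.PRIORITY_SKIPPED || r == CheckResult.APP_SKIPPED) {
                break;    // node-independent: no later node can pass either
            }
            if (r != CheckResult.NONE) {
                continue; // node-specific failure: try the next node
            }
            // ... would attempt allocation on this node ...
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> cluster = Arrays.asList("n1", "n2", "n3", "n4", "n5");
        // With the early break only one node is inspected instead of all five.
        System.out.println(nodesVisited(cluster, true));
        System.out.println(nodesVisited(cluster, false));
    }
}
```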
[jira] [Resolved] (YARN-11553) Change the time unit of scCleanerIntervalMs in Router
[ https://issues.apache.org/jira/browse/YARN-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11553. --- Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Change the time unit of scCleanerIntervalMs in Router > - > > Key: YARN-11553 > URL: https://issues.apache.org/jira/browse/YARN-11553 > Project: Hadoop YARN > Issue Type: Improvement > Components: router >Reporter: WangYuanben >Assignee: WangYuanben >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Attachments: image-2023-08-19-16-13-41-956.png > > > The time unit of scCleanerIntervalMs is written as TimeUnit.MINUTES, > resulting in overly long cleaning intervals. > !image-2023-08-19-16-13-41-956.png|width=561,height=82! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
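The impact of such a unit mix-up can be illustrated in isolation (a generic example, not the Router code itself): a value intended as milliseconds that is scheduled with TimeUnit.MINUTES yields an interval 60,000 times longer than intended.

```java
import java.util.concurrent.TimeUnit;

// A config named "...IntervalMs" carries milliseconds; scheduling it with
// TimeUnit.MINUTES stretches the interval by a factor of 60_000.
class IntervalUnitSketch {
    static long asMillis(long value, TimeUnit unit) {
        return unit.toMillis(value);
    }
}
```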
[jira] [Resolved] (YARN-11556) Let Federation.md more standardized
[ https://issues.apache.org/jira/browse/YARN-11556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11556. --- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Let Federation.md more standardized > --- > > Key: YARN-11556 > URL: https://issues.apache.org/jira/browse/YARN-11556 > Project: Hadoop YARN > Issue Type: Improvement > Components: documentation >Reporter: WangYuanben >Assignee: WangYuanben >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11631) [GPG] Add GPGWebServices
[ https://issues.apache.org/jira/browse/YARN-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11631. --- Fix Version/s: 3.4.0 Resolution: Fixed > [GPG] Add GPGWebServices > > > Key: YARN-11631 > URL: https://issues.apache.org/jira/browse/YARN-11631 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11642) Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities
[ https://issues.apache.org/jira/browse/YARN-11642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11642. --- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Target Version/s: 3.4.0 Resolution: Fixed > Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities > -- > > Key: YARN-11642 > URL: https://issues.apache.org/jira/browse/YARN-11642 > Project: Hadoop YARN > Issue Type: Improvement > Components: timelineservice >Affects Versions: 3.5.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Our current unit tests are all executed in parallel. > TestTimelineAuthFilterForV2#testPutTimelineEntities will report an error > during execution: > {code:java} > [main] collector.PerNodeTimelineCollectorsAuxService > (StringUtils.java:startupShutdownMessage(755)) - failed to register any UNIX > signal loggers: > java.lang.IllegalStateException: Can't re-install the signal handlers. > {code} > We can solve this problem by changing static initialization to new Object. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-11642) Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities
Shilun Fan created YARN-11642: - Summary: Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities Key: YARN-11642 URL: https://issues.apache.org/jira/browse/YARN-11642 Project: Hadoop YARN Issue Type: Improvement Components: timelineservice Affects Versions: 3.5.0 Reporter: Shilun Fan Assignee: Shilun Fan Our current unit tests are all executed in parallel. TestTimelineAuthFilterForV2#testPutTimelineEntities will report an error during execution: {code:java} [main] collector.PerNodeTimelineCollectorsAuxService (StringUtils.java:startupShutdownMessage(755)) - failed to register any UNIX signal loggers: java.lang.IllegalStateException: Can't re-install the signal handlers. {code} We can solve this problem by changing static initialization to new Object. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
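The failure mode and the fix can be modeled like this (a toy stand-in, since the real one-shot state lives in the JVM's signal handling, not in Hadoop code): static, JVM-global initialization fails when a second parallel test in the same JVM triggers it again, while per-instance initialization does not collide.

```java
// Toy model of a JVM-global, one-shot initialization such as signal-handler
// registration: a static flag makes the second test in the same JVM fail.
class OneShotGlobal {
    private static boolean installed = false;

    static synchronized void installStatically() {
        if (installed) {
            throw new IllegalStateException("Can't re-install the signal handlers.");
        }
        installed = true;
    }
}

// Per-object initialization: each test creates its own instance, so parallel
// or repeated tests never collide on shared static state.
class PerInstanceInit {
    private boolean installed = false;

    synchronized void install() {
        if (installed) {
            throw new IllegalStateException("already installed");
        }
        installed = true;
    }
}
```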
[jira] [Created] (YARN-11641) Can't update a queue hierarchy in absolute mode when the configured capacities are zero
Tamas Domok created YARN-11641: -- Summary: Can't update a queue hierarchy in absolute mode when the configured capacities are zero Key: YARN-11641 URL: https://issues.apache.org/jira/browse/YARN-11641 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.4.0 Reporter: Tamas Domok Assignee: Tamas Domok h2. Error symptoms It is not possible to modify a queue hierarchy in absolute mode when the parent or every child queue of the parent has 0 min resource configured. {noformat}
2024-01-05 15:38:59,016 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager: Initialized queue: root.a.c
2024-01-05 15:38:59,016 ERROR org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: Exception thrown when modifying configuration.
java.io.IOException: Failed to re-init queues : Parent=root.a: When absolute minResource is used, we must make sure both parent and child all use absolute minResource
{noformat} h2. Reproduction capacity-scheduler.xml: {code:xml}
<property><name>yarn.scheduler.capacity.root.queues</name><value>default,a</value></property>
<property><name>yarn.scheduler.capacity.root.capacity</name><value>[memory=40960, vcores=16]</value></property>
<property><name>yarn.scheduler.capacity.root.default.capacity</name><value>[memory=1024, vcores=1]</value></property>
<property><name>yarn.scheduler.capacity.root.default.maximum-capacity</name><value>[memory=1024, vcores=1]</value></property>
<property><name>yarn.scheduler.capacity.root.a.capacity</name><value>[memory=0, vcores=0]</value></property>
<property><name>yarn.scheduler.capacity.root.a.maximum-capacity</name><value>[memory=39936, vcores=15]</value></property>
<property><name>yarn.scheduler.capacity.root.a.queues</name><value>b,c</value></property>
<property><name>yarn.scheduler.capacity.root.a.b.capacity</name><value>[memory=0, vcores=0]</value></property>
<property><name>yarn.scheduler.capacity.root.a.b.maximum-capacity</name><value>[memory=39936, vcores=15]</value></property>
<property><name>yarn.scheduler.capacity.root.a.c.capacity</name><value>[memory=0, vcores=0]</value></property>
<property><name>yarn.scheduler.capacity.root.a.c.maximum-capacity</name><value>[memory=39936, vcores=15]</value></property>
{code} updatequeue.xml: {code:xml}
<sched-conf>
  <update-queue>
    <queue-name>root.a</queue-name>
    <params>
      <entry><key>capacity</key><value>[memory=1024,vcores=1]</value></entry>
      <entry><key>maximum-capacity</key><value>[memory=39936,vcores=15]</value></entry>
    </params>
  </update-queue>
</sched-conf>
{code} {code} $ curl -X PUT -H 'Content-Type: application/xml' -d @updatequeue.xml 
http://localhost:8088/ws/v1/cluster/scheduler-conf\?user.name\=yarn Failed to re-init queues : Parent=root.a: When absolute minResource is used, we must make sure both parent and child all use absolute minResource {code} h2. Root cause setChildQueues is called during reinit, where: {code:java}
void setChildQueues(Collection<CSQueue> childQueues) throws IOException {
  writeLock.lock();
  try {
    boolean isLegacyQueueMode = queueContext.getConfiguration().isLegacyQueueMode();
    if (isLegacyQueueMode) {
      QueueCapacityType childrenCapacityType =
          getCapacityConfigurationTypeForQueues(childQueues);
      QueueCapacityType parentCapacityType =
          getCapacityConfigurationTypeForQueues(ImmutableList.of(this));

      if (childrenCapacityType == QueueCapacityType.ABSOLUTE_RESOURCE
          || parentCapacityType == QueueCapacityType.ABSOLUTE_RESOURCE) {
        // We don't allow any mixed absolute + {weight, percentage} between
        // children and parent
        if (childrenCapacityType != parentCapacityType && !this.getQueuePath()
            .equals(CapacitySchedulerConfiguration.ROOT)) {
          throw new IOException("Parent=" + this.getQueuePath()
              + ": When absolute minResource is used, we must make sure both "
              + "parent and child all use absolute minResource");
        }
{code} The parent or children capacity type is treated as PERCENTAGE because getCapacityConfigurationTypeForQueues fails to detect absolute mode here: {code:java}
if (!queue.getQueueResourceQuotas().getConfiguredMinResource(nodeLabel)
    .equals(Resources.none())) {
  absoluteMinResSet = true;
{code} h2. 
Possible fixes Possible fix in AbstractParentQueue.getCapacityConfigurationTypeForQueues using the capacityVector: {code:java}
for (CSQueue queue : queues) {
  for (String nodeLabel : queueCapacities.getExistingNodeLabels()) {
    Set<QueueCapacityVector.ResourceUnitCapacityType> definedCapacityTypes =
        queue.getConfiguredCapacityVector(nodeLabel).getDefinedCapacityTypes();
    if (definedCapacityTypes.size() == 1) {
      QueueCapacityVector.ResourceUnitCapacityType next =
          definedCapacityTypes.iterator().next();
      if (Objects.requireNonNull(next) == PERCENTAGE) {
        percentageIsSet = true;
        diagMsg.append("{Queue=").append(queue.getQueuePath()).append(", label=").append(nodeLabel)
            .append(" uses percentage mode}. ");
      } else if (next == QueueCapacityVector.ResourceUnitCapacityType.ABSOLUTE) {
        absoluteMinResSet = true;
        // ...
{code}
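The root cause reduces to a representation problem: an absolute minimum resource of [memory=0, vcores=0] compares equal to Resources.none(), so an equals-none test cannot tell "absolute mode with zero capacity" apart from "no resource configured". A minimal model of that flaw, using a hypothetical Resource type rather than the Hadoop class:

```java
import java.util.Objects;

// Minimal model: a zero-valued absolute min resource compares equal to "none",
// so an equals(none) check misdetects the queue as percentage mode.
class ZeroMinResourceSketch {
    record Resource(long memory, int vcores) {}
    static final Resource NONE = new Resource(0, 0);

    // Mimics the flawed check: "absolute mode iff min resource != none".
    static boolean detectedAsAbsolute(Resource configuredMin) {
        return !Objects.equals(configuredMin, NONE);
    }
}
```

This is why the capacity-vector approach above works: it inspects the configured capacity *type* directly instead of inferring it from the (possibly zero) value.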
[jira] [Created] (YARN-11640) capacity scheduler supports application priority with FairOrderingPolicy
Ming Chen created YARN-11640: Summary: capacity scheduler supports application priority with FairOrderingPolicy Key: YARN-11640 URL: https://issues.apache.org/jira/browse/YARN-11640 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Ming Chen -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-2098) App priority support in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-2098. -- Target Version/s: (was: 3.5.0) Resolution: Done > App priority support in Fair Scheduler > -- > > Key: YARN-2098 > URL: https://issues.apache.org/jira/browse/YARN-2098 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Reporter: Ashwin Shankar >Priority: Major > Labels: pull-request-available > Attachments: YARN-2098.patch, YARN-2098.patch > > > This jira is created for supporting app priorities in fair scheduler. > AppSchedulable hard codes priority of apps to 1, we should change this to get > priority from ApplicationSubmissionContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-11529) Add metrics for ContainerMonitorImpl.
[ https://issues.apache.org/jira/browse/YARN-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11529. --- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Add metrics for ContainerMonitorImpl. > - > > Key: YARN-11529 > URL: https://issues.apache.org/jira/browse/YARN-11529 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.4.0 >Reporter: Xianming Lei >Assignee: Xianming Lei >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > > In our production environment, we have ample machine resources and a > significant number of active Containers. However, the MonitoringThread in > ContainerMonitorImpl experiences significant latency during each execution. > To address this, it is highly recommended to incorporate metrics for > monitoring the duration of this time-consuming process. > > > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
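A duration metric for such a periodic loop can be as simple as the following generic sketch (illustrative names, not the ContainersMonitorImpl API):

```java
import java.util.concurrent.TimeUnit;

// Generic sketch: time each pass of a monitoring loop so slow passes become
// visible as a metric instead of only as scheduling lag.
class MonitorDurationMetric {
    private long lastRunMillis = -1;

    void runOnce(Runnable monitoringPass) {
        long start = System.nanoTime();
        monitoringPass.run();
        lastRunMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    long lastRunMillis() { return lastRunMillis; }
}
```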
[jira] [Created] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy
Ferenc Erdelyi created YARN-11639: - Summary: ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy Key: YARN-11639 URL: https://issues.apache.org/jira/browse/YARN-11639 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Ferenc Erdelyi When dynamic queue creation is enabled in weight mode and the deletion policy coincides with the PriorityQueueResourcesForSorting, RM stops assigning resources because of either a ConcurrentModificationException or an NPE in PriorityUtilizationQueueOrderingPolicy. Reproduced the NPE issue in Java 8 and Java 11 environments: {code:java} ... INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removing queue: root.dyn.PmvkMgrEBQppu 2024-01-02 17:00:59,399 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-11,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225) at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605) {code} Observed the ConcurrentModificationException in Java8 environment, but could not reproduce yet: {code:java} 2023-10-27 02:50:37,584 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread Thread[Thread-15,5, main] threw an Exception. 
java.util.ConcurrentModificationException at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260) {code} The immediate (temporary) remedy to keep the cluster going is to restart the RM. The workaround is to disable the deletion of dynamically created child queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr
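The underlying hazard can be reproduced in isolation with plain Java collections (a generic demo, not the scheduler code): iterating a live list while it is structurally modified fails fast with ConcurrentModificationException, while iterating a snapshot copy, the usual defensive fix, does not.

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates the hazard: traversing a live list while it is structurally
// modified fails fast, whereas traversing a snapshot copy is safe.
class SnapshotIterationSketch {
    static int iterateLive(List<String> queues) {
        int visited = 0;
        for (String q : queues) {   // iterator checks modCount on every step
            queues.remove(0);       // structural modification during traversal
            visited++;
        }
        return visited;
    }

    static int iterateSnapshot(List<String> queues) {
        int visited = 0;
        for (String q : new ArrayList<>(queues)) { // snapshot: original may mutate freely
            queues.remove(0);
            visited++;
        }
        return visited;
    }
}
```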
[jira] [Resolved] (YARN-11632) [Doc] Add allow-partial-result description to Yarn Federation documentation
[ https://issues.apache.org/jira/browse/YARN-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan resolved YARN-11632. --- Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > [Doc] Add allow-partial-result description to Yarn Federation documentation > --- > > Key: YARN-11632 > URL: https://issues.apache.org/jira/browse/YARN-11632 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Add allow-partial-result description to Yarn Federation documentation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org