[jira] [Created] (YARN-11704) Avoid nested 'AND' placement constraint for non tags in scheduling request

2024-07-11 Thread Junfan Zhang (Jira)
Junfan Zhang created YARN-11704:
---

 Summary: Avoid nested 'AND' placement constraint for non tags in 
scheduling request
 Key: YARN-11704
 URL: https://issues.apache.org/jira/browse/YARN-11704
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler
Reporter: Junfan Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11703) Validate accessibility of Node Manager working directories

2024-06-26 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11703:
---

 Summary: Validate accessibility of Node Manager working directories
 Key: YARN-11703
 URL: https://issues.apache.org/jira/browse/YARN-11703
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.5.0
Reporter: Bence Kosztolnik
Assignee: Bence Kosztolnik


h3. Problem:

If a subdirectory or file under *yarn.nodemanager.local-dirs* or 
*yarn.nodemanager.log-dirs* has its permissions changed so that it is no longer 
accessible to the node manager, the node manager will not turn unhealthy, but 
container runs will fail.
h3. Testing:
 - run an example PI job in a cluster
 - make the user's usercache directory unreadable by the node manager, for example:
{noformat}
chmod 222 ./usercache/{user}
{noformat}

 - cluster state will stay healthy
 - re-run the PI job
 - containers will fail on the affected node, with

{noformat}
... Not able to initialize app-cache directories in any of the configured local 
directories for user ...{noformat}

h3. Solution:

Add an extra validation to DirectoryCollection#testDirs to ensure the contents of 
the local-dirs and log-dirs are accessible by the node manager, and turn the node 
unhealthy if they are not.
A new flag will be introduced to enable this validation: 
*yarn.nodemanager.working-dir-content-accessibility-validation.enabled* 
(default: true). A rough sketch of such a check follows.
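
A self-contained sketch of the kind of accessibility check this could add (class 
and method names here are illustrative, not the actual DirectoryCollection code):
{code:java}
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class DirContentAccessibilityCheck {

  /** Returns the paths under rootDir that the NM process cannot read/traverse. */
  static List<Path> findInaccessiblePaths(String rootDir) throws IOException {
    List<Path> bad = new ArrayList<>();
    Files.walkFileTree(Paths.get(rootDir), new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
        // A directory must be readable and executable to list and enter it.
        if (!Files.isReadable(dir) || !Files.isExecutable(dir)) {
          bad.add(dir);
          return FileVisitResult.SKIP_SUBTREE;
        }
        return FileVisitResult.CONTINUE;
      }
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        if (!Files.isReadable(file)) {
          bad.add(file);
        }
        return FileVisitResult.CONTINUE;
      }
      @Override
      public FileVisitResult visitFileFailed(Path file, IOException exc) {
        bad.add(file); // e.g. permission denied while statting the path
        return FileVisitResult.CONTINUE;
      }
    });
    return bad;
  }

  public static void main(String[] args) throws IOException {
    List<Path> bad = findInaccessiblePaths(args[0]);
    if (!bad.isEmpty()) {
      System.err.println("Inaccessible paths, node should be marked unhealthy: " + bad);
    }
  }
}
{code}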



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11702) Fix Yarn over allocating containers

2024-06-25 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created YARN-11702:


 Summary: Fix Yarn over allocating containers
 Key: YARN-11702
 URL: https://issues.apache.org/jira/browse/YARN-11702
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


*Replication Steps:*

Apache Spark 3.5.1 and Apache Hadoop 3.3.6 (Capacity Scheduler)

 
{code:java}

spark.executor.memory            1024M
spark.driver.memory              2048M
spark.executor.cores             1
spark.executor.instances 20
spark.dynamicAllocation.enabled false{code}
 

Based on this setup, there should be 20 Spark executors, but from the 
ResourceManager (RM) UI I could see that 32 executors were allocated and 12 of 
them were released within seconds. On analyzing the Spark ApplicationMaster (AM) 
logs, the following entries were observed.

 
{code:java}
24/06/24 14:10:14 INFO YarnAllocator: Will request 20 executor container(s) for  
ResourceProfile Id: 0, each with 1 core(s) and 1408 MB memory. with custom 
resources: 
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 8 containers from YARN, 
launching executors on 8 of them.
24/06/24 14:10:14 INFO YarnAllocator: Received 12 containers from YARN, 
launching executors on 4 of them.
24/06/24 14:10:17 INFO YarnAllocator: Received 4 containers from YARN, 
launching executors on 0 of them.

{code}
It was clear from the logs that the 12 extra allocated containers were being 
ignored on the Spark side. In order to debug this further, additional log lines 
were added to the 
[AppSchedulingInfo|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java#L427]
 class where pending container requests are incremented and decremented, to 
expose additional information about each request.

 
{code:java}
2024-06-24 14:10:14,075 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo (IPC 
Server handler 42 on default port 8030): Updates PendingContainers: 0 
Incremented by: 20 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
20 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,077 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
19 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,111 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
18 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
17 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,112 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
16 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
15 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
14 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0, 
containerToUpdate=null} for: appattempt_1719234929152_0004_01
2024-06-24 14:10:14,113 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo 
(SchedulerEventDispatcher:Event Processor): Allocate Updates PendingContainers: 
13 Decremented by: 1 SchedulerRequestKey{priority=0, allocationRequestId=0

[jira] [Created] (YARN-11700) Some applications on the query page display queues, usernames, and application names as null

2024-06-14 Thread zhangzhanchang (Jira)
zhangzhanchang created YARN-11700:
-

 Summary: Some applications on the query page display queues, 
usernames, and application names as null
 Key: YARN-11700
 URL: https://issues.apache.org/jira/browse/YARN-11700
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Reporter: zhangzhanchang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10379) Refactor ContainerExecutor exit code Exception handling

2024-06-11 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi resolved YARN-10379.
---
Resolution: Won't Fix

> Refactor ContainerExecutor exit code Exception handling
> ---
>
> Key: YARN-10379
> URL: https://issues.apache.org/jira/browse/YARN-10379
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Benjamin Teke
>Assignee: Ferenc Erdelyi
>Priority: Minor
>
> Currently, every time a shell command is executed and returns with a 
> non-zero exit code, an exception gets thrown. But along the call tree this 
> exception gets caught and, after some info/warn logging and other processing 
> steps, rethrown, possibly wrapped in another exception. For example:
>  * from PrivilegedOperationExecutor.executePrivilegedOperation - 
> ExitCodeException catch (as IOException), PrivilegedOperationException thrown
>  * then in LinuxContainerExecutor.startLocalizer - 
> PrivilegedOperationException catch, exitCode collection, logging, IOException 
> rethrown
>  * then in ResourceLocalizationService.run - generic Exception catch, but 
> there is a TODO for separate ExitCodeException handling, however that 
> information is only present here in an error message string
> This flow could be simplified and unified across the different executors. For 
> example, use one specific exception until the last possible step, catch it only 
> where necessary, and keep the exit code, as it could be used later in the 
> process. This change could help with maintainability and readability. A rough 
> sketch of such an exception is shown below.
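> A minimal sketch of what a dedicated exit-code exception could look like (the 
> class name here is illustrative, not an actual Hadoop class):
> {code:java}
> import java.io.IOException;
>
> /** Carries the shell exit code along the call tree until it is actually needed. */
> public class ShellExitCodeException extends IOException {
>   private final int exitCode;
>
>   public ShellExitCodeException(String message, int exitCode, Throwable cause) {
>     super(message, cause);
>     this.exitCode = exitCode;
>   }
>
>   public int getExitCode() {
>     return exitCode;
>   }
> }
> {code}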



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11699) Diagnostics lacks userlimit info when user capacity has reached its maximum limit

2024-05-31 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11699.
---
   Fix Version/s: 3.4.1
  3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.1, 3.5.0
Assignee: Jiandan Yang 
  Resolution: Fixed

> Diagnostics lacks userlimit info when user capacity has reached its maximum 
> limit
> -
>
> Key: YARN-11699
> URL: https://issues.apache.org/jira/browse/YARN-11699
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: image-2024-05-29-15-47-53-217.png
>
>
> Capacity Scheduler supports a user limit to prevent a single user from using the 
> whole queue's resources, but when the resources used by a user reach the user 
> limit, the diagnostics on the web page lack the related info, as shown in the 
> figure below. We may need to add the user limit and the resources used by the 
> user to help debugging.
> !image-2024-05-29-15-47-53-217.png|width=831,height=145!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11471) FederationStateStoreFacade Cache Support Caffeine

2024-05-31 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11471.
---
Fix Version/s: 3.4.1
   3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> FederationStateStoreFacade Cache Support Caffeine
> -
>
> Key: YARN-11471
> URL: https://issues.apache.org/jira/browse/YARN-11471
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
>
> FederationStateStoreFacade Cache Support Caffeine



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11699) Diagnostics

2024-05-29 Thread Jiandan Yang (Jira)
Jiandan Yang  created YARN-11699:


 Summary: Diagnostics
 Key: YARN-11699
 URL: https://issues.apache.org/jira/browse/YARN-11699
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Jiandan Yang 






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11698) Finished containers shouldn't be stored indefinitely in the NM state store

2024-05-21 Thread Adam Binford (Jira)
Adam Binford created YARN-11698:
---

 Summary: Finished containers shouldn't be stored indefinitely in 
the NM state store
 Key: YARN-11698
 URL: https://issues.apache.org/jira/browse/YARN-11698
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.4.0
Reporter: Adam Binford


https://issues.apache.org/jira/browse/YARN-4771 updated the container tracking 
in the state store to only remove containers when their application ends, in 
order to make sure all container logs get aggregated even across NM restarts. 
This can lead to a significant number of containers building up in the state 
store and a lot of things to recover. Since this was purely for making sure 
logs get aggregated, it could be done in a smarter way that takes into account 
both rolling log aggregation and the case where log aggregation is not enabled 
at all.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-20 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created YARN-11697:


 Summary: Fix fair scheduler race condition in 
removeApplicationAttempt and moveApplication
 Key: YARN-11697
 URL: https://issues.apache.org/jira/browse/YARN-11697
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.2.1
Reporter: Syed Shameerur Rahman
Assignee: Syed Shameerur Rahman


For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with the 
following exception
{code:java}
2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
(SchedulerEventDispatcher:Event Processor): Error in handling event type 
APP_ATTEMPT_REMOVED to the Event Dispatcher
java.lang.IllegalStateException: Given app to remove 
appattempt_1706879498319_86660_01 Alloc:  does not 
exist in queue [root.tier2.livy, demand=, 
running=, share=, w=1.0]
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
at 
org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
at java.lang.Thread.run(Thread.java:750)
{code}
The exception seems similar to the one mentioned in YARN-5136, but it looks 
like there are still some edge cases not covered by YARN-5136.

1. On a deeper look, I could see that, as mentioned in the comment here, if a 
moveApplication and a removeApplicationAttempt call for the same attempt are 
processed in short succession, the application attempt will still contain a 
queue reference but will already have been removed from the list of 
applications for the queue.

2. This can happen when 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
 removes the appAttempt from the queue and 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
 also tries to remove the same appAttempt from the queue.

3. On further checking, I could see that before doing 
[moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
 a writeLock on the appAttempt is taken, whereas for 
[removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
 I don't see any writeLock being taken, which can result in a race condition if 
the same appAttempt is being processed (see the sketch after point 4).

4. Additionally, as mentioned in the comment here, when such a scenario occurs 
we ideally should not take down the RM.
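
A self-contained sketch of the locking pattern point 3 suggests (the class and 
field names are illustrative, not the actual FairScheduler/FSAppAttempt code):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative sketch only: both the "move" and the "remove attempt" paths take the
 * same per-attempt write lock, so they cannot interleave and observe a half-updated
 * queue reference.
 */
public class AttemptLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private String queue = "root.tier2.livy";
  private boolean inQueue = true;

  void moveApplication(String targetQueue) {
    lock.writeLock().lock();
    try {
      inQueue = false;          // detach from the old queue
      queue = targetQueue;      // attach to the new queue
      inQueue = true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  void removeApplicationAttempt() {
    lock.writeLock().lock();
    try {
      if (inQueue) {            // only remove if still registered with the queue
        inQueue = false;
      }                         // otherwise a concurrent move already detached it;
                                // do nothing instead of crashing the RM
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}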



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11692) Support mixed cgroup v1/v2 controller structure

2024-05-15 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-11692.
--
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Support mixed cgroup v1/v2 controller structure
> ---
>
> Key: YARN-11692
> URL: https://issues.apache.org/jira/browse/YARN-11692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
>
> There were heavy changes on the device side in cgroup v2. To keep supporting 
> FPGAs and GPUs in the short term, mixed structures where some of the cgroup 
> controllers come from v1 while others come from v2 should be supported. More info: 
> https://dropbear.xyz/2023/05/23/devices-with-cgroup-v2/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11669) [Umbrella] cgroup v2 support

2024-05-13 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-11669.
--
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [Umbrella] cgroup v2 support
> 
>
> Key: YARN-11669
> URL: https://issues.apache.org/jira/browse/YARN-11669
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Ferenc Erdelyi
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.5.0
>
>
> cgroup v2 has some fundamental changes compared to v1. RHEL9, Ubuntu 22 
> already moved to cgroup v2 as a default, hence YARN should support it. This 
> umbrella tracks the required work.
> [Documentation|https://docs.kernel.org/admin-guide/cgroup-v2.html]
> A way to test the newly added features:
> # Turn on cgroup v1 based on the current 
> [documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html].
> # System prerequisites:
> ## the file {{/etc/mtab}} should contain a mount path with the file system 
> type {{cgroup2}}, by default this could be {{/sys/fs/cgroup}} on most OS's
> ## the {{cgroup.subtree_control}} file should contain the necessary 
> controllers (update it with: {{echo "+cpu +io +memory" > 
> cgroup.subtree_control}})
> ## either create the YARN hierarchy and give recursive access to the user 
> running the NM on the node. The hierarchy is {{hadoop-yarn}} by default 
> (controlled by 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy}}), and 
> recursive mode is required, because as soon as the directory is created it 
> will be filled with the controller files which YARN will try to edit.
> ### Alternatively if the NM process user has access rights on the 
> {{/sys/fs/cgroup}} directory it'll try to create the hierarchy and update the 
> {{cgroup.subtree_control}} file.
> # YARN configuration
> ## {{yarn.nodemanager.linux-container-executor.cgroups.mount-path}} should 
> point to the directory where the cgroup2 structure is mounted and the 
> {{hadoop-yarn}} hierarchy was created
> ## {{yarn.nodemanager.linux-container-executor.cgroups.v2.enabled}} should be 
> set to {{true}}
> ## Enable a cgroup controller, like {{yarn.nodemanager.resource.cpu.enabled}}: {{true}}
> # Launch the NM and monitor the cgroup files on container launches (i.e: 
> {{/sys/fs/cgroup/hadoop-yarn/container_id/cpu.weight}})



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11696) Add debug-level logs in RMAppImpl#aggregateLogReport and RMAppImpl#getLogAggregationStatusForAppReport

2024-05-13 Thread Susheel Gupta (Jira)
Susheel Gupta created YARN-11696:


 Summary: Add debug-level logs in RMAppImpl#aggregateLogReport and 
RMAppImpl#getLogAggregationStatusForAppReport
 Key: YARN-11696
 URL: https://issues.apache.org/jira/browse/YARN-11696
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Susheel Gupta
Assignee: Susheel Gupta


The events keep increasing in the event queue and many event threads are blocked.
To discover the deadlocking threads, add a few debug-level logs to 
RMAppImpl#aggregateLogReport and RMAppImpl#getLogAggregationStatusForAppReport. 
The thread dump below shows the RM event dispatcher blocked while trying to take 
the RMAppImpl write lock in aggregateLogReport:
{code:java}
"RM Event dispatcher" #93 prio=5 os_prio=0 tid=0x7fcb67120800 nid=0x13e62 
waiting on condition [0x7fbef632a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x7fc44cada248> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.aggregateLogReport(RMAppImpl.java:1799)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handleLogAggregationStatus(RMNodeImpl.java:1478)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.access$500(RMNodeImpl.java:104)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1239)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl$StatusUpdateWhenHealthyTransition.transition(RMNodeImpl.java:1195)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
        - locked <0x7fc04c0b6970> (a 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:667)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl.handle(RMNodeImpl.java:101)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1124)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher.handle(ResourceManager.java:1108)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:219)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:133)
        at java.lang.Thread.run(Thread.java:748) {code}
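
A minimal sketch of the kind of guarded debug logging proposed here (the class 
and method shape are illustrative, not the actual RMAppImpl code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative sketch only: the shape of the guarded debug logging this issue proposes. */
public class LogAggregationDebugSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(LogAggregationDebugSketch.class);

  void aggregateLogReport(String nodeId, String report) {
    if (LOG.isDebugEnabled()) {
      // Log before taking the write lock so a stuck acquisition is visible in the logs.
      LOG.debug("aggregateLogReport called for node {} with report {}", nodeId, report);
    }
    // ... acquire the write lock and update the aggregation status ...
  }
}
{code}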



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11695) Fixed non-idempotent tests in `TestTaskRunner`

2024-05-04 Thread Kaiyao Ke (Jira)
Kaiyao Ke created YARN-11695:


 Summary: Fixed non-idempotent tests in `TestTaskRunner`
 Key: YARN-11695
 URL: https://issues.apache.org/jira/browse/YARN-11695
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Kaiyao Ke


All tests in `org.apache.hadoop.yarn.sls.scheduler.TestTaskRunner` are 
non-idempotent and fail upon repeated execution within the same JVM instance due 
to self-induced state pollution. Specifically, the test runs change static 
fields (e.g. `PreStartTask.first`) in the task classes without restoring them. 
Therefore, repeated runs throw assertion errors.

Sample error message of `TestTaskRunner#testPreStartQueueing` in repeated test 
run:
```
java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at 
org.apache.hadoop.yarn.sls.scheduler.TestTaskRunner.testPreStartQueueing(TestTaskRunner.java:244)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
``` 
The fix is to explicitly reset the static variables (countdown latches and 
booleans) at the start of each test, as sketched below.
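
A minimal sketch of that reset pattern (the field names are stand-ins, not the 
actual SLS task classes):
{code:java}
import java.util.concurrent.CountDownLatch;
import org.junit.Before;

/** Illustrative sketch only: reset shared static test state before every test method. */
public class StaticStateResetSketch {
  // Stand-ins for the kind of static fields the SLS task classes keep.
  static CountDownLatch latch;
  static boolean first;

  @Before
  public void resetStaticState() {
    // Re-create the latch and reset the flag so a previous run in the same JVM
    // cannot leak state into this one.
    latch = new CountDownLatch(1);
    first = true;
  }
}
{code}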



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11694) 2 tests are non-idempotent (passes in the first run but fails in repeated runs in the same JVM)

2024-05-03 Thread Kaiyao Ke (Jira)
Kaiyao Ke created YARN-11694:


 Summary: 2 tests are non-idempotent (passes in the first run but 
fails in repeated runs in the same JVM)
 Key: YARN-11694
 URL: https://issues.apache.org/jira/browse/YARN-11694
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Kaiyao Ke


## TestTimelineReaderMetrics#testTimelineReaderMetrics

`org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderMetrics#testTimelineReaderMetrics`
 does not perform a source unregistration after test execution, so the 
`TimelineReaderMetrics.getInstance()` call in repeated runs will throw an error 
since the metrics source `TimelineReaderMetrics` already exists.

Error message in the 2nd run:

```
org.apache.hadoop.metrics2.MetricsException: Metrics source TimelineReaderMetrics already exists!
 at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
 at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
 at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
 at org.apache.hadoop.yarn.server.timelineservice.metrics.TimelineReaderMetrics.getInstance(TimelineReaderMetrics.java:61)
 at org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderMetrics.setup(TestTimelineReaderMetrics.java:52)
 at java.base/java.lang.reflect.Method.invoke(Method.java:568)
 at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
 at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
```
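
One possible cleanup for this test, sketched here as an assumption rather than 
the actual fix: reset the default metrics system after each test so the source 
can be registered again in the next run.
{code:java}
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.junit.After;

public class MetricsSourceCleanupSketch {
  @After
  public void tearDown() {
    // Drop all registered sources so a repeated run in the same JVM can
    // register "TimelineReaderMetrics" again without a MetricsException.
    DefaultMetricsSystem.shutdown();
  }
}
{code}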

 

## TestFederationStateStoreClientMetrics#testSuccessfulCalls

`org.apache.hadoop.yarn.server.federation.store.metrics.TestFederationStateStoreClientMetrics#testSuccessfulCalls`
 accounts for the historical number of successful calls, but not for the 
historical average latency of those calls. For example, it asserts that 
`FederationStateStoreClientMetrics.getLatencySucceededCalls()` is 100 after the 
`goodStateStore.registerSubCluster(100);` call. However, in the second 
execution of the test, 2 historical calls from the first execution (with 
latencies 100 and 200 respectively) have already been recorded, so 
`FederationStateStoreClientMetrics.getLatencySucceededCalls()` will be 133.3 
(the mean of 100, 200 and 100).

 

Error message in the 2nd run:

```
java.lang.AssertionError: expected:<100.0> but was:<133.34>
 at org.junit.Assert.fail(Assert.java:89)
 at org.junit.Assert.failNotEquals(Assert.java:835)
 at org.junit.Assert.assertEquals(Assert.java:555)
 at org.junit.Assert.assertEquals(Assert.java:685)
 at org.apache.hadoop.yarn.server.federation.store.metrics.TestFederationStateStoreClientMetrics.testSuccessfulCalls(TestFederationStateStoreClientMetrics.java:63)
 at java.base/java.lang.reflect.Method.invoke(Method.java:568)
```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11693) Refactor Container scheduler

2024-04-30 Thread Mohit Gaggar (Jira)
Mohit Gaggar created YARN-11693:
---

 Summary: Refactor Container scheduler
 Key: YARN-11693
 URL: https://issues.apache.org/jira/browse/YARN-11693
 Project: Hadoop YARN
  Issue Type: Task
  Components: scheduler, scheduler preemption
Reporter: Mohit Gaggar


The ContainerScheduler class, responsible for scheduling containers on nodes, 
handles multiple smaller responsibilities, which makes it hard to extend its 
functionality.
This PR breaks the class responsibilities down into:
 * ContainerQueueManager: handles all queuing-related functions, such as adding 
to and removing from the queue
 * ContainerStarter: maintains the running queue of containers and starts new 
containers
 * ContainerPolicyManager: handles the container termination/pausing policy 
when not enough resources are available
 * ContainerScheduler: the main class, which works with the other helper 
classes to maintain the container queues

!https://msdata.visualstudio.com/25bee5cc-1a60-44a1-904d-a734363b40d4/_apis/git/repositories/719ef898-e962-4b70-a49b-03c67abb2b07/pullRequests/1249358/attachments/Refactoring%20Container%20Scheduler%20%281%29.png|width=710,height=441!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11692) Support mixed cgroup v1/v2 controller structure

2024-04-29 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11692:


 Summary: Support mixed cgroup v1/v2 controller structure
 Key: YARN-11692
 URL: https://issues.apache.org/jira/browse/YARN-11692
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


There were heavy changes on the device side in cgroup v2. To keep supporting 
FPGAs and GPUs in the short term, mixed structures where some of the cgroup 
controllers come from v1 while others come from v2 should be supported. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11691) The yarn web proxy doesn't support HTTP POST method

2024-04-28 Thread JJJJude (Jira)
ude created YARN-11691:
--

 Summary: The yarn web proxy doesn't support HTTP POST method
 Key: YARN-11691
 URL: https://issues.apache.org/jira/browse/YARN-11691
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webproxy
Affects Versions: 3.3.6, 3.2.2
Reporter: ude
 Fix For: 3.3.6, 3.2.2


When a Flink task is running in the YARN environment, the client encounters an 
HTTP ERROR 405 when calling the HTTP proxy REST API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11690) Update container executor to use CGROUP2_SUPER_MAGIC in cgroup 2 scenarios

2024-04-24 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11690:


 Summary: Update container executor to use CGROUP2_SUPER_MAGIC in 
cgroup 2 scenarios
 Key: YARN-11690
 URL: https://issues.apache.org/jira/browse/YARN-11690
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: container-executor
Reporter: Benjamin Teke
Assignee: Benjamin Teke


The container executor function {{write_pid_to_cgroup_as_root}} writes the PID 
of the newly launched container to the correct cgroup.procs file. However, it 
checks whether the file is located on a cgroup filesystem, and it does that 
check using the filesystem magic number, which differs between v1 and v2. The 
check should accept both v1 and v2 filesystems. 

{code:java}
/**
 * Write the pid of the current process to the cgroup file.
 * cgroup_file: Path to cgroup file where pid needs to be written to.
 */
static int write_pid_to_cgroup_as_root(const char* cgroup_file, pid_t pid) {
  int rc = 0;
  uid_t user = geteuid();
  gid_t group = getegid();
  if (change_effective_user(0, 0) != 0) {
rc =  -1;
goto cleanup;
  }

  // statfs
  struct statfs buf;
  if (statfs(cgroup_file, &buf) == -1) {
fprintf(LOGFILE, "Can't statfs file %s as node manager - %s\n", cgroup_file,
   strerror(errno));
rc = -1;
goto cleanup;
  } else if (buf.f_type != CGROUP_SUPER_MAGIC) {
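    // NOTE (this issue): this branch also fires for cgroup v2 mounts, because
    // f_type there is CGROUP2_SUPER_MAGIC rather than CGROUP_SUPER_MAGIC, so the
    // check should accept both magic numbers.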
fprintf(LOGFILE, "Pid file %s is not located on cgroup filesystem\n", 
cgroup_file);
rc = -1;
goto cleanup;
  }

{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11689) Update getErrorWithDetails method to provide more meaningful error messages

2024-04-24 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11689:


 Summary: Update getErrorWithDetails method to provide more 
meaningful error messages
 Key: YARN-11689
 URL: https://issues.apache.org/jira/browse/YARN-11689
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


The method AbstractCGroupsHandler.getErrorWithDetails hides quite a lot of 
information. It would be useful to show the underlying exception and its 
message as well, by default.
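
A minimal sketch of the direction (the method signature here is illustrative, 
not the actual AbstractCGroupsHandler code):
{code:java}
/** Illustrative sketch: include the underlying exception in the composed error message. */
final class CGroupsErrorMessageSketch {
  private CGroupsErrorMessageSketch() {
  }

  static String getErrorWithDetails(String baseError, String subsystem,
      String cgroupPath, Exception cause) {
    // Appending the cause (its class name and message) keeps the original text
    // but no longer hides what actually went wrong underneath.
    return String.format("%s Subsystem:%s Path:%s Exception: %s",
        baseError, subsystem, cgroupPath, cause);
  }
}
{code}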



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-04-23 Thread Ferenc Erdelyi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenc Erdelyi resolved YARN-11662.
---
Resolution: Duplicate

Duplicate of YARN-11538

> RM Web API endpoint queue reference differs from JMX endpoint for CS
> 
>
> Key: YARN-11662
> URL: https://issues.apache.org/jira/browse/YARN-11662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.4.0
>Reporter: Ferenc Erdelyi
>Assignee: Ferenc Erdelyi
>Priority: Major
>
> When a placement is not successful (because of the lack of a placement rule 
> or an unsuccessful placement), the application is placed in the default queue 
> instead of the root.default. The parent queue won't be defined when there is 
> no placement rule. This causes an inconsistency between the JMX endpoint 
> (reporting the app runs under root.default) and the RM Web API endpoint 
> (reporting the app runs under the default queue).
> Similarly, when we submit an application with an unambiguous leaf queue 
> specified, the RM Web API endpoint will report the queue as the leaf queue 
> name instead of the full queue path. However, the full queue path is the 
> expected value to be consistent with the JMX endpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt

2024-04-20 Thread wangzhihui (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihui resolved YARN-11688.
---
Resolution: Resolved

> FS-CS converter: call System.exit replaced with ExitUtil.halt
> -
>
> Key: YARN-11688
> URL: https://issues.apache.org/jira/browse/YARN-11688
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Blocker
> Fix For: 3.3.0
>
> Attachments: image-2024-04-20-22-17-49-522.png
>
>
> System.exit logic was added in YARN-10191 to avoid issues with the tool never 
> terminating.
> This causes the VM to be terminated while TestFSConfigToCSConfigConverterMain 
> is running.
> The ExitUtil tool in Hadoop Common facilitates process termination for tests 
> and debugging.
> Replacing the System.exit call with ExitUtil.halt would be more suitable for 
> this purpose.
> {code:java}
> // code placeholder
> Crashed tests:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> Command was /bin/sh -c cd 
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
>  && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m 
> -XX:+HeapDumpOnOutOfMemoryError -jar 
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar
>  
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire
>  2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp 
> surefire_1524181064953128391099tmp
> Process Exit Code: 0
> Crashed tests:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458)
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299)
>   at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
>   at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149)
>   at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991)
>   at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
>   at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
>   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
>   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956)
>   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
>   at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor

[jira] [Created] (YARN-11688) FS-CS converter: call System.exit replaced with ExitUtil.halt

2024-04-20 Thread wangzhihui (Jira)
wangzhihui created YARN-11688:
-

 Summary: FS-CS converter: call System.exit replaced with 
ExitUtil.halt
 Key: YARN-11688
 URL: https://issues.apache.org/jira/browse/YARN-11688
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: wangzhihui
Assignee: wangzhihui
 Fix For: 3.3.0
 Attachments: image-2024-04-20-22-17-49-522.png

System.exit logic was added in YARN-10191 to avoid issues with the tool never 
terminating.

This causes the VM to be terminated while TestFSConfigToCSConfigConverterMain is 
running.

The ExitUtil tool in Hadoop Common facilitates process termination for tests and 
debugging.

Replacing the System.exit call with ExitUtil.halt would be more suitable for 
this purpose. A minimal sketch of that pattern follows.
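
A minimal sketch of the proposed pattern, assuming the standard 
org.apache.hadoop.util.ExitUtil API:
{code:java}
import org.apache.hadoop.util.ExitUtil;

public class ConverterExitSketch {
  public static void main(String[] args) {
    // In production this behaves like Runtime.halt(); in tests, calling
    // ExitUtil.disableSystemHalt() first turns it into a catchable HaltException,
    // so the surefire JVM is not killed mid-run.
    ExitUtil.halt(0);
  }
}
{code}
For reference, the original crash seen when the forked VM is killed by System.exit: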
{code:java}
// code placeholder
Crashed tests:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
Command was /bin/sh -c cd 
/home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx2048m 
-XX:+HeapDumpOnOutOfMemoryError -jar 
/home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire/surefirebooter2247421570320659117.jar
 
/home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6352/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/surefire
 2024-04-17T14-34-01_743-jvmRun1 surefire5773923906402489727tmp 
surefire_1524181064953128391099tmp
Process Exit Code: 0
Crashed tests:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.converter.TestFSConfigToCSConfigConverterMain
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:511)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:458)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:299)
at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:247)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1149)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:991)
at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:837)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:956)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The 
forked VM terminated without properly saying goodbye. VM crash or System.exit 
called?

[jira] [Created] (YARN-11687) Update CGroupsResourceCalculator to track usages using cgroupv2

2024-04-18 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11687:


 Summary: Update CGroupsResourceCalculator to track usages using 
cgroupv2
 Key: YARN-11687
 URL: https://issues.apache.org/jira/browse/YARN-11687
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


[CGroupsResourceCalculator|https://github.com/apache/hadoop/blob/f609460bda0c2bd87dd3580158e549e2f34f14d5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsResourceCalculator.java]
 should also be updated to handle the cgroup v2 changes.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11686) Correct traversing indexs when scheduling asynchronously using Capacity Scheduler

2024-04-17 Thread Yihe Li (Jira)
Yihe Li created YARN-11686:
--

 Summary: Correct traversing indexs when scheduling asynchronously 
using Capacity Scheduler
 Key: YARN-11686
 URL: https://issues.apache.org/jira/browse/YARN-11686
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Yihe Li


When scheduling asynchronously using the Capacity Scheduler, the traversal 
indexes in `CapacityScheduler#schedule` will always contain the `start` index 
twice. This may not be in line with the original intention and needs to be 
corrected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11685) Create a config to enable/disable cgroup v2 functionality

2024-04-16 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11685:


 Summary: Create a config to enable/disable cgroup v2 functionality
 Key: YARN-11685
 URL: https://issues.apache.org/jira/browse/YARN-11685
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


Various OSes mount cgroup v2 differently: some of them mount both the v1 and v2 
structures, others mount a hybrid structure. To avoid initialization issues, the 
cgroup v1/v2 functionality should be selected via a config property.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11684) PriorityQueueComparator violates general contract

2024-04-12 Thread Tamas Domok (Jira)
Tamas Domok created YARN-11684:
--

 Summary: PriorityQueueComparator violates general contract
 Key: YARN-11684
 URL: https://issues.apache.org/jira/browse/YARN-11684
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.5.0
Reporter: Tamas Domok
Assignee: Tamas Domok


YARN-10178 tried to fix the issue, but there are still 2 properties that might 
change during sorting, which causes an exception.

{code}
2024-04-10 12:36:56,420 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-28,5,main] threw an Exception.
java.lang.IllegalArgumentException: Comparison method violates its general 
contract!
at java.util.TimSort.mergeHi(TimSort.java:899)
at java.util.TimSort.mergeAt(TimSort.java:516)
at java.util.TimSort.mergeCollapse(TimSort.java:441)
at java.util.TimSort.sort(TimSort.java:245)
at java.util.Arrays.sort(Arrays.java:1512)
at 
java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:348)
at java.util.stream.Sink$ChainedReference.end(Sink.java:258)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:483)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1719)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1654)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1811)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1557)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:539)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:591)
{code}

The values of `queue.getAccessibleNodeLabels()` and `queue.getPriority()` could 
change in another thread while the `queues` are being sorted. Those values 
should be captured when constructing the PriorityQueueResourcesForSorting helper 
object, as sketched below.
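
A self-contained sketch of that snapshotting idea (simplified types, not the 
actual CapacityScheduler code):
{code:java}
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch only: snapshot the mutable queue properties once, so the
 * comparator sees a stable value for the whole sort.
 */
final class PriorityQueueResourcesForSortingSketch {
  private final Set<String> accessibleNodeLabels;
  private final int priority;

  PriorityQueueResourcesForSortingSketch(Set<String> accessibleNodeLabels, int priority) {
    // Copy at construction time; later changes made by other threads no longer
    // affect comparisons, so the comparator stays consistent during TimSort.
    this.accessibleNodeLabels = new HashSet<>(accessibleNodeLabels);
    this.priority = priority;
  }

  Set<String> getAccessibleNodeLabels() {
    return accessibleNodeLabels;
  }

  int getPriority() {
    return priority;
  }
}
{code}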



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11683) RM crash due to RELEASE_CONTAINER NPE

2024-04-10 Thread Yuan Luo (Jira)
Yuan Luo created YARN-11683:
---

 Summary: RM crash due to RELEASE_CONTAINER NPE
 Key: YARN-11683
 URL: https://issues.apache.org/jira/browse/YARN-11683
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Yuan Luo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11681) Update the cgroup documentation with v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11681:


 Summary: Update the cgroup documentation with v2 support
 Key: YARN-11681
 URL: https://issues.apache.org/jira/browse/YARN-11681
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


Update the related 
[documentation|https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html]
 with v2 support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11680) Update FpgaResourceHandler for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11680:


 Summary: Update FpgaResourceHandler for cgroup v2 support
 Key: YARN-11680
 URL: https://issues.apache.org/jira/browse/YARN-11680
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
FpgaResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/fpga/FpgaResourceHandlerImpl.java#L55]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11679) Update GpuResourceHandler for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11679:


 Summary: Update GpuResourceHandler for cgroup v2 support
 Key: YARN-11679
 URL: https://issues.apache.org/jira/browse/YARN-11679
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
GpuResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/e8fa192f07b6f2e7a0b03813edca03c505a8ac1b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java#L45]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11678) Update CGroupElasticMemoryController for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11678:


 Summary: Update CGroupElasticMemoryController for cgroup v2 support
 Key: YARN-11678
 URL: https://issues.apache.org/jira/browse/YARN-11678
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
CGroupElasticMemoryController's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupElasticMemoryController.java#L58]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11677) Update OutboundBandwidthResourceHandler implementation for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11677:


 Summary: Update OutboundBandwidthResourceHandler implementation 
for cgroup v2 support
 Key: YARN-11677
 URL: https://issues.apache.org/jira/browse/YARN-11677
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
OutboundBandwidthResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/2064ca015d1584263aac0cc20c60b925a3aff612/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/TrafficControlBandwidthHandlerImpl.java#L43]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11676) Update CGroupsBlkioResourceHandler implementation for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11676:


 Summary: Update CGroupsBlkioResourceHandler implementation for 
cgroup v2 support
 Key: YARN-11676
 URL: https://issues.apache.org/jira/browse/YARN-11676
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
CGroupsBlkioResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsBlkioResourceHandlerImpl.java#L46]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11675) Update MemoryResourceHandler implementation for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11675:


 Summary: Update MemoryResourceHandler implementation for cgroup v2 
support
 Key: YARN-11675
 URL: https://issues.apache.org/jira/browse/YARN-11675
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
MemoryResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsMemoryResourceHandlerImpl.java#L47-L46]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11674) Update CpuResourceHandler implementation for cgroup v2 support

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11674:


 Summary: Update CpuResourceHandler implementation for cgroup v2 
support
 Key: YARN-11674
 URL: https://issues.apache.org/jira/browse/YARN-11674
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


cgroup v2 has some changes in various controllers (some changed their 
functionality, some were removed). This task is about checking if 
CpuResourceHandler's 
[implementation|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsCpuResourceHandlerImpl.java#L60]
 needs any updates.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11673) Extend the cgroup mount functionality to mount the v2 structure

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11673:


 Summary: Extend the cgroup mount functionality to mount the v2 
structure
 Key: YARN-11673
 URL: https://issues.apache.org/jira/browse/YARN-11673
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke


YARN has a --mount-cgroup operation in the 
[container-executor|https://github.com/apache/hadoop/blob/9c7b8cf54ea88833d54fc71a9612c448dc0eb78d/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L2929]
 which mounts each controller's cgroup folder to a specified path. In cgroup v2 
the controller structure changed: it is flat, so there are no longer separate 
controller paths. To stay compatible with v1 a new mount method should be added, 
but its functionality can be simplified considerably for v2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11672) Create a CgroupHandler implementation for cgroup v2

2024-04-09 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11672:


 Summary: Create a CgroupHandler implementation for cgroup v2
 Key: YARN-11672
 URL: https://issues.apache.org/jira/browse/YARN-11672
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Benjamin Teke
Assignee: Benjamin Teke


[CGroupsHandler's|https://github.com/apache/hadoop/blob/69b328943edf2f61c8fc139934420e3f10bf3813/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/CGroupsHandler.java#L36]
 current implementation contains the logic to mount and set up the YARN-specific 
cgroup v1 hierarchy. A similar v2 implementation should be created that allows 
initialising the v2 structure.
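
For orientation, a minimal sketch of what such v2 initialisation typically boils 
down to (the mount point, directory name and controller list below are 
assumptions, not the eventual YARN implementation):

{code:java}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// cgroup v2 uses a single unified hierarchy, so setup is mostly creating the
// YARN directory and enabling controllers via cgroup.subtree_control.
public class CgroupV2InitSketch {
  public static void main(String[] args) throws Exception {
    Path unifiedRoot = Paths.get("/sys/fs/cgroup");   // assumed v2 mount point
    Path yarnRoot = unifiedRoot.resolve("yarn");      // assumed YARN hierarchy root
    Files.createDirectories(yarnRoot);
    // enable the controllers YARN needs for the children of the root;
    // "+io" takes the place of the old v1 blkio controller
    Files.write(unifiedRoot.resolve("cgroup.subtree_control"),
        "+cpu +memory +io".getBytes(StandardCharsets.UTF_8));
  }
}
{code}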



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11670) Add CallerContext in NodeManager

2024-04-08 Thread Dinesh Chitlangia (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Chitlangia resolved YARN-11670.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Thanks [~yangjiandan] for the contribution and [~whbing] and [~slfan1989] for the 
reviews.

> Add CallerContext in NodeManager
> 
>
> Key: YARN-11670
> URL: https://issues.apache.org/jira/browse/YARN-11670
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Currently, MR and Spark have added caller context, enabling tracing of 
> HDFS/ResourceManager operations from Spark apps and MapReduce apps. However, 
> operations from NodeManagers cannot be identified in the audit log. For 
> example, HDFS operations issued from NodeManagers during resource 
> localization cannot be identified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11444) Improve YARN md documentation format

2024-04-07 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11444.
---
   Fix Version/s: 3.4.1
  3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.1, 3.5.0  (was: 3.5.0)
  Resolution: Fixed

> Improve YARN md documentation format
> 
>
> Key: YARN-11444
> URL: https://issues.apache.org/jira/browse/YARN-11444
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
>
> 1. Fix some typos



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11671) Tests in hadoop-yarn-server-router are not running

2024-04-05 Thread Ayush Saxena (Jira)
Ayush Saxena created YARN-11671:
---

 Summary: Tests in hadoop-yarn-server-router are not running
 Key: YARN-11671
 URL: https://issues.apache.org/jira/browse/YARN-11671
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Ayush Saxena


[https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/1549/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-router.txt]

 
{noformat}
[INFO] --- maven-surefire-plugin:3.0.0-M1:test (default-test) @ 
hadoop-yarn-server-router ---
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11663) [Federation] Add Cache Entity Nums Limit.

2024-04-01 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11663.
---
   Fix Version/s: 3.4.1
  3.5.0
Target Version/s: 3.4.0
Assignee: Shilun Fan
  Resolution: Fixed

> [Federation] Add Cache Entity Nums Limit.
> -
>
> Key: YARN-11663
> URL: https://issues.apache.org/jira/browse/YARN-11663
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation, yarn
>Affects Versions: 3.4.0
>Reporter: Yuan Luo
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: image-2024-03-14-18-12-28-426.png, 
> image-2024-03-14-18-12-49-950.png, image-2024-03-15-10-50-32-860.png
>
>
> !image-2024-03-14-18-12-28-426.png!
> !image-2024-03-14-18-12-49-950.png!
> hi [~slfan1989] After applying this feature to our prod env, I found that the 
> memory of the router keeps growing over time. This is because, after jobs 
> finish, we won't access the expired keys to trigger the cleanup mechanism. 
> Would it be better to add a maximum limit on the number of cached entries?
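
A rough illustration of one way to cap the cache size, assuming a Guava-style 
cache would be acceptable here (the field name, key/value types and limits are 
made up for the example):

{code:java}
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Sketch only: bounding the entry count keeps router memory flat even when
// expired keys are never read again after their jobs finish.
public class BoundedFederationCacheSketch {
  private final Cache<String, Object> appHomeSubClusterCache = CacheBuilder.newBuilder()
      .maximumSize(10_000)                     // assumed limit; would be configurable
      .expireAfterWrite(30, TimeUnit.MINUTES)  // time-based expiry as before
      .build();
}
{code}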



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11668) Potential concurrent modification exception for node attributes of node manager

2024-03-28 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11668.
---
   Fix Version/s: 3.4.1
  3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.1
Assignee: Junfan Zhang
  Resolution: Fixed

> Potential concurrent modification exception for node attributes of node 
> manager
> ---
>
> Key: YARN-11668
> URL: https://issues.apache.org/jira/browse/YARN-11668
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
> Attachments: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg
>
>
> The RM crashes when encountering the stacktrace shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11670) Add CallerContext in NodeManager

2024-03-28 Thread Jiandan Yang (Jira)
Jiandan Yang  created YARN-11670:


 Summary: Add CallerContext in NodeManager
 Key: YARN-11670
 URL: https://issues.apache.org/jira/browse/YARN-11670
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Jiandan Yang 


Currently, MR and Spark have added caller context, enabling tracing of 
HDFS/ResourceManager operations from Spark apps and MapReduce apps. However, 
operations from NodeManagers cannot be identified in the audit log. For example, 
HDFS operations issued from NodeManagers during resource localization cannot be 
identified.
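
For illustration, a minimal sketch of how a caller context is usually attached 
before issuing RPCs, mirroring what MR and Spark already do (the context string 
format here is only an assumption):

{code:java}
import org.apache.hadoop.ipc.CallerContext;

// Sketch only: tag NM-originated calls (e.g. during resource localization) so
// they show up identifiably in the HDFS/RM audit logs.
public class NmCallerContextSketch {
  static void setNodeManagerContext(String containerId) {
    CallerContext context = new CallerContext.Builder("nm_localize_" + containerId).build();
    CallerContext.setCurrent(context);
  }
}
{code}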



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11669) cgroups v2 support for YARN

2024-03-28 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11669:
-

 Summary: cgroups v2 support for YARN
 Key: YARN-11669
 URL: https://issues.apache.org/jira/browse/YARN-11669
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: yarn
Reporter: Ferenc Erdelyi


cgroups v2 is becoming the default for OSs like RHEL9.
Support for it in YARN has to be implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11668) Potential concurrent modification exception for node attributes of node manager

2024-03-27 Thread Junfan Zhang (Jira)
Junfan Zhang created YARN-11668:
---

 Summary: Potential concurrent modification exception for node 
attributes of node manager
 Key: YARN-11668
 URL: https://issues.apache.org/jira/browse/YARN-11668
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junfan Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11667) Federation: ResourceRequestComparator occurs NPE when using low version of hadoop submit application

2024-03-21 Thread qiuliang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qiuliang resolved YARN-11667.
-
Resolution: Won't Do

> Federation: ResourceRequestComparator occurs NPE when using low version of 
> hadoop submit application
> 
>
> Key: YARN-11667
> URL: https://issues.apache.org/jira/browse/YARN-11667
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: amrmproxy
>Affects Versions: 3.4.0
>Reporter: qiuliang
>Priority: Major
>  Labels: pull-request-available
>
> When an application is submitted using a lower version of Hadoop, the 
> ResourceRequest built by the AM has no ExecutionTypeRequest. After the 
> ResourceRequest is submitted to AMRMProxy, an NPE occurs when AMRMProxy 
> reconstructs the AllocateRequest to add the ResourceRequest to its ask.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11626) Optimization of the safeDelete operation in ZKRMStateStore

2024-03-21 Thread Dinesh Chitlangia (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Chitlangia resolved YARN-11626.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

> Optimization of the safeDelete operation in ZKRMStateStore
> --
>
> Key: YARN-11626
> URL: https://issues.apache.org/jira/browse/YARN-11626
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> h1. Description 
>  * It can be observed that removing the app info started at 06:17:20, but the 
> NoNodeException was received at 06:17:35. 
>  * During the 15s interval, Curator was retrying the metadata operation. Due 
> to the non-idempotent nature of the Zookeeper deletion operation, in one of 
> the retry attempts the metadata operation was successful but no response was 
> received. The next retry resulted in a NoNodeException, triggering the 
> STATE_STORE_FENCED event and ultimately causing the current ResourceManager 
> to switch to standby.
> {code:java}
> 2023-10-28 06:17:20,359 INFO  recovery.RMStateStore 
> (RMStateStore.java:transition(333)) - Removing info for app: 
> application_1697410508608_140368
> 2023-10-28 06:17:20,359 INFO  resourcemanager.RMAppManager 
> (RMAppManager.java:checkAppNumCompletedLimit(303)) - Application should be 
> expired, max number of completed apps kept in memory met: 
> maxCompletedAppsInMemory = 1000, removing app 
> application_1697410508608_140368 from memory:
> 2023-10-28 06:17:35,665 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(337)) - Error removing app: 
> application_1697410508608_140368
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 2023-10-28 06:17:35,666 INFO  recovery.RMStateStore 
> (RMStateStore.java:handleStoreEvent(1147)) - RMStateStore state change from 
> ACTIVE to FENCED
> 2023-10-28 06:17:35,666 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> STATE_STORE_FENCED, caused by 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 2023-10-28 06:17:35,666 INFO  resourcemanager.ResourceManager 
> (ResourceManager.java:transitionToStandby(1309)) - Transitioning to standby 
> state
>  {code}
> h1. Solution
> The NoNodeException clearly indicates that the Znode no longer exists, so we 
> can safely ignore this exception to avoid triggering a larger impact on the 
> cluster caused by ResourceManager failover.
> h1. Other
> We also need to discuss and optimize the same issues in safeCreate.
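
A minimal sketch of that idea, with an interface standing in for the 
Curator-backed helper (names are illustrative, not the real ZKRMStateStore code):

{code:java}
import org.apache.zookeeper.KeeperException.NoNodeException;

// Sketch only: treat NoNodeException as success, because the znode being gone
// is exactly the state the delete was trying to reach; this avoids fencing the
// state store and failing the RM over after a retried delete.
public class SafeDeleteSketch {
  interface ZkDeleter { void delete(String path) throws Exception; } // stand-in for the real helper

  static void safeDelete(ZkDeleter zk, String path) throws Exception {
    try {
      zk.delete(path);
    } catch (NoNodeException e) {
      // already deleted (e.g. an earlier retry succeeded without a response), so ignore it
    }
  }
}
{code}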



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11667) Federation: ResourceRequestComparator occurs NPE when using low version of hadoop submit application

2024-03-20 Thread qiuliang (Jira)
qiuliang created YARN-11667:
---

 Summary: Federation: ResourceRequestComparator occurs NPE when 
using low version of hadoop submit application
 Key: YARN-11667
 URL: https://issues.apache.org/jira/browse/YARN-11667
 Project: Hadoop YARN
  Issue Type: Bug
  Components: amrmproxy
Affects Versions: 3.4.0
Reporter: qiuliang


When an application is submitted using a lower version of Hadoop, the 
ResourceRequest built by the AM has no ExecutionTypeRequest. After the 
ResourceRequest is submitted to AMRMProxy, an NPE occurs when AMRMProxy 
reconstructs the AllocateRequest to add the ResourceRequest to its ask.
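
For illustration only (the issue was ultimately closed as Won't Do), a small 
sketch of one way such an NPE could be avoided, by defaulting the missing field 
before the request is re-added to the ask:

{code:java}
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

// Sketch only: older clients may omit ExecutionTypeRequest, so fill in the
// default (GUARANTEED) before the comparator ever sees the request.
public class ExecutionTypeDefaultSketch {
  static void ensureExecutionType(ResourceRequest request) {
    if (request.getExecutionTypeRequest() == null) {
      request.setExecutionTypeRequest(ExecutionTypeRequest.newInstance());
    }
  }
}
{code}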



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-5305) Yarn Application Log Aggregation fails due to NM can not get correct HDFS delegation token III

2024-03-20 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-5305.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Yarn Application Log Aggregation fails due to NM can not get correct HDFS 
> delegation token III
> --
>
> Key: YARN-5305
> URL: https://issues.apache.org/jira/browse/YARN-5305
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Xianyin Xin
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Different from YARN-5098 and YARN-5302, this problem happens when the AM submits 
> a startContainer request with a new HDFS token (say, tokenB) which is not 
> managed by YARN, so two tokens exist in the credentials of the user on the NM: 
> one is tokenB, the other is the one renewed on the RM (tokenA). If tokenB is 
> selected when connecting to HDFS and tokenB expires, an exception happens.
> Supplementary: this problem happens because the AM didn't use the service name 
> as the token alias in the credentials, so two tokens for the same service can 
> co-exist in one Credentials object. The TokenSelector can only select the first 
> matched token; it doesn't care whether the token is valid or not.
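
To make the supplementary point concrete, a small hedged sketch of adding a 
token under its service name, so that two tokens for the same service cannot 
coexist in one Credentials object (illustrative; not the actual change in the 
linked patch):

{code:java}
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch only: keying the token by its service makes a later token for the same
// service overwrite the earlier one instead of silently coexisting with it.
public class TokenAliasSketch {
  static void addByService(Credentials credentials, Token<? extends TokenIdentifier> token) {
    credentials.addToken(token.getService(), token);
  }
}
{code}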



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11666) NullPointerException in TestSLSRunner.testSimulatorRunning

2024-03-19 Thread Elen Chatikyan (Jira)
Elen Chatikyan created YARN-11666:
-

 Summary: NullPointerException in TestSLSRunner.testSimulatorRunning
 Key: YARN-11666
 URL: https://issues.apache.org/jira/browse/YARN-11666
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: {*}Operating System{*}: macOS (Sanoma 14.2.1 (23C71))

{*}Hardware{*}: MacBook Air 2023

{*}IDE{*}: IntelliJ IDEA (2023.3.2 (Ultimate Edition))

{*}Java Version{*}: OpenJDK version "1.8.0_292"
Reporter: Elen Chatikyan


*What happened:* 

In the *TestSLSRunner* class of the Apache Hadoop YARN SLS (Simulated Load 
Scheduler) framework, a *NullPointerException* is thrown during the teardown 
process of parameterized tests. This exception is thrown when the stop method 
is called on the ResourceManager (rm) object in {_}RMRunner.java{_}. This issue 
occurs under test conditions that involve mismatches between trace types 
(RUMEN, SLS, SYNTH) and their corresponding trace files, leading to scenarios 
where the rm object may not be properly initialized before the stop method is 
invoked.

 

 

*Buggy code:*

The issue is located in the *{{RMRunner.java}}* file within the *{{stop}}* 
method:
{code:java}
public void stop() {
  rm.stop();
}

{code}
The root cause of the *{{NullPointerException}}* is the lack of a null check 
for the {{rm}} object before calling its {{stop}} method. Under any condition 
where the *{{ResourceManager}}* fails to initialize correctly, attempting to 
stop the *{{ResourceManager}}* leads to a null pointer dereference.

 

After fixing {*}RMRunner.java{*}, TaskRunner should also be fixed.

+TaskRunner.java+
{code:java}
public void stop() throws InterruptedException {
  executor.shutdownNow();
  executor.awaitTermination(20, TimeUnit.SECONDS);
}

{code}
 

{*}How to trigger this bug:{*}
 * Change the parameterized unit test's (TestSLSRunner.java) data method to 
include one or both of the following test cases:
 * {capScheduler, "SYNTH", rumenTraceFile, nodeFile }
 * {capScheduler, "SYNTH", slsTraceFile, nodeFile }

 * Execute the *TestSLSRunner* test suite, particularly the 
*testSimulatorRunning* method.
 * Observe the resulting *NullPointerException* in the test output(triggered in 
RMRunner.java).

 
{panel:title=Example stack trace from the test output:}
[ERROR] testSimulatorRunning[Testing with: SYNTH, 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler,
 (nodeFile null)](org.apache.hadoop.yarn.sls.TestSLSRunner) Time elapsed: 3.027 
s <<< ERROR!
java.lang.NullPointerException
at org.apache.hadoop.yarn.sls.RMRunner.stop(RMRunner.java:127)
at org.apache.hadoop.yarn.sls.SLSRunner.stop(SLSRunner.java:320)
at 
org.apache.hadoop.yarn.sls.BaseSLSRunnerTest.tearDown(BaseSLSRunnerTest.java:68)
...
{panel}
 

 
___

_{color:#172b4d}The bug can be fixed by implementing a null check for the 
{{rm}} object within the *{{RMRunner.java}}* {{stop}} method before calling any 
methods on it (same for the executor object in TaskRunner.java).{color}_
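
A minimal sketch of that null check, assuming the fields keep the names shown above:

{code:java}
// RMRunner.java: sketch only, skip stopping an RM that was never initialized
public void stop() {
  if (rm != null) {
    rm.stop();
  }
}

// TaskRunner.java: same idea for the executor
public void stop() throws InterruptedException {
  if (executor != null) {
    executor.shutdownNow();
    executor.awaitTermination(20, TimeUnit.SECONDS);
  }
}
{code}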



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11665) hive jobs support aggregating logs according to real users

2024-03-17 Thread zeekling (Jira)
zeekling created YARN-11665:
---

 Summary: hive jobs support aggregating logs according to real users
 Key: YARN-11665
 URL: https://issues.apache.org/jira/browse/YARN-11665
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Reporter: zeekling


Currently, Hive job logs are in /tmp/logs/hive/bucket/appId. Can we aggregate 
logs by the real users running Hive jobs, like /tmp/logs/hive/\{real 
user}/bucket/appId?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11664) Remove HDFS Binaries/Jars Dependency From Yarn

2024-03-16 Thread Syed Shameerur Rahman (Jira)
Syed Shameerur Rahman created YARN-11664:


 Summary: Remove HDFS Binaries/Jars Dependency From Yarn
 Key: YARN-11664
 URL: https://issues.apache.org/jira/browse/YARN-11664
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Syed Shameerur Rahman


In principle Hadoop Yarn is independent of HDFS and can work with any 
filesystem. Currently, however, some Yarn code depends on HDFS, and this 
dependency requires Yarn to bring some of the HDFS binaries/jars onto its class 
path. The idea behind this jira is to remove this dependency so that Yarn can 
run without the HDFS binaries/jars.

*Scope*
1. Non-test classes are considered
2. Some test classes which come in as a transitive dependency are considered


*Out of scope*
1. All test classes in the Yarn module are not considered

 




A quick search in the Yarn module revealed the following HDFS dependencies:


1. Constants
{code:java}
import 
org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier;
import org.apache.hadoop.hdfs.DFSConfigKeys;{code}
 

 
2. Exception


{code:java}
import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;
import org.apache.hadoop.hdfs.protocol.QuotaExceededException;  (Comes as a 
transitive dependency from DSQuotaExceededException){code}
 

3. Utility
{code:java}
import org.apache.hadoop.hdfs.protocol.datatransfer.IOStreamPair;{code}
 

Both Yarn and HDFS depend on the hadoop-common module. One straightforward 
approach is to move all these dependencies to the hadoop-common module so that 
both HDFS and Yarn can pick up these imports.
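
As a rough illustration of that approach (the package and class below are 
hypothetical placeholders in hadoop-common, not existing classes):

{code:java}
// Hypothetical sketch: once such a class lives in hadoop-common, YARN code can
// drop its org.apache.hadoop.hdfs.DFSConfigKeys import and use the shared key,
// while hadoop-hdfs keeps referring to the same value.
package org.apache.hadoop.fs.common; // made-up package for the example

public final class HdfsCommonConstants {
  public static final String DFS_PERMISSIONS_ENABLED_KEY = "dfs.permissions.enabled";

  private HdfsCommonConstants() {
  }
}
{code}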



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression

2024-03-14 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11660.
---
Resolution: Fixed

> SingleConstraintAppPlacementAllocator performance regression
> 
>
> Key: YARN-11660
> URL: https://issues.apache.org/jira/browse/YARN-11660
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 3.4.1
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11663) Router cache expansion issue

2024-03-14 Thread Yuan Luo (Jira)
Yuan Luo created YARN-11663:
---

 Summary: Router cache expansion issue
 Key: YARN-11663
 URL: https://issues.apache.org/jira/browse/YARN-11663
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.6
Reporter: Yuan Luo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11661) Adding new property to configure the "SameSite" cookie attribute on YARN UI

2024-03-14 Thread Susheel Gupta (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susheel Gupta resolved YARN-11661.
--
Hadoop Flags: Reviewed
  Resolution: Workaround

Closing this as workaround exists.

> Adding new property to configure the "SameSite" cookie attribute on YARN UI 
> 
>
> Key: YARN-11661
> URL: https://issues.apache.org/jira/browse/YARN-11661
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Susheel Gupta
>Assignee: Susheel Gupta
>Priority: Major
>
> If we use 'SameSite=Strict,' the browser would only send the cookie for 
> same-site requests, rendering cross-site sessions ineffective.
> However, it’s worth noting that while using SameSite=None with TLS does 
> enhance the security of your cookies compared to using it without TLS, it 
> doesn’t provide complete security. Nevertheless, considering the necessity 
> for cross-site sessions, utilizing SameSite=None along with TLS can provide a 
> reasonable level of security.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11662) RM Web API endpoint queue reference differs from JMX endpoint for CS

2024-03-13 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11662:
-

 Summary: RM Web API endpoint queue reference differs from JMX 
endpoint for CS
 Key: YARN-11662
 URL: https://issues.apache.org/jira/browse/YARN-11662
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Ferenc Erdelyi


When a placement is not successful (because of the lack of a placement rule or 
an unsuccessful placement), the application is placed in the default queue 
instead of root.default. The parent queue won't be defined when there is no 
placement rule. This causes an inconsistency between the JMX endpoint 
(reporting that the app runs under root.default) and the RM Web API endpoint 
(reporting that the app runs under the default queue).

Similarly, when we submit an application with an unambiguous leaf queue 
specified, the RM Web API endpoint will report the queue as the leaf queue name 
instead of the full queue path. However, the full queue path is the expected 
value to be consistent with the JMX endpoint.

I propose using the scheduler's getQueueInfo in the RMAppManager to parse the 
queue name and get the full queue path for the placementQueueName, which fixes 
the above issue.
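
A rough sketch of the intent, shown here via the CS queue object rather than 
getQueueInfo purely for brevity (the surrounding wiring in RMAppManager is 
omitted):

{code:java}
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueue;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;

// Sketch only: resolve a short leaf-queue name to its full path before storing
// it as the placement queue, so the REST endpoint matches the JMX endpoint.
public class QueuePathResolverSketch {
  static String resolveQueuePath(CapacityScheduler scheduler, String queueName) {
    CSQueue queue = scheduler.getQueue(queueName);
    // fall back to the submitted name if the queue cannot be resolved
    return queue != null ? queue.getQueuePath() : queueName;
  }
}
{code}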



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11661) Adding new property to configure the "SameSite" cookie attribute on YARN UI

2024-03-13 Thread Susheel Gupta (Jira)
Susheel Gupta created YARN-11661:


 Summary: Adding new property to configure the "SameSite" cookie 
attribute on YARN UI 
 Key: YARN-11661
 URL: https://issues.apache.org/jira/browse/YARN-11661
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Susheel Gupta


If we use 'SameSite=Strict,' the browser would only send the cookie for 
same-site requests, rendering cross-site sessions ineffective.
However, it’s worth noting that while using SameSite=None with TLS does enhance 
the security of your cookies compared to using it without TLS, it doesn’t 
provide complete security. Nevertheless, considering the necessity for 
cross-site sessions, utilizing SameSite=None along with TLS can provide a 
reasonable level of security.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression

2024-03-12 Thread Junfan Zhang (Jira)
Junfan Zhang created YARN-11660:
---

 Summary: SingleConstraintAppPlacementAllocator performance 
regression
 Key: YARN-11660
 URL: https://issues.apache.org/jira/browse/YARN-11660
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junfan Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11659) app submission fast fail with node label when node label is disable

2024-03-12 Thread Junfan Zhang (Jira)
Junfan Zhang created YARN-11659:
---

 Summary: app submission fast fail with node label when node label 
is disable
 Key: YARN-11659
 URL: https://issues.apache.org/jira/browse/YARN-11659
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Junfan Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11658) ATS to make minimum HBase version 2.x

2024-03-05 Thread Steve Loughran (Jira)
Steve Loughran created YARN-11658:
-

 Summary: ATS to make minimum HBase version 2.x
 Key: YARN-11658
 URL: https://issues.apache.org/jira/browse/YARN-11658
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineserver
Affects Versions: 3.4.0
Reporter: Steve Loughran


following on from YARN-11657, what if we cut hbase 1.x support from ATS 
*entirely*?

YARN-3 implies that the 2.x version might need to be bumped up





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11657) Remove protobuf-2.5 as dependency of hadoop-yarn-api

2024-03-05 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved YARN-11657.
---
Fix Version/s: 3.3.9
   3.4.1
   3.5.0
   Resolution: Fixed

> Remove protobuf-2.5 as dependency of hadoop-yarn-api
> 
>
> Key: YARN-11657
> URL: https://issues.apache.org/jira/browse/YARN-11657
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api
>Affects Versions: 3.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.9, 3.4.1, 3.5.0
>
>
> hadoop-yarn-api is still exporting protobuf-2.5. 
> if we can cut this, we should
> {code}
>  [echo] [INFO] +- 
> org.apache.hadoop:hadoop-yarn-server-common:jar:3.4.0:compile
>  [echo] [INFO] |  +- org.apache.hadoop:hadoop-yarn-api:jar:3.4.0:compile
>  [echo] [INFO] |  |  +- 
> (org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.2.0:compile - omitted 
> for duplicate)
>  [echo] [INFO] |  |  +- (javax.xml.bind:jaxb-api:jar:2.2.11:compile - 
> omitted for duplicate)
>  [echo] [INFO] |  |  +- 
> (org.apache.hadoop:hadoop-annotations:jar:3.4.0:compile - omitted for 
> duplicate)
>  [echo] [INFO] |  |  +- 
> com.google.protobuf:protobuf-java:jar:2.5.0:compile
>  [echo] [INFO] |  |  +- 
> (org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_21:jar:1.2.0:compile - 
> omitted for duplicate)
>  [echo] [INFO] |  |  \- 
> (com.fasterxml.jackson.core:jackson-annotations:jar:2.12.7:compile - omitted 
> for duplicate)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11657) Remove protobuf-2.5 as dependency of hadoop-yarn-api

2024-02-22 Thread Steve Loughran (Jira)
Steve Loughran created YARN-11657:
-

 Summary: Remove protobuf-2.5 as dependency of hadoop-yarn-api
 Key: YARN-11657
 URL: https://issues.apache.org/jira/browse/YARN-11657
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api
Affects Versions: 3.4.0
Reporter: Steve Loughran
Assignee: Steve Loughran


hadoop-yarn-api is still exporting protobuf-2.5. 
if we can cut this, we should

{code}
 [echo] [INFO] +- 
org.apache.hadoop:hadoop-yarn-server-common:jar:3.4.0:compile
 [echo] [INFO] |  +- org.apache.hadoop:hadoop-yarn-api:jar:3.4.0:compile
 [echo] [INFO] |  |  +- 
(org.apache.hadoop.thirdparty:hadoop-shaded-guava:jar:1.2.0:compile - omitted 
for duplicate)
 [echo] [INFO] |  |  +- (javax.xml.bind:jaxb-api:jar:2.2.11:compile - 
omitted for duplicate)
 [echo] [INFO] |  |  +- 
(org.apache.hadoop:hadoop-annotations:jar:3.4.0:compile - omitted for duplicate)
 [echo] [INFO] |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
 [echo] [INFO] |  |  +- 
(org.apache.hadoop.thirdparty:hadoop-shaded-protobuf_3_21:jar:1.2.0:compile - 
omitted for duplicate)
 [echo] [INFO] |  |  \- 
(com.fasterxml.jackson.core:jackson-annotations:jar:2.12.7:compile - omitted 
for duplicate)

{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11656) RMStateStore event queue blocked

2024-02-21 Thread Bence Kosztolnik (Jira)
Bence Kosztolnik created YARN-11656:
---

 Summary: RMStateStore event queue blocked
 Key: YARN-11656
 URL: https://issues.apache.org/jira/browse/YARN-11656
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.1
Reporter: Bence Kosztolnik
 Attachments: issue.png

I observed that the Yarn cluster had both pending and available resources, yet 
cluster utilization was usually around ~50%. The cluster was loaded with 200 
parallel PI example jobs (from hadoop-mapreduce-examples), each configured with 
20 map and 20 reduce containers, on a 50-node cluster where each node had 8 
cores and a lot of memory (CPU was the bottleneck resource).
Finally, I realized the RM had an IO bottleneck and needed 1~20 seconds to 
persist an RMStateStoreEvent (using FileSystemRMStateStore).

To reduce the impact of the issue:
- create a dispatcher where events can be persisted in parallel threads (a rough 
sketch is given below)
- create metric data for the RMStateStore event queue to make it easy to 
identify the problem if it occurs on a cluster
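
A rough sketch of the parallel-persist idea under assumed names (the persister 
interface stands in for the real blocking store call; the real change would live 
inside the RM dispatcher machinery):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: hand persistence off to a small thread pool so one slow store
// call does not serialize every RMStateStore event, and track the queue depth
// so the bottleneck becomes visible as a metric.
public class ParallelStateStoreDispatcherSketch {
  private final ExecutorService pool = Executors.newFixedThreadPool(4); // assumed pool size
  private final AtomicInteger pendingEvents = new AtomicInteger();      // would back a metrics gauge

  interface EventPersister { void persist(Object event) throws Exception; } // stand-in for the store call

  void dispatch(Object event, EventPersister persister) {
    pendingEvents.incrementAndGet();
    pool.submit(() -> {
      try {
        persister.persist(event);
      } catch (Exception e) {
        // real code would route this to the RM fatal-event handling
      } finally {
        pendingEvents.decrementAndGet();
      }
    });
  }
}
{code}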


{panel:title=Issue visible on UI2}

{panel}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11655) modify default value of Allocated GPUs and Reserved GPUs in yarn scheduler webui from -1 to 0

2024-02-19 Thread wangzhongwei (Jira)
wangzhongwei created YARN-11655:
---

 Summary: modify default value of  Allocated GPUs and Reserved GPUs 
in yarn scheduler webui from -1 to 0
 Key: YARN-11655
 URL: https://issues.apache.org/jira/browse/YARN-11655
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn-common
Affects Versions: 3.3.3
Reporter: wangzhongwei
Assignee: wangzhongwei
 Attachments: image-2024-02-20-15-15-34-996.png

In the yarn scheduler web UI, it may be better to set the value of Allocated 
GPUs and Reserved GPUs to 0 by default. When GPUs are not used, these values 
should be 0.
!image-2024-02-20-15-15-34-996.png|width=486,height=235!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11654) [JDK17] TestLinuxContainerExecutorWithMocks.testStartLocalizer fails

2024-02-04 Thread Bilwa S T (Jira)
Bilwa S T created YARN-11654:


 Summary: [JDK17] 
TestLinuxContainerExecutorWithMocks.testStartLocalizer fails
 Key: YARN-11654
 URL: https://issues.apache.org/jira/browse/YARN-11654
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Bilwa S T
Assignee: Bilwa S T


Expected size:<26> but was:<28> in:
<["nobody",
"test",
"0",
"application_0",
"12345",
"/bin/nmPrivateCTokensPath",

"/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/tmp/nm-local-dir",
"src/test/resources",
"/usr/lib/jvm/jdk-17.0.9/bin/java",
"-classpath",

"/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/test-classes:/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/classes:/home/mwapp/.m2/repository/org/apache/hadoop/hadoop-common/3.3.6-13/hadoop-common-3.3.6-13.jar:/home/mwapp/.m2/repository/org/apache/hadoop/thirdparty/hadoop-shaded-protobuf_3_7/1.1.1/hadoop-shaded-protobuf_3_7-1.1.1.jar:/home/mwapp/.m2/repository/com/google/guava/guava/32.0.1-jre/guava-32.0.1-jre.jar:/home/mwapp/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar:/home/mwapp/.m2/repository/com/google/guava/listenablefuture/.0-empty-to-avoid-conflict-with-guava/listenablefuture-.0-empty-to-avoid-conflict-with-guava.jar:/home/mwapp/.m2/repository/org/checkerframework/checker-qual/3.33.0/checker-qual-3.33.0.jar:/home/mwapp/.m2/repository/com/google/j2objc/j2objc-annotations/2.8/j2objc-annotations-2.8.jar:/home/mwapp/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar:/home/mwapp/.m2/repository/org/apache/httpcomponents/httpclient/4.5.13/httpclient-4.5.13.jar:/home/mwapp/.m2/repository/org/apache/httpcomponents/httpcore/4.4.13/httpcore-4.4.13.jar:/home/mwapp/.m2/repository/commons-io/commons-io/2.8.0/commons-io-2.8.0.jar:/home/mwapp/.m2/repository/commons-net/commons-net/3.9.0/commons-net-3.9.0.jar:/home/mwapp/.m2/repository/commons-collections/commons-collections/3.2.2/commons-collections-3.2.2.jar:/home/mwapp/.m2/repository/jakarta/activation/jakarta.activation-api/1.2.1/jakarta.activation-api-1.2.1.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-server/9.4.53.v20231009/jetty-server-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-http/9.4.53.v20231009/jetty-http-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-io/9.4.53.v20231009/jetty-io-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-servlet/9.4.53.v20231009/jetty-servlet-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-security/9.4.53.v20231009/jetty-security-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-util-ajax/9.4.53.v20231009/jetty-util-ajax-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-webapp/9.4.53.v20231009/jetty-webapp-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/org/eclipse/jetty/jetty-xml/9.4.53.v20231009/jetty-xml-9.4.53.v20231009.jar:/home/mwapp/.m2/repository/javax/servlet/jsp/jsp-api/2.1/jsp-api-2.1.jar:/home/mwapp/.m2/repository/com/sun/jersey/jersey-servlet/1.19.4/jersey-servlet-1.19.4.jar:/home/mwapp/.m2/repository/com/sun/jersey/jersey-server/1.19.4/jersey-server-1.19.4.jar:/home/mwapp/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar:/home/mwapp/.m2/repository/ch/qos/reload4j/reload4j/1.2.22/reload4j-1.2.22.jar:/home/mwapp/.m2/repository/commons-beanutils/commons-beanutils/1.9.4/commons-beanutils-1.9.4.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-configuration2/2.8.0/commons-configuration2-2.8.0.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-lang3/3.12.0/commons-lang3-3.12.0.jar:/home/mwapp/.m2/repository/org/apache/commons/commons-text/1.10.0/commons-text-1.10.0.jar:/home/mwapp/.m2/repository/org/apache/avro/avro/1.7.7/avro-1.7.7.jar:/home/mwapp/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/home/mwapp/.m2/repository/
org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/home/mwapp/.m2/repository/com/thoughtworks/paranamer/paranamer/2.3/paranamer-2.3.jar:/home/mwapp/.m2/repository/com/google/re2j/re2j/1.1/re2j-1.1.jar:/home/mwapp/.m2/repository/com/google/code/gson/gson/2.9.0/gson-2.9.0.jar:/home/mwapp/.m2/repository/org/apache/hadoop/hadoop-auth/3.3.6-13/hadoop-auth-3.3.6-13.jar:/home/mwapp/.m2/repository/com/nimbusds/nimbus-jose-jwt/9.8.1/nimbus-jose-jwt-9.8.1.jar:/home/mwapp/.m2/repository/com/github/stephenc/jcip/jcip-anno

[jira] [Resolved] (YARN-11362) Fix several typos in YARN codebase of misspelled resource

2024-02-03 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11362.
---
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Fix several typos in YARN codebase of misspelled resource
> -
>
> Key: YARN-11362
> URL: https://issues.apache.org/jira/browse/YARN-11362
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: newbie, newbie++, pull-request-available
> Fix For: 3.5.0
>
>
> I noticed that in YARN's codebase, there are several occurrences of 
> "resource" misspelled as "Resoure".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11653) Add Totoal_Memory and Total_Vcores columns in Nodes page

2024-01-30 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11653.
---
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Add Totoal_Memory and Total_Vcores columns in Nodes page
> 
>
> Key: YARN-11653
> URL: https://issues.apache.org/jira/browse/YARN-11653
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Currently, the RM nodes page includes used and available memory/vcores, but 
> it lacks a total column, which is not intuitive enough for users. When the 
> resource capacities of nodes in the cluster vary widely, we may need to sort 
> the nodes to facilitate the comparison of metrics among the same types of 
> nodes. Therefore, it is necessary to add columns for total CPU/memory on the 
> nodes page.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11650) Refactoring variable names related multiNodePolicy in MultiNodePolicySpec, FiCaSchedulerApp and AbstractCSQueue

2024-01-29 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11650.
---
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> Refactoring variable names related multiNodePolicy in MultiNodePolicySpec, 
> FiCaSchedulerApp and AbstractCSQueue
> ---
>
> Key: YARN-11650
> URL: https://issues.apache.org/jira/browse/YARN-11650
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: RM
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In classes related to MultiNodePolicy support, some variable names do not 
> accurately reflect their true meanings. For instance, a variable named 
> *queue* actually represents the class name of a policy, and a variable 
> named *policyName* denotes the class name of the policy. This may cause 
> confusion for readers of the code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11138) TestRouterWebServicesREST Junit Test Error Fix

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11138.
---
Hadoop Flags:   (was: Reviewed)
  Resolution: Duplicate

> TestRouterWebServicesREST Junit Test Error Fix
> --
>
> Key: YARN-11138
> URL: https://issues.apache.org/jira/browse/YARN-11138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, test
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>
> [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 28.818 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
> [ERROR] org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST 
>  Time elapsed: 28.817 s  <<< FAILURE!
> java.lang.AssertionError: Web app not running
> at org.junit.Assert.fail(Assert.java:89)
> at 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.waitWebAppRunning(TestRouterWebServicesREST.java:199)
> at 
> org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.setUp(TestRouterWebServicesREST.java:217)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.RunBefores.invokeMethod(RunBefores.java:33)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10889) [Umbrella] Queue Creation in Capacity Scheduler - Tech debts

2024-01-26 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-10889.
--
   Fix Version/s: 3.4.0
Target Version/s: 3.4.0
  Resolution: Fixed

> [Umbrella] Queue Creation in Capacity Scheduler - Tech debts
> 
>
> Key: YARN-10889
> URL: https://issues.apache.org/jira/browse/YARN-10889
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
>
> Follow-up of YARN-10496



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11041) Replace all occurences of queuePath with the new QueuePath class - followup

2024-01-26 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-11041.
--
Resolution: Fixed

> Replace all occurences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> The QueuePath class was introduced in YARN-10897, however, its current 
> adoption happened only for code changes after this JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes were introduced via ticket YARN-10982. The replacement should 
> be continued by addressing the following comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [ |https://github.com/9uapaw] [~gandras] [ 
> \|https://github.com/9uapaw] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|
> |

[jira] [Created] (YARN-11653) Add Totoal_Memory and Total_Vcores columns in Nodes page

2024-01-25 Thread Jiandan Yang (Jira)
Jiandan Yang  created YARN-11653:


 Summary: Add Totoal_Memory and Total_Vcores columns in Nodes page
 Key: YARN-11653
 URL: https://issues.apache.org/jira/browse/YARN-11653
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jiandan Yang 
Assignee: Jiandan Yang 


Currently, the RM nodes page includes used and available memory/vcore, but it 
lacks a total column, which is not intuitive enough for users. When the 
resource capacities of nodes in the cluster vary widely, we may need to sort 
the nodes to facilitate the comparison of metrics among the same types of nodes. 
Therefore, it is necessary to add columns for total CPU/memory on the nodes 
page.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10888) [Umbrella] New capacity modes for CS

2024-01-25 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-10888.
--
Resolution: Fixed

> [Umbrella] New capacity modes for CS
> 
>
> Key: YARN-10888
> URL: https://issues.apache.org/jira/browse/YARN-10888
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: capacity_scheduler_queue_capacity.pdf
>
>
> *Investigate how resource allocation configuration could be more consistent 
> in CapacityScheduler*
> It would be nice if every place where a capacity can be defined supported the 
> same ways of defining it:
>  * With fixed amounts (e.g. 1 GB memory, 8 vcores, 3 GPU)
>  * With percentages
>  ** Percentage of all resources (e.g. 10% of all memory, vcores, GPU)
>  ** Percentage per resource type (e.g. 10% memory, 25% vcores, 50% GPU)
>  * Allow mixing different modes under one hierarchy, but not under the same 
> parent queue.
> We need to determine all configuration options where capacities can be 
> defined, and see whether it is possible to extend the configuration and 
> whether it makes sense in each case.
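For illustration only, the modes listed above roughly map to the following kinds of
CapacityScheduler settings; the bracketed capacity-vector syntax is the one targeted
by the follow-up work (YARN-10889), and the exact form of the per-resource percentage
entries is an assumption here:

{noformat}
# fixed amounts (absolute resources)
yarn.scheduler.capacity.root.a.capacity=[memory=1024, vcores=8, yarn.io/gpu=3]
# percentage of all resources
yarn.scheduler.capacity.root.b.capacity=10
# percentage per resource type
yarn.scheduler.capacity.root.c.capacity=[memory=10%, vcores=25%, yarn.io/gpu=50%]
{noformat}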



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11652) [Umbrella] Follow-up after YARN-10888/YARN-10889

2024-01-25 Thread Benjamin Teke (Jira)
Benjamin Teke created YARN-11652:


 Summary: [Umbrella] Follow-up after YARN-10888/YARN-10889
 Key: YARN-11652
 URL: https://issues.apache.org/jira/browse/YARN-11652
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.5.0
Reporter: Benjamin Teke
Assignee: Benjamin Teke


Follow-up improvements after the changes in YARN-10888/YARN-10889.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11651) Fix UT TestQueueCapacityConfigParser compile error

2024-01-24 Thread Jiandan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  resolved YARN-11651.
--
Resolution: Invalid

> Fix UT TestQueueCapacityConfigParser compile error
> --
>
> Key: YARN-11651
> URL: https://issues.apache.org/jira/browse/YARN-11651
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>  Labels: pull-request-available
>
> The following error is reported during compilation:
> {code:java}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.10.1:testCompile 
> (default-testCompile) on project hadoop-yarn-server-resourcemanager: 
> Compilation failure
> [ERROR] 
> /home/jenkins/jenkins-home/workspace/hadoop-multibranch_PR-6490/ubuntu-focal/src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestQueueCapacityConfigParser.java:[224,80]
>  incompatible types: java.lang.String cannot be converted to 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueuePath
> [ERROR] -> [Help 1]
> {code}
> This is caused by YARN-11041.
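For context, the fix on the test side amounts to wrapping the queue path string in a
QueuePath object. A minimal sketch follows, assuming the QueuePath(String) constructor
from YARN-10897 and that the parser method under test now takes a QueuePath; the
variable names are illustrative:

{code:java}
// before YARN-11041 the call site compiled with a plain String queue path:
//   capacityConfigParser.parse(capacityString, "root.test");
// after YARN-11041 the same call site needs a QueuePath instance:
QueuePath queuePath = new QueuePath("root.test");
QueueCapacityVector vector = capacityConfigParser.parse(capacityString, queuePath);
{code}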



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11651) Fix UT TestQueueCapacityConfigParser

2024-01-24 Thread Jiandan Yang (Jira)
Jiandan Yang  created YARN-11651:


 Summary: Fix UT TestQueueCapacityConfigParser
 Key: YARN-11651
 URL: https://issues.apache.org/jira/browse/YARN-11651
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jiandan Yang 






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11041) Replace all occurences of queuePath with the new QueuePath class - followup

2024-01-24 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-11041.
--
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Replace all occurences of queuePath with the new QueuePath class - followup
> ---
>
> Key: YARN-11041
> URL: https://issues.apache.org/jira/browse/YARN-11041
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Tibor Kovács
>Assignee: Peter Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> The QueuePath class was introduced in YARN-10897, however, its current 
> adoption happened only for code changes after this JIRA. We need to adopt it 
> retrospectively.
>  
> A lot of changes are introduced via ticket YARN-10982. The replacing should 
> be continued by touching the next comments:
>  
> [...g/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AutoCreatedQueueTemplate.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-fde6885144b59bb06b2c3358780388d958829b13f68aceee7bb6d394bb5e0548]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765012937]
> I think this could be also refactored in a follow-up jira so the string magic 
> could probably be replaced with some more elegant solution. Though, I think 
> this would be too much in this patch, hence I do suggest the follow-up jira.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765013096]
> [~bteke] [ |https://github.com/9uapaw] [~gandras] [ 
> \|https://github.com/9uapaw] Thoughts?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765110750]
> +1, even the QueuePath object could have some kind of support for this.|
> |[~gandras] [https://github.com/apache/hadoop/pull/3660#discussion_r765131244]
> Agreed, let's handle it in a followup!|
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765023717]
> There are many string operations in this class:
> E.g. * getQueuePrefix that works with the full queue path
>  * getNodeLabelPrefix that also works with the full queue path|
> I suggest to create a static class, called "QueuePrefixes" or something like 
> that and add some static methods there to convert the QueuePath object to 
> those various queue prefix strings that are ultimately keys in the 
> Configuration object.
>  
> 
>  
> [...he/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java|https://github.com/apache/hadoop/pull/3660/files/f956918bc154d0e35fce07c5dd8be804eb007acc#diff-c4b0c5e70208f1e3cfbd5a86ffa2393e5c996cc8b45605d9d41abcb7e0bd382a]
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765026119]
> This seems hacky, just based on the constructor parameter names of QueuePath: 
> parent, leaf.
> The AQC Template prefix is not the leaf, obviously.
> Could we somehow circumvent this?|
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765126207]
> Maybe a factory method could be created, which returns a new QueuePath with 
> the parent set as the original queuePath. I.e 
> rootQueuePath.createChild(String childName) -> this could return a new 
> QueuePath object with root.childName path, and rootQueuePath as parent.|
> |[~snemeth] [https://github.com/apache/hadoop/pull/3660#discussion_r765039033]
> Looking at this getQueues method, I realized almost all the callers are using 
> some kind of string magic that should be addressed with this patch.
> For example, take a look at: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider#addQueue
> I think getQueues should also receive the QueuePath object instead of 
> Strings.|
>  
> 
>  
> [.../src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java|https://github.com/apache/hadoop/pull/3660/files/0c3dd17c936260fc9c386dcabc6368b54b27aa82..39f4ec203377244f840e4593aa02386ff51cc3c4#diff-0adf8192c51cbe4671324f06f7f8cbd48898df0376bbcc516451a3bdb2b48d3b]
> |[~bteke] [https://github.com/apache/hadoop/pull/3660#discussion_r765912967]
> Nit: Gets the queue path object.
> The object of the queue suggests a CSQueue object.|

[jira] [Resolved] (YARN-11645) Fix flaky json assert tests in TestRMWebServices

2024-01-24 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke resolved YARN-11645.
--
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix flaky json assert tests in TestRMWebServices
> 
>
> Key: YARN-11645
> URL: https://issues.apache.org/jira/browse/YARN-11645
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.5.0
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> TestRMWebServicesCapacitySchedDynamicConfig and 
> TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the 
> queue order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11650) refactor policyName to policyClassName

2024-01-23 Thread Jiandan Yang (Jira)
Jiandan Yang  created YARN-11650:


 Summary: refactor policyName to policyClassName
 Key: YARN-11650
 URL: https://issues.apache.org/jira/browse/YARN-11650
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: RM
Reporter: Jiandan Yang 


In classes related to MultiNodePolicy support, some variable names do not 
accurately reflect their true meanings. For instance, a variable named *queue* 
actually represents the class name of a policy, and a variable named 
*policyName* likewise denotes the class name of the policy rather than its name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11649) YARN Federation getNewApplication returns different maxresourcecapability

2024-01-21 Thread Jeffrey Chang (Jira)
Jeffrey Chang created YARN-11649:


 Summary: YARN Federation getNewApplication returns different 
maxresourcecapability
 Key: YARN-11649
 URL: https://issues.apache.org/jira/browse/YARN-11649
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jeffrey Chang
Assignee: Jeffrey Chang


When getNewApplication is called against the YARN Router with Federation on, it is 
possible to get different maxResourceCapabilities on different calls. This is 
because getNewApplication is called against a random cluster on each call, 
which may return a different maxResourceCapability depending on the cluster the 
call is executed on.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11607) TestTimelineAuthFilterForV2 fails intermittently

2024-01-19 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11607.
---
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> TestTimelineAuthFilterForV2 fails intermittently 
> -
>
> Key: YARN-11607
> URL: https://issues.apache.org/jira/browse/YARN-11607
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Susheel Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Ref:
> https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1398/testReport/junit/org.apache.hadoop.yarn.server.timelineservice.security/TestTimelineAuthFilterForV2/testPutTimelineEntities_boolean__boolean__3_/
> {noformat}
> org.opentest4j.AssertionFailedError: expected: <2> but was: <1>
>   at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
>   at 
> org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
>   at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>   at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>   at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:527)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishAndVerifyEntity(TestTimelineAuthFilterForV2.java:324)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.publishWithRetries(TestTimelineAuthFilterForV2.java:337)
>   at 
> org.apache.hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2.testPutTimelineEntities(TestTimelineAuthFilterForV2.java:383)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11648) CapacityScheduler does not activate applications when resources are released from another Leaf Queue

2024-01-18 Thread Brian Goerlitz (Jira)
Brian Goerlitz created YARN-11648:
-

 Summary: CapacityScheduler does not activate applications when 
resources are released from another Leaf Queue
 Key: YARN-11648
 URL: https://issues.apache.org/jira/browse/YARN-11648
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Brian Goerlitz


Create a queue with a low minimum capacity and a high maximum capacity. If multiple 
apps are submitted to the queue such that the queue's max AM resource limit is 
exceeded while other cluster resources are consumed by different queues, these 
apps will not be considered for activation when the cluster resources of the 
other queues are freed. Since the AM limit is calculated based on the resources 
available to the queue, these apps should be activated.
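For illustration, a queue shaped like the description (low guaranteed capacity, high
maximum capacity) could be configured roughly as below in percentage mode; the queue
names and values are made up, only the property names follow the standard
CapacityScheduler configuration:

{noformat}
yarn.scheduler.capacity.root.queues=small,other
yarn.scheduler.capacity.root.small.capacity=1
yarn.scheduler.capacity.root.small.maximum-capacity=100
yarn.scheduler.capacity.root.small.maximum-am-resource-percent=0.1
yarn.scheduler.capacity.root.other.capacity=99
{noformat}

With such a setup the max AM resource limit of root.small is derived from the resources
currently available to the queue, which is why the pending applications should be
re-evaluated for activation once the other queue releases resources.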



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11647) more places to use StandardCharsets

2024-01-17 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11647.
---
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> more places to use StandardCharsets
> ---
>
> Key: YARN-11647
> URL: https://issues.apache.org/jira/browse/YARN-11647
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn
>Reporter: PJ Fanning
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> A few instances missed in HADOOP-18957



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11638) [GPG] GPG Support CLI.

2024-01-16 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11638.
---
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [GPG] GPG Support CLI.
> --
>
> Key: YARN-11638
> URL: https://issues.apache.org/jira/browse/YARN-11638
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> We will add a set of command lines to GPG so that GPG can better refresh the 
> policy and provide some other convenient functions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11646) QueueCapacityConfigParser shouldn't ignore capacity config with 0 memory

2024-01-10 Thread Tamas Domok (Jira)
Tamas Domok created YARN-11646:
--

 Summary: QueueCapacityConfigParser shouldn't ignore capacity 
config with 0 memory
 Key: YARN-11646
 URL: https://issues.apache.org/jira/browse/YARN-11646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.4.0
Reporter: Tamas Domok
Assignee: Tamas Domok


There is no reason to ignore the configured capacity if the memory is 0 in the 
configuration.

It makes it impossible to configure a zero absolute resource capacity.

Example:
{noformat}
root.default.capacity=[memory=0, vcores=0]
root.default.maximum-capacity=[memory=2048, vcores=2]
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11645) Fix flaky json assert tests in TestRMWebServices

2024-01-10 Thread Tamas Domok (Jira)
Tamas Domok created YARN-11645:
--

 Summary: Fix flaky json assert tests in TestRMWebServices
 Key: YARN-11645
 URL: https://issues.apache.org/jira/browse/YARN-11645
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.4.0
Reporter: Tamas Domok
Assignee: Tamas Domok


TestRMWebServicesCapacitySchedDynamicConfig and 
TestRMWebServicesCapacitySchedulerMixedMode are flaky due to changes in the 
queue order.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11644) LogAggregationService can't upload log in time when application finished

2024-01-09 Thread Xie YiFan (Jira)
Xie YiFan created YARN-11644:


 Summary: LogAggregationService can't upload log in time when 
application finished
 Key: YARN-11644
 URL: https://issues.apache.org/jira/browse/YARN-11644
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Reporter: Xie YiFan
Assignee: Xie YiFan
 Attachments: image-2024-01-10-11-03-57-553.png

LogAggregationService is responsible for uploading logs to HDFS. It uses a 
thread pool to execute the upload tasks.

The workflow of uploading logs is as follows:
 # NM constructs the Application object when the first container of a certain 
application launches, then notifies LogAggregationService to init 
AppLogAggregationImpl.
 # LogAggregationService submits AppLogAggregationImpl to the task queue.
 # An idle worker of the thread pool pulls AppLogAggregationImpl from the task queue.
 # AppLogAggregationImpl loops to check the application state and does the upload 
when the application has finished.

Suppose the following scenario:
 * LogAggregationService initializes the thread pool with 4 threads.
 * 4 long-running applications start on this NM, so all threads are occupied by 
their aggregators.
 * The next short application starts on this NM and quickly finishes, but there is 
no idle thread for this app to upload its logs.

As a result, the following applications have to wait for the previous applications 
to finish before uploading their logs.

!image-2024-01-10-11-03-57-553.png|width=599,height=195!
h4. Solution

Change the spin behavior of AppLogAggregationImpl. If the application has not 
finished, just return to yield the current thread and resubmit the task to the 
executor service. This way LogAggregationService can roll through the task queue 
and the logs of finished applications can be uploaded immediately.
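A minimal sketch of the proposed yield-and-resubmit behaviour (class and method names
are illustrative, not the real AppLogAggregatorImpl API; a real implementation would
also add a small delay between resubmissions, e.g. via a scheduled executor, instead
of spinning through the queue):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.function.BooleanSupplier;

// Toy model: each task checks its application once per run and, while the
// application is still running, gives the worker thread back to the pool by
// resubmitting itself instead of looping inside run().
class AppLogUploadTask implements Runnable {
  private final String appId;
  private final BooleanSupplier appFinished;
  private final ExecutorService pool;

  AppLogUploadTask(String appId, BooleanSupplier appFinished, ExecutorService pool) {
    this.appId = appId;
    this.appFinished = appFinished;
    this.pool = pool;
  }

  @Override
  public void run() {
    if (!appFinished.getAsBoolean()) {
      pool.submit(this);   // yield the thread so finished apps are not starved
      return;
    }
    System.out.println("uploading logs of " + appId);   // stands in for the HDFS upload
  }
}
{code}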



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11643) Skip unnecessary pre-check in Multi Node Placement

2024-01-08 Thread Xie YiFan (Jira)
Xie YiFan created YARN-11643:


 Summary: Skip unnecessary pre-check in Multi Node Placement
 Key: YARN-11643
 URL: https://issues.apache.org/jira/browse/YARN-11643
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Reporter: Xie YiFan
Assignee: Xie YiFan


When Multi Node Placement is enabled, RegularContainerAllocator loops to find one 
node from the candidate set to allocate for a given scheduler key. Before 
allocating, a pre-check is called to verify that the current node satisfies the 
constraints. If this node does not pass all checks, it just continues to the next node.
{code:java}
if (reservedContainer == null) {
  result = preCheckForNodeCandidateSet(node,
  schedulingMode, resourceLimits, schedulerKey);
  if (null != result) {
continue;
  }
} {code}
But some checks are related to the scheduler key or the application and return 
PRIORITY_SKIPPED or APP_SKIPPED. This means that if the first node does not pass 
such a check, the following nodes will not pass it either.
If the cluster has 5000 nodes in the default partition, the scheduler wastes 5000 
loop iterations for just one scheduler key.
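A sketch of how the loop body quoted above could short-circuit for results that do not
depend on the node; it assumes the result constants are the ContainerAllocation
PRIORITY_SKIPPED/APP_SKIPPED instances mentioned above and is illustrative, not the
actual patch:

{code:java}
if (reservedContainer == null) {
  result = preCheckForNodeCandidateSet(node,
      schedulingMode, resourceLimits, schedulerKey);
  if (null != result) {
    if (result == ContainerAllocation.PRIORITY_SKIPPED
        || result == ContainerAllocation.APP_SKIPPED) {
      // the skip is not node-specific, so no later candidate can pass either:
      // stop iterating instead of repeating the same check for every node
      break;
    }
    // node-specific skip: try the next candidate node
    continue;
  }
}
{code}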

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11553) Change the time unit of scCleanerIntervalMs in Router

2024-01-06 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11553.
---
Hadoop Flags: Reviewed
Target Version/s: 3.4.0
  Resolution: Fixed

> Change the time unit of scCleanerIntervalMs in Router
> -
>
> Key: YARN-11553
> URL: https://issues.apache.org/jira/browse/YARN-11553
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: router
>Reporter: WangYuanben
>Assignee: WangYuanben
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2023-08-19-16-13-41-956.png
>
>
> The time unit of scCleanerIntervalMs is written as TimeUnit.MINUTES, 
> resulting in overly long cleaning intervals.
> !image-2023-08-19-16-13-41-956.png|width=561,height=82!
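A minimal sketch of what the fix amounts to (variable and method names are illustrative,
not the actual Router code): the interval value already carries an "Ms" suffix, so it
must be scheduled with TimeUnit.MILLISECONDS.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CleanerSchedulingSketch {
  public static void main(String[] args) {
    long scCleanerIntervalMs = 30_000L;   // already expressed in milliseconds
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Passing TimeUnit.MINUTES here would turn a 30-second interval into
    // 30000 minutes; the unit must match the "Ms" suffix of the value.
    scheduler.scheduleAtFixedRate(
        () -> System.out.println("cleaning expired SubCluster entries"),
        0, scCleanerIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}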



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11556) Let Federation.md more standardized

2024-01-06 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11556.
---
   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.0
  Resolution: Fixed

> Let Federation.md more standardized
> ---
>
> Key: YARN-11556
> URL: https://issues.apache.org/jira/browse/YARN-11556
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: documentation
>Reporter: WangYuanben
>Assignee: WangYuanben
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11631) [GPG] Add GPGWebServices

2024-01-06 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11631.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

> [GPG] Add GPGWebServices
> 
>
> Key: YARN-11631
> URL: https://issues.apache.org/jira/browse/YARN-11631
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11642) Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities

2024-01-06 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11642.
---
   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.0
  Resolution: Fixed

> Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities
> --
>
> Key: YARN-11642
> URL: https://issues.apache.org/jira/browse/YARN-11642
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: timelineservice
>Affects Versions: 3.5.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Our current unit tests are all executed in parallel. 
> TestTimelineAuthFilterForV2#testPutTimelineEntities will report an error 
> during execution:
> {code:java}
> [main] collector.PerNodeTimelineCollectorsAuxService 
> (StringUtils.java:startupShutdownMessage(755)) - failed to register any UNIX 
> signal loggers: 
> java.lang.IllegalStateException: Can't re-install the signal handlers.
> {code}
> We can solve this problem by changing static initialization to new Object.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11642) Fix Flaky Test TestTimelineAuthFilterForV2#testPutTimelineEntities

2024-01-05 Thread Shilun Fan (Jira)
Shilun Fan created YARN-11642:
-

 Summary: Fix Flaky Test 
TestTimelineAuthFilterForV2#testPutTimelineEntities
 Key: YARN-11642
 URL: https://issues.apache.org/jira/browse/YARN-11642
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineservice
Affects Versions: 3.5.0
Reporter: Shilun Fan
Assignee: Shilun Fan


Our current unit tests are all executed in parallel. 
TestTimelineAuthFilterForV2#testPutTimelineEntities will report an error during 
execution:

{code:java}
[main] collector.PerNodeTimelineCollectorsAuxService 
(StringUtils.java:startupShutdownMessage(755)) - failed to register any UNIX 
signal loggers: 
java.lang.IllegalStateException: Can't re-install the signal handlers.
{code}

We can solve this problem by changing static initialization to new Object.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11641) Can't update a queue hierarchy in absolute mode when the configured capacities are zero

2024-01-05 Thread Tamas Domok (Jira)
Tamas Domok created YARN-11641:
--

 Summary: Can't update a queue hierarchy in absolute mode when the 
configured capacities are zero
 Key: YARN-11641
 URL: https://issues.apache.org/jira/browse/YARN-11641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 3.4.0
Reporter: Tamas Domok
Assignee: Tamas Domok


h2. Error symptoms

It is not possible to modify a queue hierarchy in absolute mode when the parent 
or every child queue of the parent has 0 min resource configured.

{noformat}
2024-01-05 15:38:59,016 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager:
 Initialized queue: root.a.c
2024-01-05 15:38:59,016 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: Exception 
thrown when modifying configuration.
java.io.IOException: Failed to re-init queues : Parent=root.a: When absolute 
minResource is used, we must make sure both parent and child all use absolute 
minResource
{noformat}

h2. Reproduction

capacity-scheduler.xml
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,a</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.capacity</name>
    <value>[memory=40960, vcores=16]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>[memory=1024, vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>[memory=1024, vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.capacity</name>
    <value>[memory=0, vcores=0]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.maximum-capacity</name>
    <value>[memory=39936, vcores=15]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.queues</name>
    <value>b,c</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.b.capacity</name>
    <value>[memory=0, vcores=0]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.b.maximum-capacity</name>
    <value>[memory=39936, vcores=15]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.c.capacity</name>
    <value>[memory=0, vcores=0]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a.c.maximum-capacity</name>
    <value>[memory=39936, vcores=15]</value>
  </property>
</configuration>
{code}

{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<sched-conf>
  <update-queue>
    <queue-name>root.a</queue-name>
    <params>
      <entry>
        <key>capacity</key>
        <value>[memory=1024,vcores=1]</value>
      </entry>
      <entry>
        <key>maximum-capacity</key>
        <value>[memory=39936,vcores=15]</value>
      </entry>
    </params>
  </update-queue>
</sched-conf>
{code}

{code}
$ curl -X PUT -H 'Content-Type: application/xml' -d @updatequeue.xml 
http://localhost:8088/ws/v1/cluster/scheduler-conf\?user.name\=yarn
Failed to re-init queues : Parent=root.a: When absolute minResource is used, we 
must make sure both parent and child all use absolute minResource
{code}

h2. Root cause

setChildQueues is called during reinit, where:

{code:java}
  void setChildQueues(Collection<CSQueue> childQueues) throws IOException {
writeLock.lock();
try {
  boolean isLegacyQueueMode = 
queueContext.getConfiguration().isLegacyQueueMode();
  if (isLegacyQueueMode) {
QueueCapacityType childrenCapacityType =
getCapacityConfigurationTypeForQueues(childQueues);
QueueCapacityType parentCapacityType =
getCapacityConfigurationTypeForQueues(ImmutableList.of(this));

if (childrenCapacityType == QueueCapacityType.ABSOLUTE_RESOURCE
|| parentCapacityType == QueueCapacityType.ABSOLUTE_RESOURCE) {
  // We don't allow any mixed absolute + {weight, percentage} between
  // children and parent
  if (childrenCapacityType != parentCapacityType && !this.getQueuePath()
  .equals(CapacitySchedulerConfiguration.ROOT)) {
throw new IOException("Parent=" + this.getQueuePath()
+ ": When absolute minResource is used, we must make sure both "
+ "parent and child all use absolute minResource");
  }
{code}

The parent or childrenCapacityType will be considered as PERCENTAGE, because 
getCapacityConfigurationTypeForQueues fails to detect the absolute mode, here:

{code:java}
if (!queue.getQueueResourceQuotas().getConfiguredMinResource(nodeLabel)
.equals(Resources.none())) {
  absoluteMinResSet = true;
{code}

h2. Possible fixes

Possible fix in AbstractParentQueue.getCapacityConfigurationTypeForQueues using 
the capacityVector:
{code:java}
for (CSQueue queue : queues) {
  for (String nodeLabel : queueCapacities.getExistingNodeLabels()) {
Set<QueueCapacityVector.ResourceUnitCapacityType> definedCapacityTypes =

queue.getConfiguredCapacityVector(nodeLabel).getDefinedCapacityTypes();
if (definedCapacityTypes.size() == 1) {
  QueueCapacityVector.ResourceUnitCapacityType next = 
definedCapacityTypes.iterator().next();
  if (Objects.requireNonNull(next) == PERCENTAGE) {
percentageIsSet = true;
diagMsg.append("{Queue=").append(queue.getQueuePath()).append(", 
label=").append(nodeLabel)
.append(" uses percentage mode}. ");
  } else if (next == 
QueueCapacityVector.ResourceUnitCapacityType.ABSOLUTE) {
absoluteMinResSet = true;
 

[jira] [Created] (YARN-11640) capacity scheduler supports application priority with FairOrderingPolicy

2024-01-04 Thread Ming Chen (Jira)
Ming Chen created YARN-11640:


 Summary: capacity scheduler supports application priority with 
FairOrderingPolicy
 Key: YARN-11640
 URL: https://issues.apache.org/jira/browse/YARN-11640
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Ming Chen






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-2098) App priority support in Fair Scheduler

2024-01-04 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-2098.
--
Target Version/s:   (was: 3.5.0)
  Resolution: Done

> App priority support in Fair Scheduler
> --
>
> Key: YARN-2098
> URL: https://issues.apache.org/jira/browse/YARN-2098
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Ashwin Shankar
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-2098.patch, YARN-2098.patch
>
>
> This jira is created for supporting app priorities in fair scheduler. 
> AppSchedulable hard codes priority of apps to 1, we should change this to get 
> priority from ApplicationSubmissionContext.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11529) Add metrics for ContainerMonitorImpl.

2024-01-04 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11529.
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add metrics for ContainerMonitorImpl.
> -
>
> Key: YARN-11529
> URL: https://issues.apache.org/jira/browse/YARN-11529
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.4.0
>Reporter: Xianming Lei
>Assignee: Xianming Lei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In our production environment, we have ample machine resources and a 
> significant number of active Containers. However, the MonitoringThread in 
> ContainerMonitorImpl experiences significant latency during each execution. 
> To address this, it is highly recommended to incorporate metrics for 
> monitoring the duration of this time-consuming process.
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-11639) ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

2024-01-03 Thread Ferenc Erdelyi (Jira)
Ferenc Erdelyi created YARN-11639:
-

 Summary: ConcurrentModificationException and NPE in 
PriorityUtilizationQueueOrderingPolicy
 Key: YARN-11639
 URL: https://issues.apache.org/jira/browse/YARN-11639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Ferenc Erdelyi


When dynamic queue creation is enabled in weight mode and the deletion policy 
coincides with the PriorityQueueResourcesForSorting, the RM stops assigning 
resources because of either a ConcurrentModificationException or an NPE in 
PriorityUtilizationQueueOrderingPolicy.

Reproduced the NPE issue in Java 8 and Java 11 environments:
{code:java}
... INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
 Removing queue: root.dyn.PmvkMgrEBQppu
2024-01-02 17:00:59,399 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
Thread[Thread-11,5,main] threw an Exception.
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
at 
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at 
java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
at 
java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at 
java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at 
java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
{code}

Observed the ConcurrentModificationException in a Java 8 environment, but could 
not reproduce it yet:
{code:java}
2023-10-27 02:50:37,584 ERROR 
org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread Thread[Thread-15,5, 
main] threw an Exception.
java.util.ConcurrentModificationException
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
{code}

The immediate (temporary) remedy to keep the cluster going is to restart the RM.
The workaround is to disable the deletion of dynamically created child queues. 






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr

[jira] [Resolved] (YARN-11632) [Doc] Add allow-partial-result description to Yarn Federation documentation

2024-01-02 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved YARN-11632.
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [Doc] Add allow-partial-result description to Yarn Federation documentation
> ---
>
> Key: YARN-11632
> URL: https://issues.apache.org/jira/browse/YARN-11632
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Add allow-partial-result description to Yarn Federation documentation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org


