[jira] [Comment Edited] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues

2020-12-15 Thread Siddharth Ahuja (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250153#comment-17250153
 ] 

Siddharth Ahuja edited comment on YARN-10528 at 12/16/20, 7:52 AM:
---

I have made the behaviour match that of the {{reservation}} element in the code.
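
For reference, a minimal sketch of the shape of the new check, mirroring the existing {{reservation}} guard in {{AllocationFileQueueParser.loadQueue()}}; the {{isMaxAMShareSet}} flag name is illustrative, and the exception text matches the logs below:

{code:java}
// Where loadQueue() classifies the element as a parent queue (it contains
// child <queue> elements or carries type="parent"), reject maxAMShare the
// same way a reservation element is rejected.
if (isMaxAMShareSet) {
  throw new AllocationConfigurationException("The configuration settings"
      + " for " + queueName + " are invalid. A queue element that contains"
      + " child queue elements or that has the type='parent' attribute"
      + " cannot also include a maxAMShare element.");
}
{code}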

Performed the following testing on a single-node cluster:

Set up the FS allocation XML as follows:

{code}
<?xml version="1.0"?>
<!-- XML element tags below are reconstructed: the mail archive stripped them.
     Values are as posted; queue and rule names other than "root", "default"
     and "users" did not survive and are placeholders. -->
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>*</aclSubmitApps>
        <aclAdministerApps>*</aclAdministerApps>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users" type="parent">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <maxAMShare>0.76</maxAMShare> <!-- root.users is a parent queue
                 with maxAMShare set. This should not be possible. -->
        </queue>
        <queue name="parentA">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="childA">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
        <queue name="parentB">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <queue name="childB">
                <weight>1.0</weight>
                <schedulingPolicy>drf</schedulingPolicy>
            </queue>
        </queue>
    </queue>
    <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
    <queueMaxAMShareDefault>0.75</queueMaxAMShareDefault>
    <queuePlacementPolicy>
        <rule name="specified"/>
        <rule name="nestedUserQueue">
            <rule name="default"/>
        </rule>
    </queuePlacementPolicy>
</allocations>
{code}

Refresh YARN queues and observe the RM logs:

{code}
% bin/yarn rmadmin -refreshQueues
{code}

{code}
2020-12-16 18:12:29,665 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Failed to reload fair scheduler config file - will use existing allocations.
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.lambda$serviceInit$0(AllocationFileLoaderService.java:128)
	at java.lang.Thread.run(Thread.java:748)


2020-12-16 18:15:04,056 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Failed to reload allocations file
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: The configuration settings for root.users are invalid. A queue element that contains child queue elements or that has the type='parent' attribute cannot also include a maxAMShare element.
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:238)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.loadQueue(AllocationFileQueueParser.java:221)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.allocation.AllocationFileQueueParser.parse(AllocationFileQueueParser.java:97)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:257)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.reinitialize(FairScheduler.java:1571)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:438)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:409)
	at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:120)
	at org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:293)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:537)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
{code}

Now, update the FS XML so that {{maxAMShare}} is not set for root.users but is 
set for a parent queue that is not explicitly tagged with {{type="parent"}}:

{code}
<?xml version="1.0"?>
<!-- Tags reconstructed as above; the message is truncated in the archive. -->
<allocations>
    <queue name="root">
        <weight>1.0</weight>
        <schedulingPolicy>drf</schedulingPolicy>
        <aclSubmitApps>*</aclSubmitApps>
        <aclAdministerApps>*</aclAdministerApps>
        <queue name="default">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="users">
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
        </queue>
        <queue name="parentA"> <!-- implicit parent: declares child queues,
             no type="parent" attribute -->
            <weight>1.0</weight>
            <schedulingPolicy>drf</schedulingPolicy>
            <!-- ... remainder truncated ... -->
{code}

[jira] [Updated] (YARN-10528) maxAMShare should only be accepted for leaf queues, not parent queues

2020-12-15 Thread Siddharth Ahuja (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Ahuja updated YARN-10528:
---
Attachment: YARN-10528.001.patch

> maxAMShare should only be accepted for leaf queues, not parent queues
> -
>
> Key: YARN-10528
> URL: https://issues.apache.org/jira/browse/YARN-10528
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: YARN-10528.001.patch, maxAMShare for root.users (parent 
> queue) has no effect as child queue does not inherit it.png
>
>
> Based on the [Hadoop 
> documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html],
>  it is clear that the {{maxAMShare}} property can only be used for *leaf queues*. 
> This is similar to the {{reservation}} setting.
> However, the existing code only ensures that the reservation setting is not 
> accepted for "parent" queues (see 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L226
>  and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/allocation/AllocationFileQueueParser.java#L233);
>  it is missing the equivalent checks for {{maxAMShare}}. Because of this, it is 
> currently possible to have an allocation similar to the one below:
> {code}
> <?xml version="1.0"?>
> <!-- Element tags reconstructed; the archive stripped them. Placement rule
>      names are best-effort. -->
> <allocations>
>     <queue name="root">
>         <weight>1.0</weight>
>         <schedulingPolicy>drf</schedulingPolicy>
>         <aclSubmitApps>*</aclSubmitApps>
>         <aclAdministerApps>*</aclAdministerApps>
>         <queue name="default">
>             <weight>1.0</weight>
>             <schedulingPolicy>drf</schedulingPolicy>
>         </queue>
>         <queue name="users" type="parent">
>             <weight>1.0</weight>
>             <schedulingPolicy>drf</schedulingPolicy>
>             <maxAMShare>1.0</maxAMShare>
>         </queue>
>     </queue>
>     <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
>     <queuePlacementPolicy>
>         <rule name="specified"/>
>         <rule name="nestedUserQueue">
>             <rule name="default"/>
>         </rule>
>     </queuePlacementPolicy>
> </allocations>
> {code}
> where {{maxAMShare}} is 1.0f, meaning it is possible to allocate 100% of the 
> queue's resources to Application Masters. Notice above that root.users is a 
> parent queue; however, it still gladly accepts {{maxAMShare}}. This is 
> contrary to the documentation and, in fact, very misleading, because the 
> child queues under root.users actually do not inherit this setting at all; 
> they go on to use the default of 0.5 instead of 1.0. See the attached 
> screenshot as an example.






[jira] [Updated] (YARN-10007) YARN logs contain environment variables, which is a security risk

2020-12-15 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10007:

Issue Type: New Feature  (was: Bug)

> YARN logs contain environment variables, which is a security risk
> -
>
> Key: YARN-10007
> URL: https://issues.apache.org/jira/browse/YARN-10007
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: john lilley
>Priority: Major
>
> In most environments it is standard practice to relay "secrets" via 
> environment variables when spawning a process, because the alternatives 
> (command-line args or storing in a file) are insecure.  However, in a YARN 
> application, this also appears to be insecure because the environment is 
> logged.  While YARN has the ability to relay delegation tokens in the launch 
> context, it is unclear how to use this facility for generalized "secrets" 
> that may not conform to security-token structure.  
> For example, the RPDM_KEYSTORE_PASSWORDS env var is found in the aggregated 
> YARN logs:
> {{Container: container_e06_1574362398372_0023_01_01 on 
> node6..com_45454}}
> {{LogAggregationType: AGGREGATED}}
> {{}}
> {{LogType:launch_container.sh}}
> {{LogLastModifiedTime:Sat Nov 23 14:58:12 -0700 2019}}
> {{LogLength:4043}}
> {{LogContents:}}
> {{#!/bin/bash}}
> {{set -o pipefail -e}}
> {{[...]export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/2.6.5.1175-1/hadoop-yarn"}}}
> {{export RPDM_KEYSTORE_PASSWORDS="eyJnZW5lcmFsIjoiZmtQZllubmVLRVo4c1Z0V0REQ3gxaHJzRnVjdVN5b1NBTE9OUTF1dEZpZ1x1MDAzZCJ9"}}
>  
>  
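
As an aside (not from this ticket): Hadoop's credential provider API is the usual file-backed alternative to environment variables for this kind of secret. A hedged sketch; the alias and provider path below are made up for illustration.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch: resolve a secret from a permission-protected JCEKS keystore via
// Configuration.getPassword(), so it never appears in launch_container.sh.
// The alias "rpdm.keystore.password" and the provider path are illustrative.
Configuration conf = new Configuration();
conf.set("hadoop.security.credential.provider.path",
    "jceks://hdfs/user/example/secrets.jceks");
char[] password = conf.getPassword("rpdm.keystore.password"); // throws IOException
{code}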






[jira] [Created] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler

2020-12-15 Thread Wangda Tan (Jira)
Wangda Tan created YARN-10535:
-

 Summary: Make changes in queue placement policy to use 
auto-queue-placement API in CapacityScheduler
 Key: YARN-10535
 URL: https://issues.apache.org/jira/browse/YARN-10535
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacity scheduler
Reporter: Wangda Tan


Once YARN-10506 is done, we need to call the API from the queue placement 
policy to create queues. 






[jira] [Commented] (YARN-10535) Make changes in queue placement policy to use auto-queue-placement API in CapacityScheduler

2020-12-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250003#comment-17250003
 ] 

Wangda Tan commented on YARN-10535:
---

cc: [~shuzirra], [~pbacsko]

> Make changes in queue placement policy to use auto-queue-placement API in 
> CapacityScheduler
> ---
>
> Key: YARN-10535
> URL: https://issues.apache.org/jira/browse/YARN-10535
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> Once YARN-10506 is done, we need to call the API from the queue placement 
> policy to create queues. 






[jira] [Assigned] (YARN-10506) Update queue creation logic to use weight mode and allow the flexible static/dynamic creation

2020-12-15 Thread Wangda Tan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-10506:
-

Assignee: Andras Gyori

> Update queue creation logic to use weight mode and allow the flexible 
> static/dynamic creation
> -
>
> Key: YARN-10506
> URL: https://issues.apache.org/jira/browse/YARN-10506
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10506.001.patch
>
>
> The queue creation logic should be updated to use weight mode and to support 
> flexible static/dynamic creation. 






[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources

2020-12-15 Thread Wangda Tan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249870#comment-17249870
 ] 

Wangda Tan commented on YARN-10295:
---

[~snemeth], I just came across this issue and saw a bunch of failed unit tests 
in the above Jenkins output. Are they patch related?

> CapacityScheduler NPE can cause apps to get stuck without resources
> ---
>
> Key: YARN-10295
> URL: https://issues.apache.org/jira/browse/YARN-10295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.2.2, 3.1.5
>
> Attachments: YARN-10295.001.branch-3.1.patch, 
> YARN-10295.001.branch-3.2.patch, YARN-10295.002.branch-3.1.patch, 
> YARN-10295.002.branch-3.2.patch
>
>
> When the CapacityScheduler Asynchronous scheduling is enabled and log level 
> is set to DEBUG there is an edge-case where a NullPointerException can cause 
> the scheduler thread to exit and the apps to get stuck without allocated 
> resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:apply(681)) - Reserved 
> container=container_e10_1590502305306_0660_01_000115, on node=host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used= with 
> resource=
> 2020-05-27 10:13:49,134 INFO  fica.FiCaSchedulerApp 
> (FiCaSchedulerApp.java:internalUnreserve(743)) - Application 
> application_1590502305306_0660 unreserved  on node host: 
> ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 
> available= used=, currently 
> has 0 at priority 11; currentReservation  on node-label=
> 2020-05-27 10:13:49,134 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread 
> Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough 
> memory, so after a short while it gets unreserved. However, because the 
> scheduler thread runs asynchronously, it might have entered the 
> following if block located in 
> [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
>  because at the time _node.getReservedContainer()_ wasn't null. Calling it a 
> second time to get the ApplicationAttemptId results in an NPE, as the 
> container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping scheduling since node " + node.getNodeID()
>         + " is reserved by application " + node.getReservedContainer()
>         .getContainerId().getApplicationAttemptId());
>   }
>   return null;
> }
> {code}
> A fix would be to store the container object before the if block. 
> Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664 
> which indirectly fixed this.

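
A minimal sketch of the fix described above, using the names from the quoted snippet: read the reserved container into a local variable once, so the concurrent unreserve cannot turn the second read into an NPE.

{code:java}
// Do not schedule if there are any reservations to fulfill on the node.
// Reading the container once makes the null check and the later use atomic
// with respect to an asynchronous unreserve.
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + reservedContainer
            .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}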





[jira] [Updated] (YARN-10534) Enable runC container transformations

2020-12-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10534:
--
Labels: pull-request-available  (was: )

> Enable runC container transformations
> -
>
> Key: YARN-10534
> URL: https://issues.apache.org/jira/browse/YARN-10534
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Matthew Sharp
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The goal of this Jira is to provide an optional plugin to apply runC 
> container transformations. Enabling runC container transformations will 
> provide an easy way to apply site specific customizations to all containers.
> An example of one transformation that many clusters may need could be a 
> Kerberos transformation. This would apply cluster Kerberos configurations and 
> mount them to all runC containers that are submitted, without requiring users 
> to manage them within their own images.






[jira] [Created] (YARN-10534) Enable runC container transformations

2020-12-15 Thread Matthew Sharp (Jira)
Matthew Sharp created YARN-10534:


 Summary: Enable runC container transformations
 Key: YARN-10534
 URL: https://issues.apache.org/jira/browse/YARN-10534
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Matthew Sharp


The goal of this Jira is to provide an optional plugin to apply runC container 
transformations. Enabling runC container transformations will provide an easy 
way to apply site specific customizations to all containers.

An example of one transformation that many clusters may need could be a 
Kerberos transformation. This would apply cluster Kerberos configurations and 
mount them to all runC containers that are submitted, without requiring users 
to manage them within their own images.






[jira] [Commented] (YARN-10526) RMAppManager CS Placement ignores parent path

2020-12-15 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249779#comment-17249779
 ] 

Szilard Nemeth commented on YARN-10526:
---

Thanks [~shuzirra] for working on this,

Latest patch LGTM, committed to trunk.

Can you make sure other branches are not affected by this bug (3.2 / 3.1)? 

Thanks.

> RMAppManager CS Placement ignores parent path
> -
>
> Key: YARN-10526
> URL: https://issues.apache.org/jira/browse/YARN-10526
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10526.001.patch, YARN-10526.002.patch, 
> YARN-10526.003.patch, YARN-10526.004.patch
>
>
> When RMAppManager creates the RMApp object using the placementContext's 
> results, it only uses the getQueue method, which returns only the name of 
> the leaf queue in the case of CapacityScheduler.
> If a queue exists with this name, then the application will be placed into 
> that queue. If the queue does not exist, then CS will take the parent path 
> into consideration during auto queue creation; however, this only happens 
> if there is no queue with the leaf name.
>  
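
A sketch of the direction a fix can take, assuming the {{getQueue()}}/{{getParentQueue()}} accessors on {{ApplicationPlacementContext}}; how the committed patch actually wires this into RMAppManager may differ.

{code:java}
// Build the full queue path from the placement context instead of passing
// only the leaf name to the RMApp, so CS sees the intended parent path.
// The null check stands in for however the real patch guards this.
String queuePath = (placementContext.getParentQueue() != null)
    ? placementContext.getParentQueue() + "." + placementContext.getQueue()
    : placementContext.getQueue();
{code}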






[jira] [Updated] (YARN-10526) RMAppManager CS Placement ignores parent path

2020-12-15 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10526:
--
Fix Version/s: 3.4.0

> RMAppManager CS Placement ignores parent path
> -
>
> Key: YARN-10526
> URL: https://issues.apache.org/jira/browse/YARN-10526
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10526.001.patch, YARN-10526.002.patch, 
> YARN-10526.003.patch, YARN-10526.004.patch
>
>
> When RMAppManager creates the RMApp object using the placementContext's 
> results, it only uses the getQueue method, which returns only the name of 
> the leaf queue in the case of CapacityScheduler.
> If a queue exists with this name, then the application will be placed into 
> that queue. If the queue does not exist, then CS will take the parent path 
> into consideration during auto queue creation; however, this only happens 
> if there is no queue with the leaf name.
>  






[jira] [Commented] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2020-12-15 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249653#comment-17249653
 ] 

Hadoop QA commented on YARN-10532:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 51s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green} 0m 0s{color} | {color:green}test4tests{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 27s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 4s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 1s{color} | {color:green}{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s{color} | {color:green}{color} | {color:green} trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 52s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 48s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 51s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green}{color} | {color:green} the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 53s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 45s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 38s{color} | {color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/390/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 9 new + 255 unchanged - 0 fixed = 264 total (was 255) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 18s{color} | {color:green}{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |

[jira] [Assigned] (YARN-10525) Support FS Convert to CS with weight mode enabled in CS.

2020-12-15 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10525:


Assignee: zhuqi

> Support FS Convert to CS with weight mode enabled in CS.
> 
>
> Key: YARN-10525
> URL: https://issues.apache.org/jira/browse/YARN-10525
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>







[jira] [Comment Edited] (YARN-10532) Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is not being used

2020-12-15 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249512#comment-17249512
 ] 

zhuqi edited comment on YARN-10532 at 12/15/20, 9:35 AM:
-

[~wangda] 

I would like to take this one, if no one else has taken it.

I have submitted a patch for review.

Also, it will change some logic once weight mode is done.

Thanks. :)


was (Author: zhuqi):
[~wangda] 

I would like to take this one, if no one else has taken it.

Thanks. :)

> Capacity Scheduler Auto Queue Creation: Allow auto delete queue when queue is 
> not being used
> 
>
> Key: YARN-10532
> URL: https://issues.apache.org/jira/browse/YARN-10532
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10532.001.patch
>
>
> It's better if we can delete auto-created queues when they are not in use for 
> a period of time (like 5 mins). It will be helpful when we have a large 
> number of auto-created queues (e.g. from 500 users), but only a small subset 
> of queues are actively used.






[jira] [Updated] (YARN-10430) Log improvements in NodeStatusUpdaterImpl

2020-12-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated YARN-10430:
---
Fix Version/s: 3.2.2

Added fix versions 3.2.2 and 3.2.3. Please update the fix version tag promptly 
when committing and backporting. Thanks.

> Log improvements in NodeStatusUpdaterImpl
> -
>
> Key: YARN-10430
> URL: https://issues.apache.org/jira/browse/YARN-10430
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.2.2, 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
> Attachments: YARN-10430.001.patch
>
>
> I think in the places below, the log should be printed only if the list size is not zero.
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("The cache log aggregation status size:"
>       + logAggregationReports.size());
> }
> {code}
> {code:java}
> LOG.info("Sending out " + containerStatuses.size()
>     + " NM container statuses: " + containerStatuses);
> {code}
> {code:java}
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Sending out " + containerStatuses.size()
>       + " container statuses: " + containerStatuses);
> }
> {code}

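One possible shape of the guards described above (a sketch, not the committed patch): skip each statement entirely when the list is empty.

{code:java}
// Log only when there is something to report; the isDebugEnabled() check
// still avoids building the message string at higher log levels.
if (LOG.isDebugEnabled() && !logAggregationReports.isEmpty()) {
  LOG.debug("The cache log aggregation status size:"
      + logAggregationReports.size());
}
if (!containerStatuses.isEmpty()) {
  LOG.info("Sending out " + containerStatuses.size()
      + " NM container statuses: " + containerStatuses);
}
{code}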

