[jira] [Updated] (YARN-10010) NM upload log cost too much time

2019-12-02 Thread zhoukang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated YARN-10010:

Attachment: (was: 选区_002.png)

> NM upload log cost too much time
> 
>
> Key: YARN-10010
> URL: https://issues.apache.org/jira/browse/YARN-10010
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, the log uploading
> service is sometimes delayed for some apps, like below:
>  !选区_002.png! 






[jira] [Updated] (YARN-10010) NM upload log cost too much time

2019-12-02 Thread zhoukang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated YARN-10010:

Description: 
Since the thread pool size of the log service is 100, the log uploading service
is sometimes delayed for some apps, like below:
 !notfound.png! 

  was:
Since the thread pool size of the log service is 100, the log uploading service
is sometimes delayed for some apps, like below:
 !选区_002.png! 


> NM upload log cost too much time
> 
>
> Key: YARN-10010
> URL: https://issues.apache.org/jira/browse/YARN-10010
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, the log uploading
> service is sometimes delayed for some apps, like below:
>  !notfound.png! 






[jira] [Created] (YARN-10010) NM upload log cost too much time

2019-12-02 Thread zhoukang (Jira)
zhoukang created YARN-10010:
---

 Summary: NM upload log cost too much time
 Key: YARN-10010
 URL: https://issues.apache.org/jira/browse/YARN-10010
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhoukang
Assignee: zhoukang
 Attachments: notfound.png

Since the thread pool size of the log service is 100, the log uploading service
is sometimes delayed for some apps, like below:
 !选区_002.png! 
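
For context, a hedged illustration (not the actual LogAggregationService code; class and variable names are made up for the sketch) of where that pool size of 100 comes from: the NodeManager-side property yarn.nodemanager.logaggregation.threadpool-size-max discussed in YARN-8364 below.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LogAggregationPoolSizeSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // All applications on one NM share this upload pool; the default is 100.
    int poolSize = conf.getInt(
        "yarn.nodemanager.logaggregation.threadpool-size-max", 100);
    ExecutorService uploadPool = Executors.newFixedThreadPool(poolSize);
    System.out.println("log aggregation thread pool size = " + poolSize);
    uploadPool.shutdown();
  }
}
{code}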






[jira] [Updated] (YARN-10010) NM upload log cost too much time

2019-12-02 Thread zhoukang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated YARN-10010:

Attachment: notfound.png

> NM upload log cost too much time
> 
>
> Key: YARN-10010
> URL: https://issues.apache.org/jira/browse/YARN-10010
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100, the log uploading
> service is sometimes delayed for some apps, like below:
>  !notfound.png! 






[jira] [Commented] (YARN-8364) NM aggregation thread should be able to exempt pool

2019-12-02 Thread zhoukang (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986613#comment-16986613
 ] 

zhoukang commented on YARN-8364:


I will work on this

> NM aggregation thread should be able to exempt pool
> ---
>
> Key: YARN-8364
> URL: https://issues.apache.org/jira/browse/YARN-8364
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Reporter: Oleksandr Shevchenko
>Priority: Major
>
> For now, we have a limited NM aggregation thread pool that can be configured 
> by the property yarn.nodemanager.logaggregation.threadpool-size-max=100.
> When an application starts, it takes one slot of the pool and holds it until 
> the application finishes. As a result, another application can aggregate its 
> logs only after the previous application has finished.
> Just for example:
> yarn.nodemanager.logaggregation.threadpool-size-max=1
> 1. Start long-running application app1
> 2. Start short application app2
> 3. Finished app2
> 4. Finished app1
> 5. Aggregating logs of app1
> 6. Aggregating logs of app2
> In a real cluster we can have many long-running jobs (for example Spark 
> streaming), so short-running applications do not get their logs aggregated 
> for a long time. The problem appears when the average number of jobs exceeds 
> the thread pool size: all threads are occupied by some applications, and as a 
> result there is a huge delay between an application finishing and its logs 
> being uploaded. It would be good to improve this behavior.
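
To make the queuing effect above concrete, here is a minimal, self-contained sketch in plain Java (java.util.concurrent only, not the actual NodeManager code): with a pool of size 1, the slot held for the long-running app1 keeps app2's log aggregation waiting even though app2 finished first.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AggregationPoolSketch {
  public static void main(String[] args) throws InterruptedException {
    // Stand-in for yarn.nodemanager.logaggregation.threadpool-size-max=1
    ExecutorService pool = Executors.newFixedThreadPool(1);

    // app1: long-running; its aggregator task occupies the only thread
    // until the application finishes.
    pool.submit(() -> {
      sleep(10_000); // simulate waiting for app1 to finish
      System.out.println("aggregated logs of app1");
    });

    // app2: short; it finished long ago, but its aggregation can only start
    // once app1's task releases the single thread.
    pool.submit(() -> System.out.println("aggregated logs of app2"));

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}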






[jira] [Commented] (YARN-9985) Unsupported "transitionToObserver" option displaying for rmadmin command

2019-12-02 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986506#comment-16986506
 ] 

Akira Ajisaka commented on YARN-9985:
-

Thanks [~ayushtkn] for providing the patch.

I'm thinking it would be better to fix the problem in HDFS. Can we move 
HDFS-specific command options from HAAdmin to DFSHAAdmin?

> Unsupported "transitionToObserver" option displaying for rmadmin command
> 
>
> Key: YARN-9985
> URL: https://issues.apache.org/jira/browse/YARN-9985
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM, yarn
>Affects Versions: 3.2.1
>Reporter: Souryakanta Dwivedy
>Assignee: Ayush Saxena
>Priority: Minor
> Attachments: YARN-9985-01.patch, YARN-9985-02.patch, 
> image-2019-11-18-18-31-17-755.png, image-2019-11-18-18-35-54-688.png
>
>
> Unsupported "transitionToObserver" option displaying for rmadmin command
> Check the options for Yarn rmadmin command
> It will display the "-transitionToObserver " option which is not 
> supported 
>  by yarn rmadmin command which is wrong behavior.
>  But if you check the yarn rmadmin -help it will not display any option  
> "-transitionToObserver "
>  
> !image-2019-11-18-18-31-17-755.png!
>  
> ==
> install/hadoop/resourcemanager/bin> ./yarn rmadmin -help
> rmadmin is the command to execute YARN administrative commands.
> The full syntax is:
> yarn rmadmin [-refreshQueues] [-refreshNodes [-g|graceful [timeout in 
> seconds] -client|server]] [-refreshNodesResources] 
> [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings] 
> [-refreshAdminAcls] [-refreshServiceAcl] [-getGroup [username]] 
> [-addToClusterNodeLabels 
> <"label1(exclusive=true),label2(exclusive=false),label3">] 
> [-removeFromClusterNodeLabels ] [-replaceLabelsOnNode 
> <"node1[:port]=label1,label2 node2[:port]=label1"> [-failOnUnknownNodes]] 
> [-directlyAccessNodeLabelStore] [-refreshClusterMaxPriority] 
> [-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) or 
> -updateNodeResource [NodeID] [ResourceTypes] ([OvercommitTimeout])] 
> *{color:#FF}[-transitionToActive [--forceactive] ]{color} 
> {color:#FF}[-transitionToStandby ]{color}* [-getServiceState 
> ] [-getAllServiceState] [-checkHealth ] [-help [cmd]]
> -refreshQueues: Reload the queues' acls, states and scheduler specific 
> properties.
>  ResourceManager will reload the mapred-queues configuration file.
>  -refreshNodes [-g|graceful [timeout in seconds] -client|server]: Refresh the 
> hosts information at the ResourceManager. Here [-g|graceful [timeout in 
> seconds] -client|server] is optional, if we specify the timeout then 
> ResourceManager will wait for timeout before marking the NodeManager as 
> decommissioned. The -client|server indicates if the timeout tracking should 
> be handled by the client or the ResourceManager. The client-side tracking is 
> blocking, while the server-side tracking is not. Omitting the timeout, or a 
> timeout of -1, indicates an infinite timeout. Known Issue: the server-side 
> tracking will immediately decommission if an RM HA failover occurs.
>  -refreshNodesResources: Refresh resources of NodeManagers at the 
> ResourceManager.
>  -refreshSuperUserGroupsConfiguration: Refresh superuser proxy groups mappings
>  -refreshUserToGroupsMappings: Refresh user-to-groups mappings
>  -refreshAdminAcls: Refresh acls for administration of ResourceManager
>  -refreshServiceAcl: Reload the service-level authorization policy file.
>  ResourceManager will reload the authorization policy file.
>  -getGroups [username]: Get the groups which given user belongs to.
>  -addToClusterNodeLabels 
> <"label1(exclusive=true),label2(exclusive=false),label3">: add to cluster 
> node labels. Default exclusivity is true
>  -removeFromClusterNodeLabels  (label splitted by ","): 
> remove from cluster node labels
>  -replaceLabelsOnNode <"node1[:port]=label1,label2 
> node2[:port]=label1,label2"> [-failOnUnknownNodes] : replace labels on nodes 
> (please note that we do not support specifying multiple labels on a single 
> host for now.)
>  [-failOnUnknownNodes] is optional, when we set this option, it will fail if 
> specified nodes are unknown.
>  -directlyAccessNodeLabelStore: This is DEPRECATED, will be removed in future 
> releases. Directly access node label store, with this option, all node label 
> related operations will not connect RM. Instead, they will access/modify 
> stored node labels directly. By default, it is false (access via RM). AND 
> PLEASE NOTE: if you configured yarn.node-labels.fs-store.root-dir to a local 
> directory (instead of NFS or HDFS), this option will only work when the 
> command run on the machine where RM is running.
>  -refreshClusterMaxPriority: 

[jira] [Commented] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986499#comment-16986499
 ] 

Hadoop QA commented on YARN-10009:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  
6s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 20s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 28s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 4 new + 9 unchanged - 0 fixed = 13 total (was 9) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 33s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 27s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}151m 27s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMRestart |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerWithMultiResourceTypes
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-10009 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12987327/YARN-10009.UT.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ce62f74fe7f4 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 6b2d6d4 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/25253/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
| unit | 

[jira] [Commented] (YARN-9992) Max allocation per queue is zero for custom resource types on RM startup

2019-12-02 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986454#comment-16986454
 ] 

Eric Payne commented on YARN-9992:
--

[~jhung], it looks like this is only a problem on branch-2 and branch-2.10. Is 
that your analysis as well?

> Max allocation per queue is zero for custom resource types on RM startup
> 
>
> Key: YARN-9992
> URL: https://issues.apache.org/jira/browse/YARN-9992
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9992.001.patch
>
>
> Found an issue where GPU requests on a newly booted RM cannot be scheduled. 
> The request throws the exception in 
> SchedulerUtils#throwInvalidResourceException:
> {noformat}
> throw new InvalidResourceRequestException(
> "Invalid resource request, requested resource type=[" + reqResourceName
> + "] < 0 or greater than maximum allowed allocation. Requested "
> + "resource=" + reqResource + ", maximum allowed allocation="
> + availableResource
> + ", please note that maximum allowed allocation is calculated "
> + "by scheduler based on maximum resource of registered "
> + "NodeManagers, which might be less than configured "
> + "maximum allocation="
> + ResourceUtils.getResourceTypesMaximumAllocation());{noformat}
> Upon refreshing the scheduler (e.g. via refreshQueues), GPU scheduling works 
> again.
> I think the root cause is that upon scheduler refresh, resource-types.xml is loaded 
> in CapacitySchedulerConfiguration (as part of YARN-7738), so when we call 
> ResourceUtils#fetchMaximumAllocationFromConfig in 
> CapacitySchedulerConfiguration#getMaximumAllocationPerQueue, it's able to 
> fetch the {{yarn.resource-types}} config. But resource-types.xml is not 
> loaded into the conf in CapacityScheduler#initScheduler, so it doesn't find 
> the custom resource when computing max allocations, and the custom resource 
> max allocation is 0.
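
A hedged sketch of the direction the description points at (this is not the attached YARN-9992.001.patch; the helper name and placement are assumptions): make sure resource-types.xml is loaded into the scheduler's configuration during CapacityScheduler#initScheduler, before per-queue maximum allocations are computed.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

public class ResourceTypesReloadSketch {
  // Hypothetical helper mirroring what initScheduler could do first:
  // re-read the resource types (including custom ones such as GPUs) from the
  // given configuration, which pulls in resource-types.xml when present.
  static void ensureResourceTypesLoaded(Configuration conf) {
    ResourceUtils.resetResourceTypes(conf);
  }
}
{code}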






[jira] [Updated] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10009:
--
Description: 
| |Memory|Vcores|res_1|
|Queue1 Totals|20GB|100|80|
|Resources requested by App1 in Queue1|8GB (40% of total)|8 (8% of total)|80 
(100% of total)|

In the previous use case:
 - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
 - User1 has requested 8 containers with {{}} each
 - {{res_1}} will be the dominant resource in this case.

All 8 containers should be assigned by the capacity scheduler, but with min 
user limit pct set to 25, only 2 containers are assigned.

  was:
| | Memory | Vcores | res_1 |
| Queue1 Totals | 20GB | 100 | 80 |
| Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of total) 
| 80 (100% of total) |

In the previous use case:
- Queue1 has a value of 25 for {{minimum-user-limit-percent}}
- User1 has requested 8 containers with {{}} each
- {{res_1}} will be the dominant resource in this case.

All 8 containers should be assigned by the capacity scheduler, but with min 
user limit pct set to 25, only 3 containers are assigned.


> DRF can treat minimum user limit percent as a max when custom resource is 
> defined
> -
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.10.0, 3.3.0, 3.2.1, 3.1.3, 2.11.0
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10009.UT.patch
>
>
> | |Memory|Vcores|res_1|
> |Queue1 Totals|20GB|100|80|
> |Resources requested by App1 in Queue1|8GB (40% of total)|8 (8% of total)|80 
> (100% of total)|
> In the previous use case:
>  - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
>  - User1 has requested 8 containers with {{}} each
>  - {{res_1}} will be the dominant resource in this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 2 containers are assigned.
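
Working the table above through the user-limit formula, assuming 10 res_1 per container (the per-container spec is elided, but 8 containers account for all 80), shows why the 25% minimum ends up acting as a maximum:
{noformat}
queue total res_1                          = 80
minimum-user-limit-percent                 = 25
user limit on res_1 = 80 * 25 / 100        = 20
containers admitted = 20 / (10 per container) = 2   (instead of all 8)
{noformat}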






[jira] [Updated] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10009:
--
Attachment: YARN-10009.UT.patch

> DRF can treat minimum user limit percent as a max when custom resource is 
> defined
> -
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10009.UT.patch
>
>
> | | Memory | Vcores | res_1 |
> | Queue1 Totals | 20GB | 100 | 80 |
> | Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of 
> total) | 80 (100% of total) |
> In the previous use case:
> - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
> - User1 has requested 8 containers with {{}} each
> - {{res_1}} will be the dominant resource in this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 3 containers are assigned.






[jira] [Commented] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986411#comment-16986411
 ] 

Eric Payne commented on YARN-10009:
---

The root cause is here:
{code:title=UsersManager#computeUserLimit}
/*
 * User limit resource is determined by: max(currentCapacity / #activeUsers,
 * currentCapacity * user-limit-percentage%)
 */
Resource userLimitResource = Resources.max(resourceCalculator,
    partitionResource,
    Resources.divideAndCeil(resourceCalculator, resourceUsed,
        usersSummedByWeight),
    Resources.divideAndCeil(resourceCalculator,
        Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
        100));
{code}
When calculating the user resource limit, {{divideAndCeil}} is used to take the 
max of (queue capacity / # of active users) and (queue capacity * min user 
limit pct / 100). However, these are not the same divideAndCeil method: the 
first takes a {{Resource}} and a {{float}}, while the second takes a 
{{Resource}} and an {{int}}. The method with the {{Resource}}, {{float}} 
signature was never updated to handle custom resources.

The only place that calls {{divideAndCeil(Resource, float)}} is here in 
{{UsersManager#computeUserLimit}}.
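
For reference, a rough sketch (my own illustration, not the YARN-10009 patch) of what a custom-resource-aware {{divideAndCeil(Resource, float)}} could look like: iterate over every registered resource type instead of only memory and vcores.
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceInformation;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

public class DivideAndCeilSketch {
  // Hypothetical replacement for the Resource/float overload: round up the
  // division for memory, vcores, and custom types (e.g. res_1) alike, so
  // custom resources are no longer left at 0.
  public static Resource divideAndCeil(Resource numerator, float denominator) {
    Resource ret = Resource.newInstance(numerator);
    int types = ResourceUtils.getNumberOfKnownResourceTypes();
    for (int i = 0; i < types; i++) {
      ResourceInformation ri = numerator.getResourceInformation(i);
      ret.setResourceValue(i, (long) Math.ceil(ri.getValue() / denominator));
    }
    return ret;
  }
}
{code}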

> DRF can treat minimum user limit percent as a max when custom resource is 
> defined
> -
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> | | Memory | Vcores | res_1 |
> | Queue1 Totals | 20GB | 100 | 80 |
> | Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of 
> total) | 80 (100% of total) |
> In the previous use case:
> - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
> - User1 has requested 8 containers with {{}} each
> - {{res_1}} will be the dominant resource in this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 3 containers are assigned.






[jira] [Updated] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10009:
--
Attachment: (was: YARN-10009.POC.patch)

> DRF can treat minimum user limit percent as a max when custom resource is 
> defined
> -
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> | | Memory | Vcores | res_1 |
> | Queue1 Totals | 20GB | 100 | 80 |
> | Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of 
> total) | 80 (100% of total) |
> In the previous use case:
> - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
> - User1 has requested 8 containers with {{}} each
> - {{res_1}} will be the dominant resource in this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 3 containers are assigned.






[jira] [Updated] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10009:
--
Attachment: YARN-10009.POC.patch

> DRF can treat minimum user limit percent as a max when custom resource is 
> defined
> -
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-10009.POC.patch
>
>
> | | Memory | Vcores | res_1 |
> | Queue1 Totals | 20GB | 100 | 80 |
> | Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of 
> total) | 80 (100% of total) |
> In the previous use case:
> - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
> - User1 has requested 8 containers with {{}} each
> - {{res_1}} will be the dominant resource in this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 3 containers are assigned.






[jira] [Assigned] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-10009:
-

Assignee: Eric Payne

> DRF can treat minimum user limit percent as a max when custom resource is 
> defined
> -
>
> Key: YARN-10009
> URL: https://issues.apache.org/jira/browse/YARN-10009
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> | | Memory | Vcores | res_1 |
> | Queue1 Totals | 20GB | 100 | 80 |
> | Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of 
> total) | 80 (100% of total) |
> In the previous use case:
> - Queue1 has a value of 25 for {{minimum-user-limit-percent}}
> - User1 has requested 8 containers with {{}} each
> - {{res_1}} will be the dominant resource in this case.
> All 8 containers should be assigned by the capacity scheduler, but with min 
> user limit pct set to 25, only 3 containers are assigned.






[jira] [Created] (YARN-10009) DRF can treat minimum user limit percent as a max when custom resource is defined

2019-12-02 Thread Eric Payne (Jira)
Eric Payne created YARN-10009:
-

 Summary: DRF can treat minimum user limit percent as a max when 
custom resource is defined
 Key: YARN-10009
 URL: https://issues.apache.org/jira/browse/YARN-10009
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Eric Payne


| | Memory | Vcores | res_1 |
| Queue1 Totals | 20GB | 100 | 80 |
| Resources requested by App1 in Queue1 | 8GB (40% of total) | 8 (8% of total) 
| 80 (100% of total) |

In the previous use case:
- Queue1 has a value of 25 for {{minimum-user-limit-percent}}
- User1 has requested 8 containers with {{}} each
- {{res_1}} will be the dominant resource in this case.

All 8 containers should be assigned by the capacity scheduler, but with min 
user limit pct set to 25, only 3 containers are assigned.






[jira] [Commented] (YARN-9561) Add C changes for the new RuncContainerRuntime

2019-12-02 Thread Eric Badger (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986353#comment-16986353
 ] 

Eric Badger commented on YARN-9561:
---

[~shaneku...@gmail.com], [~eyang], hope the holidays have treated you well. 
When you get a chance, could you check out the latest patch that fixes the 
Hadoop QA maven issue?

> Add C changes for the new RuncContainerRuntime
> --
>
> Key: YARN-9561
> URL: https://issues.apache.org/jira/browse/YARN-9561
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9561.001.patch, YARN-9561.002.patch, 
> YARN-9561.003.patch, YARN-9561.004.patch, YARN-9561.005.patch, 
> YARN-9561.006.patch, YARN-9561.007.patch, YARN-9561.008.patch, 
> YARN-9561.009.patch, YARN-9561.010.patch, YARN-9561.011.patch, 
> YARN-9561.012.patch, YARN-9561.013.patch, YARN-9561.014.patch, 
> YARN-9561.015.patch
>
>
> This JIRA will be used to add the C changes to the container-executor native 
> binary that are necessary for the new RuncContainerRuntime. There should be 
> no changes to existing code paths. 






[jira] [Commented] (YARN-9992) Max allocation per queue is zero for custom resource types on RM startup

2019-12-02 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986312#comment-16986312
 ] 

Eric Payne commented on YARN-9992:
--

Thanks [~jhung] for reporting this issue and putting up a patch. I encountered 
this problem as well. I'll take a look at the patch soon.

> Max allocation per queue is zero for custom resource types on RM startup
> 
>
> Key: YARN-9992
> URL: https://issues.apache.org/jira/browse/YARN-9992
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9992.001.patch
>
>
> Found an issue where GPU requests on a newly booted RM cannot be scheduled. 
> The request throws the exception in 
> SchedulerUtils#throwInvalidResourceException:
> {noformat}
> throw new InvalidResourceRequestException(
> "Invalid resource request, requested resource type=[" + reqResourceName
> + "] < 0 or greater than maximum allowed allocation. Requested "
> + "resource=" + reqResource + ", maximum allowed allocation="
> + availableResource
> + ", please note that maximum allowed allocation is calculated "
> + "by scheduler based on maximum resource of registered "
> + "NodeManagers, which might be less than configured "
> + "maximum allocation="
> + ResourceUtils.getResourceTypesMaximumAllocation());{noformat}
> Upon refreshing the scheduler (e.g. via refreshQueues), GPU scheduling works 
> again.
> I think the root cause is that upon scheduler refresh, resource-types.xml is loaded 
> in CapacitySchedulerConfiguration (as part of YARN-7738), so when we call 
> ResourceUtils#fetchMaximumAllocationFromConfig in 
> CapacitySchedulerConfiguration#getMaximumAllocationPerQueue, it's able to 
> fetch the {{yarn.resource-types}} config. But resource-types.xml is not 
> loaded into the conf in CapacityScheduler#initScheduler, so it doesn't find 
> the custom resource when computing max allocations, and the custom resource 
> max allocation is 0.






[jira] [Commented] (YARN-9985) Unsupported "transitionToObserver" option displaying for rmadmin command

2019-12-02 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986089#comment-16986089
 ] 

Ayush Saxena commented on YARN-9985:


[~aajisaka] [~tasanuma], can you help review?

> Unsupported "transitionToObserver" option displaying for rmadmin command
> 
>
> Key: YARN-9985
> URL: https://issues.apache.org/jira/browse/YARN-9985
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM, yarn
>Affects Versions: 3.2.1
>Reporter: Souryakanta Dwivedy
>Assignee: Ayush Saxena
>Priority: Minor
> Attachments: YARN-9985-01.patch, YARN-9985-02.patch, 
> image-2019-11-18-18-31-17-755.png, image-2019-11-18-18-35-54-688.png
>
>
> Unsupported "transitionToObserver" option displaying for rmadmin command
> Check the options for Yarn rmadmin command
> It will display the "-transitionToObserver " option which is not 
> supported 
>  by yarn rmadmin command which is wrong behavior.
>  But if you check the yarn rmadmin -help it will not display any option  
> "-transitionToObserver "
>  
> !image-2019-11-18-18-31-17-755.png!
>  
> ==
> install/hadoop/resourcemanager/bin> ./yarn rmadmin -help
> rmadmin is the command to execute YARN administrative commands.
> The full syntax is:
> yarn rmadmin [-refreshQueues] [-refreshNodes [-g|graceful [timeout in 
> seconds] -client|server]] [-refreshNodesResources] 
> [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings] 
> [-refreshAdminAcls] [-refreshServiceAcl] [-getGroup [username]] 
> [-addToClusterNodeLabels 
> <"label1(exclusive=true),label2(exclusive=false),label3">] 
> [-removeFromClusterNodeLabels ] [-replaceLabelsOnNode 
> <"node1[:port]=label1,label2 node2[:port]=label1"> [-failOnUnknownNodes]] 
> [-directlyAccessNodeLabelStore] [-refreshClusterMaxPriority] 
> [-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) or 
> -updateNodeResource [NodeID] [ResourceTypes] ([OvercommitTimeout])] 
> *{color:#FF}[-transitionToActive [--forceactive] ]{color} 
> {color:#FF}[-transitionToStandby ]{color}* [-getServiceState 
> ] [-getAllServiceState] [-checkHealth ] [-help [cmd]]
> -refreshQueues: Reload the queues' acls, states and scheduler specific 
> properties.
>  ResourceManager will reload the mapred-queues configuration file.
>  -refreshNodes [-g|graceful [timeout in seconds] -client|server]: Refresh the 
> hosts information at the ResourceManager. Here [-g|graceful [timeout in 
> seconds] -client|server] is optional, if we specify the timeout then 
> ResourceManager will wait for timeout before marking the NodeManager as 
> decommissioned. The -client|server indicates if the timeout tracking should 
> be handled by the client or the ResourceManager. The client-side tracking is 
> blocking, while the server-side tracking is not. Omitting the timeout, or a 
> timeout of -1, indicates an infinite timeout. Known Issue: the server-side 
> tracking will immediately decommission if an RM HA failover occurs.
>  -refreshNodesResources: Refresh resources of NodeManagers at the 
> ResourceManager.
>  -refreshSuperUserGroupsConfiguration: Refresh superuser proxy groups mappings
>  -refreshUserToGroupsMappings: Refresh user-to-groups mappings
>  -refreshAdminAcls: Refresh acls for administration of ResourceManager
>  -refreshServiceAcl: Reload the service-level authorization policy file.
>  ResourceManager will reload the authorization policy file.
>  -getGroups [username]: Get the groups which given user belongs to.
>  -addToClusterNodeLabels 
> <"label1(exclusive=true),label2(exclusive=false),label3">: add to cluster 
> node labels. Default exclusivity is true
>  -removeFromClusterNodeLabels  (label splitted by ","): 
> remove from cluster node labels
>  -replaceLabelsOnNode <"node1[:port]=label1,label2 
> node2[:port]=label1,label2"> [-failOnUnknownNodes] : replace labels on nodes 
> (please note that we do not support specifying multiple labels on a single 
> host for now.)
>  [-failOnUnknownNodes] is optional, when we set this option, it will fail if 
> specified nodes are unknown.
>  -directlyAccessNodeLabelStore: This is DEPRECATED, will be removed in future 
> releases. Directly access node label store, with this option, all node label 
> related operations will not connect RM. Instead, they will access/modify 
> stored node labels directly. By default, it is false (access via RM). AND 
> PLEASE NOTE: if you configured yarn.node-labels.fs-store.root-dir to a local 
> directory (instead of NFS or HDFS), this option will only work when the 
> command run on the machine where RM is running.
>  -refreshClusterMaxPriority: Refresh cluster max priority
>  -updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout])
>  or
>  [NodeID] 

[jira] [Commented] (YARN-5259) Add two metrics at FSOpDurations for doing container assign and completed Performance statistical analysis

2019-12-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986009#comment-16986009
 ] 

Hadoop QA commented on YARN-5259:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
47s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  7s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 28s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 33 unchanged - 0 fixed = 35 total (was 33) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 50s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 87m 
37s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}143m 25s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | YARN-5259 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12987278/YARN-5259_5.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 633a53a14681 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 6b2d6d4 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/25252/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/25252/testReport/ |
| Max. process+thread count | 821 (vs. 

[jira] [Updated] (YARN-5259) Add two metrics at FSOpDurations for doing container assign and completed Performance statistical analysis

2019-12-02 Thread Shen Yinjie (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shen Yinjie updated YARN-5259:
--
Attachment: YARN-5259_5.patch

> Add two metrics at FSOpDurations for doing container assign and completed 
> Performance statistical analysis
> --
>
> Key: YARN-5259
> URL: https://issues.apache.org/jira/browse/YARN-5259
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Reporter: ChenFolin
>Assignee: Íñigo Goiri
>Priority: Major
>  Labels: oct16-easy
> Attachments: YARN-5259-001.patch, YARN-5259-002.patch, 
> YARN-5259-003.patch, YARN-5259-004.patch, YARN-5259_5.patch
>
>
> If the cluster is slow, we cannot tell whether it is caused by container 
> assignment or by container completion performance.
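
For context, a hedged sketch (metric and method names are my own, not the attached patch) of the kind of FSOpDurations addition the summary describes, following the @Metric/MutableRate pattern that class already uses:
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableRate;

@Metrics(context = "yarn")
public class FSOpDurationsSketch {
  // The metrics system instantiates these fields when this source is
  // registered (e.g. via DefaultMetricsSystem.instance().register(...)).
  @Metric("Duration of handling a container assignment")
  MutableRate containerAssignCall;

  @Metric("Duration of handling a container completion")
  MutableRate containerCompletedCall;

  public void addContainerAssignDuration(long durationMs) {
    containerAssignCall.add(durationMs);
  }

  public void addContainerCompletedDuration(long durationMs) {
    containerCompletedCall.add(durationMs);
  }
}
{code}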


