[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10504:
-
Attachment: (was: YARN-10504.001.patch)

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10504.001.patch
>
>
> To allow queues to be created flexibly in Capacity Scheduler, a weight mode 
> should be introduced. The existing {{capacity}} property should 
> be used with a different syntax, e.g.:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes
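
A minimal sketch of the recalculation named in the last bullet, assuming a 
free-standing helper rather than the actual patch code: each child's relative 
capacity is its weight divided by the sum of its siblings' weights, so weights 
a=3 and b=1 resolve to 75% and 25% of the parent.
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public final class WeightNormalizer {

  // Hypothetical helper: turn sibling queue weights into relative
  // (percentage-based) capacities by dividing each weight by the total.
  static Map<String, Float> toRelativeCapacities(Map<String, Float> weights) {
    float sum = 0f;
    for (float w : weights.values()) {
      sum += w;
    }
    Map<String, Float> capacities = new LinkedHashMap<>();
    for (Map.Entry<String, Float> e : weights.entrySet()) {
      // Guard against a zero total instead of dividing by zero.
      capacities.put(e.getKey(), sum == 0f ? 0f : e.getValue() / sum * 100f);
    }
    return capacities;
  }
}
{code}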






[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10504:
-
Attachment: YARN-10504.001.patch

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10504.001.patch
>
>
> To allow queues to be created flexibly in Capacity Scheduler, a weight mode 
> should be introduced. The existing {{capacity}} property should 
> be used with a different syntax, e.g.:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes






[jira] [Assigned] (YARN-10504) Implement weight mode in Capacity Scheduler

2020-12-10 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10504:


Assignee: zhuqi  (was: Benjamin Teke)

> Implement weight mode in Capacity Scheduler
> ---
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Benjamin Teke
>Assignee: zhuqi
>Priority: Major
>
> To allow queues to be created flexibly in Capacity Scheduler, a weight mode 
> should be introduced. The existing {{capacity}} property should 
> be used with a different syntax, e.g.:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>  
> The new functionality should: 
>  * accept and validate the new weight values
>  * enforce a singular mode on the whole queue tree
>  * (re)calculate the relative (percentage-based) capacities based on the 
> weights during launch and every time the queue structure changes






[jira] [Updated] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-12-09 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10380:
-
Attachment: (was: YARN-10380.002.patch)

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  
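
A self-contained sketch of the partition-first entry point suggested by the 
pseudocode above; the partition-to-candidates lookup and the allocator are 
left as hypothetical parameters, so this shows the loop shape only, not the 
committed implementation.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public final class PartitionFirstEntryPoint {

  // Illustrative only: "candidates" maps each partition to its sorted
  // candidate nodes; in the real scheduler this would come from the
  // multi-node sorting policy rather than a plain map.
  static void allocateOneRound(Map<String, List<String>> candidates,
      BiConsumer<String, List<String>> allocateContainersOnMultiNodes) {
    // Iterate partitions first, so one scheduling round touches each
    // partition's candidate list once instead of every node per request.
    for (Map.Entry<String, List<String>> e : candidates.entrySet()) {
      allocateContainersOnMultiNodes.accept(e.getKey(), e.getValue());
    }
  }
}
{code}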






[jira] [Updated] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-12-09 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10380:
-
Attachment: (was: YARN-10380.001.patch)

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Created] (YARN-10525) Support FS Convert to CS with weight mode enabled in CS.

2020-12-09 Thread zhuqi (Jira)
zhuqi created YARN-10525:


 Summary: Support FS Convert to CS with weight mode enabled in CS.
 Key: YARN-10525
 URL: https://issues.apache.org/jira/browse/YARN-10525
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: zhuqi









[jira] [Created] (YARN-10524) Support multi resource type based weight mode in CS.

2020-12-09 Thread zhuqi (Jira)
zhuqi created YARN-10524:


 Summary: Support multi resource type based weight mode in CS.
 Key: YARN-10524
 URL: https://issues.apache.org/jira/browse/YARN-10524
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: zhuqi
Assignee: zhuqi









[jira] [Created] (YARN-10522) Document for Flexible Auto Queue Creation in Capacity Scheduler.

2020-12-08 Thread zhuqi (Jira)
zhuqi created YARN-10522:


 Summary: Document for Flexible Auto Queue Creation in Capacity 
Scheduler.
 Key: YARN-10522
 URL: https://issues.apache.org/jira/browse/YARN-10522
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: zhuqi


We should update the documentation to cover this feature.






[jira] [Created] (YARN-10521) To support the mixed mode on different levels(optional) or disabled all.

2020-12-08 Thread zhuqi (Jira)
zhuqi created YARN-10521:


 Summary: To support the mixed mode on different levels(optional) 
or disabled all.
 Key: YARN-10521
 URL: https://issues.apache.org/jira/browse/YARN-10521
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: zhuqi
Assignee: zhuqi


*Mixed* percentage / weight / absolute resource should *not* be allowed *at the 
same time* _at all hierarchical levels_. Restricting this to all levels of the 
hierarchy is not absolutely necessary, as theoretically it is possible to 
support the mixed mode on different levels, but whether it is worth it is up 
for debate. 

*Mixing* static and auto-created queues under the same parent will not be 
supported if the queues are defined by percentages.






[jira] [Commented] (YARN-9443) Fast RM Failover using Ratis (Raft protocol)

2020-12-06 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244973#comment-17244973
 ] 

zhuqi commented on YARN-9443:
-

[~prabhujoseph] [~leftnoteasy] 

It's a great improvement; I am looking forward to the design. I can help 
finish some sub-tasks when I am free, and I hope it can be applied to our 
production cluster with thousands of nodes. It would be very helpful.

Thanks a lot.

> Fast RM Failover using Ratis (Raft protocol)
> 
>
> Key: YARN-9443
> URL: https://issues.apache.org/jira/browse/YARN-9443
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> During failover, the RM standby will have a lag as it has to recover from 
> Zookeeper / FileSystem StateStore. RM HA using Ratis (Raft protocol) can 
> achieve fast failover as all RMs are already in sync. This is used by Ozone - 
> HDDS-505.
>  
> cc [~nandakumar131]






[jira] [Updated] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.

2020-12-04 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10514:
-
Description: 
Whether we schedule with the multi-node lookup policy for async scheduling or 
just use heartbeat-based scheduling, we run into scheduling fragments. With 
cpu-intensive, gpu-intensive, or memory-intensive jobs, the cluster suffers a 
heavy waste of resources, so this issue will help move the scheduler towards 
dominant-resource-based scheduling, to help our cluster get better resource 
utilization and to load-balance the nodemanager resource distribution.

 

  was:Whether we schedule with the multi-node lookup policy for async 
scheduling or just use heartbeat-based scheduling, we run into scheduling 
fragments. With cpu-intensive, gpu-intensive, or memory-intensive jobs, the 
cluster suffers a heavy waste of resources, so this issue will help move the 
scheduler towards dominant-resource-based scheduling, to help our cluster get 
better resource utilization.


> Introduce a dominant resource based schedule policy to increase the resource 
> utilization, avoid heavy cluster resource fragments.
> -
>
> Key: YARN-10514
> URL: https://issues.apache.org/jira/browse/YARN-10514
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10514.001.patch
>
>
> Whether we schedule with the multi-node lookup policy for async scheduling or 
> just use heartbeat-based scheduling, we run into scheduling fragments. With 
> cpu-intensive, gpu-intensive, or memory-intensive jobs, the cluster suffers a 
> heavy waste of resources, so this issue will help move the scheduler towards 
> dominant-resource-based scheduling, to help our cluster get better resource 
> utilization and to load-balance the nodemanager resource distribution.
>  






[jira] [Comment Edited] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.

2020-12-03 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243071#comment-17243071
 ] 

zhuqi edited comment on YARN-10514 at 12/3/20, 10:17 AM:
-

[~leftnoteasy] [~tangzhankun] [~prabhujoseph] [~sunil.gov...@gmail.com] [~jiwq]

Do you have any advice about this proposal?

I submitted a draft patch to support the async multi-node scheduling mode in 
CS, but I think that in the heartbeat scheduling mode in CS/FS we should also 
handle resource fragments, to help get better resource utilization.


was (Author: zhuqi):
[~leftnoteasy] [~tangzhankun] [~prabhujoseph] [~sunil.gov...@gmail.com] [~jiwq]

Do you have any advice about this proposal?

I submitted a draft patch to support the async multi-node scheduling mode, but 
I think that in the heartbeat scheduling mode we should also handle resource 
fragments, to help get better resource utilization.

> Introduce a dominant resource based schedule policy to increase the resource 
> utilization, avoid heavy cluster resource fragments.
> -
>
> Key: YARN-10514
> URL: https://issues.apache.org/jira/browse/YARN-10514
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10514.001.patch
>
>
> Whether we schedule with the multi-node lookup policy for async scheduling or 
> just use heartbeat-based scheduling, we run into scheduling fragments. With 
> cpu-intensive, gpu-intensive, or memory-intensive jobs, the cluster suffers a 
> heavy waste of resources, so this issue will help move the scheduler towards 
> dominant-resource-based scheduling, to help our cluster get better resource 
> utilization.






[jira] [Commented] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.

2020-12-03 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243071#comment-17243071
 ] 

zhuqi commented on YARN-10514:
--

[~leftnoteasy] [~tangzhankun] [~prabhujoseph] [~sunil.gov...@gmail.com] [~jiwq]

Do you have any advice about this proposal?

I submitted a draft patch to support the async multi-node scheduling mode, but 
I think that in the heartbeat scheduling mode we should also handle resource 
fragments, to help get better resource utilization.

> Introduce a dominant resource based schedule policy to increase the resource 
> utilization, avoid heavy cluster resource fragments.
> -
>
> Key: YARN-10514
> URL: https://issues.apache.org/jira/browse/YARN-10514
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.3.0, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10514.001.patch
>
>
> Whether we schedule with the multi-node lookup policy for async scheduling or 
> just use heartbeat-based scheduling, we run into scheduling fragments. With 
> cpu-intensive, gpu-intensive, or memory-intensive jobs, the cluster suffers a 
> heavy waste of resources, so this issue will help move the scheduler towards 
> dominant-resource-based scheduling, to help our cluster get better resource 
> utilization.






[jira] [Created] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.

2020-12-03 Thread zhuqi (Jira)
zhuqi created YARN-10514:


 Summary: Introduce a dominant resource based schedule policy to 
increase the resource utilization, avoid heavy cluster resource fragments.
 Key: YARN-10514
 URL: https://issues.apache.org/jira/browse/YARN-10514
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.3.0, 3.4.0
Reporter: zhuqi
Assignee: zhuqi


Whether we schedule with the multi-node lookup policy for async scheduling or 
just use heartbeat-based scheduling, we run into scheduling fragments. With 
cpu-intensive, gpu-intensive, or memory-intensive jobs, the cluster suffers a 
heavy waste of resources, so this issue will help move the scheduler towards 
dominant-resource-based scheduling, to help our cluster get better resource 
utilization.
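
To make the idea concrete, a hedged sketch of a dominant-resource-aware node 
choice, assuming each request and each node's free capacity is summarized as a 
per-resource-type vector; all names below are illustrative and not taken from 
the attached patch.
{code:java}
import java.util.List;
import java.util.Map;

public final class DominantResourceHint {

  // Name of the resource type with the highest share of the cluster total
  // in the given vector; returns null for an empty or all-zero vector.
  static String dominant(Map<String, Long> vector, Map<String, Long> total) {
    String best = null;
    double bestShare = 0.0;
    for (Map.Entry<String, Long> e : vector.entrySet()) {
      long cap = total.getOrDefault(e.getKey(), 0L);
      double share = cap > 0 ? (double) e.getValue() / cap : 0.0;
      if (share > bestShare) {
        bestShare = share;
        best = e.getKey();
      }
    }
    return best;
  }

  // Prefer a node whose dominant free resource matches the request's
  // dominant resource, which keeps scarce resources on other nodes intact
  // and reduces fragmentation; fall back to the first candidate.
  static int pickNode(Map<String, Long> request,
      List<Map<String, Long>> freeByNode, Map<String, Long> clusterTotal) {
    String wanted = dominant(request, clusterTotal);
    for (int i = 0; wanted != null && i < freeByNode.size(); i++) {
      if (wanted.equals(dominant(freeByNode.get(i), clusterTotal))) {
        return i;
      }
    }
    return freeByNode.isEmpty() ? -1 : 0;
  }
}
{code}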






[jira] [Comment Edited] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler

2020-12-02 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242887#comment-17242887
 ] 

zhuqi edited comment on YARN-10496 at 12/3/20, 3:53 AM:


Thanks [~wangda] for putting up this proposal.

As a long-time user of FS, I think option #1 would be the way to go; I agree 
with what [~epayne] said.

We should discuss how to define max capacity in CS; in FS, max is usually 
expressed as absolute resources. We could restrict the max capacity to two 
(or three) choices:

1: Use absolute resources.

2: Use a percentage of the immediate parent (such as percentage = weight * 1.5, 
etc.).

3 (optional): Use a percentage of the cluster.

This will help FS users migrate, and CS users will also adapt to it.

Thanks a lot.


was (Author: zhuqi):
Thanks [~wangda] for putting up this proposal.

As a long-time user of FS, I think option #1 would be the way to go; I agree 
with what [~epayne] said.

We should discuss how to define max capacity in CS; in FS, max is usually 
expressed as absolute resources. We could restrict the max capacity to two 
(or three) choices:

1: Use absolute resources.

2: Use a percentage of the immediate parent.

3 (optional): Use a percentage of the cluster.

This will help FS users migrate, and CS users will also adapt to it.

Thanks a lot.

> [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
> -
>
> Key: YARN-10496
> URL: https://issues.apache.org/jira/browse/YARN-10496
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> CapacityScheduler today doesn’t support an auto queue creation which is 
> flexible enough. The current constraints: 
>  * Only leaf queues can be auto-created
>  * A parent can only have either static queues or dynamic ones. This causes 
> multiple constraints. For example:
>  * It isn’t possible to have a VIP user like Alice with a static queue 
> root.user.alice with 50% capacity while the other user queues (under 
> root.user) are created dynamically and they share the remaining 50% of 
> resources.
>  
>  * In comparison, FairScheduler allows the following scenarios, Capacity 
> Scheduler doesn’t:
>  ** This implies that there is no possibility to have both dynamically 
> created and static queues at the same time under root
>  * A new queue needs to be created under an existing parent, while the parent 
> already has static queues
>  * Nested queue mapping policy, like in the following example: 
>  * Here two levels of queues may need to be created 
> If an application belongs to user _alice_ (who has the primary_group of 
> _engineering_), the scheduler checks whether _root.engineering_ exists; if it 
> doesn't, it'll be created. Then the scheduler checks whether 
> _root.engineering.alice_ exists, and creates it if it doesn't.
>  
> When we try to move users from FairScheduler to CapacityScheduler, we face 
> feature gaps which block users from migrating from FS to CS.
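
For the nested mapping policy mentioned above (the example did not survive the 
mail rendering), a rule of that shape could be written with the existing CS 
queue-mapping syntax roughly as follows; the concrete rule is an assumption 
here, not a quote from the proposal:
{noformat}
yarn.scheduler.capacity.queue-mappings = u:%user:%primary_group.%user
{noformat}
Under such a rule, user _alice_ with primary group _engineering_ would map to 
_root.engineering.alice_, matching the two-level creation walkthrough in the 
description.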






[jira] [Commented] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler

2020-12-02 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242887#comment-17242887
 ] 

zhuqi commented on YARN-10496:
--

Thanks [~wangda] for putting up this proposal.

As a long-time user of FS, I think option #1 would be the way to go; I agree 
with what [~epayne] said.

We should discuss how to define max capacity in CS; in FS, max is usually 
expressed as absolute resources. We could restrict the max capacity to two 
(or three) choices:

1: Use absolute resources.

2: Use a percentage of the immediate parent.

3 (optional): Use a percentage of the cluster.

This will help FS users migrate, and CS users will also adapt to it.

Thanks a lot.

> [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
> -
>
> Key: YARN-10496
> URL: https://issues.apache.org/jira/browse/YARN-10496
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacity scheduler
>Reporter: Wangda Tan
>Priority: Major
>
> CapacityScheduler today doesn’t support an auto queue creation which is 
> flexible enough. The current constraints: 
>  * Only leaf queues can be auto-created
>  * A parent can only have either static queues or dynamic ones. This causes 
> multiple constraints. For example:
>  * It isn’t possible to have a VIP user like Alice with a static queue 
> root.user.alice with 50% capacity while the other user queues (under 
> root.user) are created dynamically and they share the remaining 50% of 
> resources.
>  
>  * In comparison, FairScheduler allows the following scenarios, Capacity 
> Scheduler doesn’t:
>  ** This implies that there is no possibility to have both dynamically 
> created and static queues at the same time under root
>  * A new queue needs to be created under an existing parent, while the parent 
> already has static queues
>  * Nested queue mapping policy, like in the following example: 
>  * Here two levels of queues may need to be created 
> If an application belongs to user _alice_ (who has the primary_group of 
> _engineering_), the scheduler checks whether _root.engineering_ exists; if it 
> doesn't, it'll be created. Then the scheduler checks whether 
> _root.engineering.alice_ exists, and creates it if it doesn't.
>  
> When we try to move users from FairScheduler to CapacityScheduler, we face 
> feature gaps which block users from migrating from FS to CS.






[jira] [Comment Edited] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-12-02 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240742#comment-17240742
 ] 

zhuqi edited comment on YARN-10169 at 12/2/20, 11:45 AM:
-

[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph]  

I have submitted a patch to fix it; could anyone review it? 

Thanks.


was (Author: zhuqi):
[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] @ 

I have submitted a patch to fix it; could anyone review it? 

Thanks.

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10169.001.patch, YARN-10169.002.patch, 
> YARN-10169.003.patch
>
>
> To me this is a bug: if a queue has capacity set to a float and 
> maximum-capacity set to an absolute value, the existing logic allows the behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when configured like this.
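
A minimal sketch of the cross-check the fix needs, with hypothetical helper 
names rather than the actual patch code: capacity and maximum-capacity must be 
in the same mode before the configuration is accepted.
{code:java}
public final class CapacityModeValidator {

  // Hypothetical convention: an absolute-resource value is bracketed,
  // e.g. "[memory=4096,vcores=4]"; a percentage/float value is not.
  static boolean isAbsolute(String value) {
    return value != null && value.trim().startsWith("[");
  }

  // Reject a queue whose capacity and maximum-capacity mix modes,
  // mirroring the exception this issue asks for.
  static void validate(String queue, String capacity, String maxCapacity) {
    if (capacity != null && maxCapacity != null
        && isAbsolute(capacity) != isAbsolute(maxCapacity)) {
      throw new IllegalArgumentException("Queue " + queue
          + " mixes absolute and percentage-based capacity values");
    }
  }
}
{code}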






[jira] [Comment Edited] (YARN-9618) NodeListManager event improvement

2020-12-02 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242179#comment-17242179
 ] 

zhuqi edited comment on YARN-9618 at 12/2/20, 9:05 AM:
---

[~bibinchundatt] [~leftnoteasy]

This is a big improvement for NodeManager scalability.

I submitted a draft patch to change the event trigger to the app itself, to 
avoid flooding the async dispatcher.

Please share any advice. 

Thanks.


was (Author: zhuqi):
[~bibinchundatt] [~leftnoteasy]

This is a big improvement for NodeManager scalability.

I submitted a draft patch to change the event trigger to the app itself.

Please share any advice. 

Thanks.

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-9618.001.patch
>
>
> In the current implementation, NodeListManager events block the async 
> dispatcher, which can crash the RM and slow down event processing.
> # Cluster restart with 1K running apps: each node-usable event will create 1K 
> events, so overall there could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked till the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.
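
A rough sketch of the proposed solution, not the attached patch: give the 
node-list events their own queue and worker thread, so fanning out one event 
per running app no longer blocks the central async dispatcher. Type and 
method names are assumptions.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public final class NodeEventForwarder<E> {

  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();

  public NodeEventForwarder(Consumer<E> rmAppEventHandler) {
    Thread worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          // Deliver directly to the RMApp event handler (solution step 2).
          rmAppEventHandler.accept(queue.take());
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }, "node-event-forwarder");
    worker.setDaemon(true);
    worker.start();
  }

  // Called by NodeListManager: enqueue without blocking the dispatcher.
  public void handle(E event) {
    queue.offer(event);
  }
}
{code}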






[jira] [Commented] (YARN-9618) NodeListManager event improvement

2020-12-02 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242179#comment-17242179
 ] 

zhuqi commented on YARN-9618:
-

[~bibinchundatt] [~leftnoteasy]

This is a big improvement for NodeManager scalability.

I submitted a draft patch to change the event trigger to the app itself.

Please share any advice. 

Thanks.

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-9618.001.patch
>
>
> In the current implementation, NodeListManager events block the async 
> dispatcher, which can crash the RM and slow down event processing.
> # Cluster restart with 1K running apps: each node-usable event will create 1K 
> events, so overall there could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked till the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.






[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-12-01 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242013#comment-17242013
 ] 

zhuqi commented on YARN-10380:
--

[~jiwq]

Thanks a lot for your review; I have updated the latest PR.

[~ztang], please let me know if you have any other advice before commit.

Thanks.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
>  Labels: pull-request-available
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Assigned] (YARN-9618) NodeListManager event improvement

2020-12-01 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-9618:
---

Assignee: zhuqi

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: zhuqi
>Priority: Critical
>
> In the current implementation, NodeListManager events block the async 
> dispatcher, which can crash the RM and slow down event processing.
> # Cluster restart with 1K running apps: each node-usable event will create 1K 
> events, so overall there could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked till the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher directly, call the RMApp event handler.






[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently

2020-12-01 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241575#comment-17241575
 ] 

zhuqi commented on YARN-10500:
--

[~aajisaka] I hit it too when submitting a patch based on trunk.

> TestDelegationTokenRenewer fails intermittently
> ---
>
> Key: YARN-10500
> URL: https://issues.apache.org/jira/browse/YARN-10500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Priority: Major
>  Labels: flaky-test
>
> TestDelegationTokenRenewer sometimes timeouts.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 83.675 s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
> [ERROR] 
> testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer)
>   Time elapsed: 30.065 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 30000 
> milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at 
> org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}






[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-12-01 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241453#comment-17241453
 ] 

zhuqi commented on YARN-10380:
--

[~ztang], I have manually tested async scheduling, and also added a unit test 
to the TestCapacitySchedulerAsyncScheduling class in the latest PR.

Thanks a lot.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Comment Edited] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-11-30 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240742#comment-17240742
 ] 

zhuqi edited comment on YARN-10169 at 12/1/20, 6:11 AM:


[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] @ 

I have submitted a patch to fix it; could anyone review it? 

Thanks.


was (Author: zhuqi):
[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph]

I have submitted a patch to fix it; could anyone review it? 

Thanks.

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10169.001.patch, YARN-10169.002.patch, 
> YARN-10169.003.patch
>
>
> To me this is a bug: if a queue has capacity set to a float and 
> maximum-capacity set to an absolute value, the existing logic allows the behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when configured like this.






[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-30 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241257#comment-17241257
 ] 

zhuqi commented on YARN-10380:
--

[~ztang]

Thanks a lot for your patient review.

I have fixed the checkstyle issues in the updated PR.

[~wangda], do we need to add new unit tests here?

 

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is it the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  






[jira] [Commented] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-11-30 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240765#comment-17240765
 ] 

zhuqi commented on YARN-10169:
--

Fixed testComplexValidateAbsoluteResourceConfig so it does not trigger the 
mixed case, in patch 002.

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10169.001.patch, YARN-10169.002.patch
>
>
> To me this is a bug: if a queue has capacity set to a float and 
> maximum-capacity set to an absolute value, the existing logic allows the behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when configured like this.






[jira] [Commented] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-11-30 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240742#comment-17240742
 ] 

zhuqi commented on YARN-10169:
--

[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph]

I have submitted a patch to fix it; could anyone review it? 

Thanks.

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10169.001.patch
>
>
> To me this is a bug: if a queue has capacity set to a float and 
> maximum-capacity set to an absolute value, the existing logic allows the behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when configured like this.






[jira] [Assigned] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail

2020-11-29 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10169:


Assignee: zhuqi  (was: Tanu Ajmera)

> Mixed absolute resource value and percentage-based resource value in 
> CapacityScheduler should fail
> --
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Blocker
>
> To me this is a bug: if a queue has capacity set to a float and 
> maximum-capacity set to an absolute value, the existing logic allows the behavior.
> For example:
> {code:java}
> queue.capacity = 0.8 
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when configured like this.






[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.

2020-11-28 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239997#comment-17239997
 ] 

zhuqi commented on YARN-10503:
--

cc [~wangda] [~sunilg] [~tangzhankun]

Please share any advice about it.

> Support queue capacity in terms of absolute resources with more resourceTypes.
> --
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.3.1, 3.4.1
>
>
> Currently the absolute resource types are memory and vcores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when there are absolute demands on 
> different resource types.
>  
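
A hedged sketch of the direction this asks for: derive the set of resource 
types allowed in absolute capacity values from whatever the cluster registers, 
instead of the fixed two-value enum. The helper and the resource names below 
are assumptions, not the patch's API.
{code:java}
import java.util.LinkedHashSet;
import java.util.Set;

public final class ConfigurableAbsoluteResources {

  // Illustrative replacement for the MEMORY/VCORES enum: every resource
  // type registered with the cluster (e.g. via resource-types.xml) becomes
  // usable in bracketed absolute capacity values.
  static Set<String> absoluteResourceTypes(Set<String> registeredTypes) {
    Set<String> types = new LinkedHashSet<>();
    types.add("memory-mb"); // today's MEMORY
    types.add("vcores");    // today's VCORES
    types.addAll(registeredTypes); // e.g. "yarn.io/gpu"
    return types;
  }
}
{code}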






[jira] [Comment Edited] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.

2020-11-28 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239997#comment-17239997
 ] 

zhuqi edited comment on YARN-10503 at 11/28/20, 4:05 PM:
-

cc [~wangda] [~sunilg] [~tangzhankun]

Please share any advice about it.

Thanks.


was (Author: zhuqi):
cc [~wangda] [~sunilg] [~tangzhankun]

Please share any advice about it.

> Support queue capacity in terms of absolute resources with more resourceTypes.
> --
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.3.1, 3.4.1
>
>
> Currently the absolute resource types are memory and vcores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resourceTypes.
> It's very important for cluster scaling when there are absolute demands on 
> different resource types.
>  






[jira] [Created] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.

2020-11-28 Thread zhuqi (Jira)
zhuqi created YARN-10503:


 Summary: Support queue capacity in terms of absolute resources 
with more resourceTypes.
 Key: YARN-10503
 URL: https://issues.apache.org/jira/browse/YARN-10503
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhuqi
Assignee: zhuqi
 Fix For: 3.3.1, 3.4.1


Currently the absolute resource types are memory and vcores.
{code:java}
/**
 * Different resource types supported.
 */
public enum AbsoluteResourceType {
  MEMORY, VCORES;
}{code}
But in our GPU production clusters, we need to support more resourceTypes.

It's very important for cluster scaling when there are absolute demands on 
different resource types.

 






[jira] [Comment Edited] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling

2020-11-27 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239571#comment-17239571
 ] 

zhuqi edited comment on YARN-6224 at 11/27/20, 9:41 AM:


cc [~wangda] [~bibinchundatt]  [~sunilg] [~tangzhankun]

It's a good policy to choose nodes based on the dominant node resource, now 
that there are more and more resource types in our production cluster, such as 
GPU/FPGA etc.
 # We should pass a ResourceInfo which includes the DominantInfo to the multi 
node sort policy.
 # We should sort the nodes according to the dominant resource that is passed in.
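
A hedged sketch of step 2, assuming each node is summarized as a map from 
resource name to a (used, total) pair; the real implementation would go 
through the multi-node sorting policy, so everything here is illustrative.
{code:java}
import java.util.Comparator;
import java.util.Map;

public final class DominantUtilizationComparator
    implements Comparator<Map<String, long[]>> {

  // The dominant utilization of a node is the highest used/total ratio
  // across its resource types (vcores, memory, GPU, FPGA, ...).
  static double dominantUtilization(Map<String, long[]> node) {
    double dominant = 0.0;
    for (long[] usedTotal : node.values()) {
      if (usedTotal[1] > 0) {
        dominant = Math.max(dominant, (double) usedTotal[0] / usedTotal[1]);
      }
    }
    return dominant;
  }

  @Override
  public int compare(Map<String, long[]> a, Map<String, long[]> b) {
    // Sort so the node whose dominant resource is least utilized comes first.
    return Double.compare(dominantUtilization(a), dominantUtilization(b));
  }
}
{code}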

 

 


was (Author: zhuqi):
cc [~wangda] [~bibinchundatt]  [~sunilg] [~tangzhankun]

It's a good policy to choose nodes based on the dominant node resource, now 
that there are more and more resource types in our production cluster, such as 
GPU/FPGA etc.

 

 

> Should consider utilization of each ResourceType on node while scheduling
> -
>
> Key: YARN-6224
> URL: https://issues.apache.org/jira/browse/YARN-6224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Jie
>Assignee: zhuqi
>Priority: Major
>
> In situation like YARN-6101, if we consider all type of resource(vcore, 
> memory) utilization on node rather than just answer we can allocate or not, 
> we are more likely to have better resource utilization as a whole.
> It is possible that we have a set of candidate nodes, then find the most 
> promising node to assign to one request considering node resource utilization 
> with global scheduling.






[jira] [Assigned] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling

2020-11-27 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-6224:
---

Assignee: zhuqi

> Should consider utilization of each ResourceType on node while scheduling
> -
>
> Key: YARN-6224
> URL: https://issues.apache.org/jira/browse/YARN-6224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Jie
>Assignee: zhuqi
>Priority: Major
>
> In situation like YARN-6101, if we consider all type of resource(vcore, 
> memory) utilization on node rather than just answer we can allocate or not, 
> we are more likely to have better resource utilization as a whole.
> It is possible that we have a set of candidate nodes, then find the most 
> promising node to assign to one request considering node resource utilization 
> with global scheduling.






[jira] [Comment Edited] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling

2020-11-27 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239571#comment-17239571
 ] 

zhuqi edited comment on YARN-6224 at 11/27/20, 8:30 AM:


cc [~wangda] [~bibinchundatt]  [~sunilg] [~tangzhankun]

It's a good policy to choose nodes based on the dominant node resource, now 
that there are more and more resource types in our production cluster, such as 
GPU/FPGA etc.

 

 


was (Author: zhuqi):
cc [~wangda] [~bibinchundatt]  [~sunilg]

It's a good policy to choose nodes based on the dominant node resource, now 
that there are more and more resource types in our production cluster, such as 
GPU/FPGA etc.

 

 

> Should consider utilization of each ResourceType on node while scheduling
> -
>
> Key: YARN-6224
> URL: https://issues.apache.org/jira/browse/YARN-6224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Jie
>Priority: Major
>
> In situation like YARN-6101, if we consider all type of resource(vcore, 
> memory) utilization on node rather than just answer we can allocate or not, 
> we are more likely to have better resource utilization as a whole.
> It is possible that we have a set of candidate nodes, then find the most 
> promising node to assign to one request considering node resource utilization 
> with global scheduling.






[jira] [Commented] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling

2020-11-27 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239571#comment-17239571
 ] 

zhuqi commented on YARN-6224:
-

cc [~wangda] [~bibinchundatt]  [~sunilg]

It's a good policy to choose nodes based on the dominant node resource, now 
that there are more and more resource types in our production cluster, such as 
GPU/FPGA etc.

 

 

> Should consider utilization of each ResourceType on node while scheduling
> -
>
> Key: YARN-6224
> URL: https://issues.apache.org/jira/browse/YARN-6224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Jie
>Priority: Major
>
> In situation like YARN-6101, if we consider all type of resource(vcore, 
> memory) utilization on node rather than just answer we can allocate or not, 
> we are more likely to have better resource utilization as a whole.
> It is possible that we have a set of candidate nodes, then find the most 
> promising node to assign to one request considering node resource utilization 
> with global scheduling.






[jira] [Commented] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2020-11-26 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239526#comment-17239526
 ] 

zhuqi commented on YARN-8557:
-

[~cheersyang] [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang]

I added the patch; please review it:

1. Add a configurable HB lag.

2. Support excluding unhealthy, decommissioned, and decommissioning nodes.

Thanks.

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-8557.001.patch
>
>
> Currently only HB-lagged nodes are handled, with a hard-coded threshold of 2 
> times the HB lag, which we should make configurable. Moreover, we need to 
> exclude unhealthy and decommissioned nodes too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread

2020-11-26 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-8557:
---

Assignee: zhuqi

> Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
> 
>
> Key: YARN-8557
> URL: https://issues.apache.org/jira/browse/YARN-8557
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: zhuqi
>Priority: Major
>
> Currently only HB-lagged nodes are handled, with a hard-coded threshold of 2 
> times the HB lag, which we should make configurable. Moreover, we need to 
> exclude unhealthy and decommissioned nodes too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-26 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239499#comment-17239499
 ] 

zhuqi commented on YARN-10380:
--

It seems the patch did not trigger Jenkins, so I added a pull request to trigger it.

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code.
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  
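
To make the suggested entry point concrete, here is a minimal sketch of the partition-first loop, with assumed types and helper names; it only illustrates the iteration order, not the real CapacityScheduler allocation path:

{code:java}
import java.util.List;
import java.util.Map;

final class PartitionFirstScheduler {

  /** Illustrative stand-in for a sorted multi-node candidate set. */
  interface CandidateSet { }

  private final Map<String, CandidateSet> candidatesByPartition; // label -> set

  PartitionFirstScheduler(Map<String, CandidateSet> candidatesByPartition) {
    this.candidatesByPartition = candidatesByPartition;
  }

  /** One async multi-node scheduling pass: each partition is visited once. */
  void schedulePass(List<String> partitionLabels) {
    for (String label : partitionLabels) {
      CandidateSet candidates = candidatesByPartition.get(label);
      if (candidates != null) {
        allocateContainersOnMultiNodes(candidates);
      }
    }
  }

  private void allocateContainersOnMultiNodes(CandidateSet candidates) {
    // In the real scheduler this would drive CapacityScheduler allocation
    // against the candidate set; it is left abstract in this sketch.
  }
}
{code}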



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-26 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239076#comment-17239076
 ] 

zhuqi edited comment on YARN-10380 at 11/27/20, 2:25 AM:
-

[~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang]

I have attached the draft patch; could you review it?

Thanks.

 


was (Author: zhuqi):
[~wangda] [~sunilg] [~BilwaST] [~Tao Yang]

I have attached the draft patch; could you review it?

Thanks.

 

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code.
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9449) Non-exclusive labels can create reservation loop on cluster without unlabeled node

2020-11-26 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239269#comment-17239269
 ] 

zhuqi commented on YARN-9449:
-

[~aditya9277]

If you apply the patch from https://issues.apache.org/jira/browse/YARN-9903, it 
may be helpful.

 

> Non-exclusive labels can create reservation loop on cluster without unlabeled 
> node
> --
>
> Key: YARN-9449
> URL: https://issues.apache.org/jira/browse/YARN-9449
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.5
>Reporter: Brandon Scheller
>Priority: Major
>
> https://issues.apache.org/jira/browse/YARN-5342 added a counter to YARN so 
> that unscheduled resource requests are first attempted on unlabeled nodes.
> This counter is reset only when a scheduling attempt happens on an 
> unlabeled node.
> On Hadoop clusters with only labeled nodes, this counter can never be reset, 
> and therefore it blocks skipping that node.
> Because the node will not be skipped, it creates the loop shown below in the 
> YARN RM logs.
> This can block scheduling of, for example, a Spark executor and cause the 
> Spark application to get stuck.
>  
> {{_2019-02-18 23:54:22,591 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
> (ResourceManager Event Processor): container_1550533628872_0003_01_23 
> Container Transitioned from NEW to RESERVED 2019-02-18 23:54:22,591 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator
>  (ResourceManager Event Processor): Reserved container 
> application=application_1550533628872_0003 resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3
>  cluster= 2019-02-18 23:54:22,592 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue 
> (ResourceManager Event Processor): assignedContainer queue=root 
> usedCapacity=0.0 absoluteUsedCapacity=0.0 used= 
> cluster= 2019-02-18 23:54:23,592 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
>  (ResourceManager Event Processor): Trying to fulfill reservation for 
> application application_1550533628872_0003 on node: 
> ip-10-0-0-122.ec2.internal:8041 2019-02-18 23:54:23,592 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
>  (ResourceManager Event Processor): Application 
> application_1550533628872_0003 unreserved on node host: 
> ip-10-0-0-122.ec2.internal:8041 #containers=1 available= vCores:7> used=, currently has 0 at priority 1; 
> currentReservation  on node-label=LABELED 2019-02-18 
> 23:54:23,593 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
> (ResourceManager Event Processor): container_1550533628872_0003_01_24 
> Container Transitioned from NEW to RESERVED 2019-02-18 23:54:23,593 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator
>  (ResourceManager Event Processor): Reserved container 
> application=application_1550533628872_0003 resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3
>  cluster= 2019-02-18 23:54:23,593 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue 
> (ResourceManager Event Processor): assignedContainer queue=root 
> usedCapacity=0.0 absoluteUsedCapacity=0.0 used= 
> cluster= 2019-02-18 23:54:24,593 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
>  (ResourceManager Event Processor): Trying to fulfill reservation for 
> application application_1550533628872_0003 on node: 
> ip-10-0-0-122.ec2.internal:8041 2019-02-18 23:54:24,593 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
>  (ResourceManager Event Processor): Application 
> application_1550533628872_0003 unreserved on node host: 
> ip-10-0-0-122.ec2.internal:8041 #containers=1 available= vCores:7> used=, currently has 0 at priority 1; 
> currentReservation  on node-label=LABELED 2019-02-18 
> 23:54:24,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
> (ResourceManager Event Processor): container_1550533628872_0003_01_25 
> Container Transitioned from NEW to RESERVED 2019-02-18 23:54:24,594 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator
>  (ResourceManager Event Processor): Reserved container 
> application=application_1550533628872_0003 resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3
>  
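
Restating the mechanism described above in code form, a simplified and purely illustrative sketch of the missed-opportunity counter; the real logic lives in the scheduler and is more involved:

{code:java}
// Simplified, illustrative-only model of the YARN-5342 counter: the reset
// happens only when an unlabeled node is attempted, so on a cluster with zero
// unlabeled nodes the reset branch is unreachable and the reserve/unreserve
// loop shown in the logs above can repeat indefinitely.
final class MissedOpportunityCounter {
  private int missed = 0;

  /** Returns true when a non-exclusive request may use a labeled node. */
  boolean mayAllocateNonExclusively(boolean nodeIsUnlabeled,
      int requiredMisses) {
    if (nodeIsUnlabeled) {
      missed = 0;   // the only reset point
      return false; // the request is tried on the unlabeled node itself
    }
    missed++;
    return missed >= requiredMisses;
  }
}
{code}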

[jira] [Comment Edited] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-26 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239076#comment-17239076
 ] 

zhuqi edited comment on YARN-10380 at 11/26/20, 10:55 AM:
--

[~wangda] [~sunilg] [~BilwaST] [~Tao Yang]

I have attached the draft patch; could you review it?

Thanks.

 


was (Author: zhuqi):
[~wangda] [~sunilg] [~BilwaST]

I have attached the draft patch; do you have any advice?

Thanks.

 

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code.
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-25 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239076#comment-17239076
 ] 

zhuqi commented on YARN-10380:
--

[~wangda] [~sunilg] [~BilwaST]

I have attached the draft patch; do you have any advice?

Thanks.

 

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code.
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler

2020-11-25 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10380:


Assignee: zhuqi

> Import logic of multi-node allocation in CapacityScheduler
> --
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-10380.001.patch
>
>
> *1) Entry point:* 
> When we do multi-node allocation, we're using the same logic of async 
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
>  for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>      if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>         continue;
>      }
>      cs.allocateContainersToNode(node.getNodeID(), false);
>   }
>  } {code}
> Is this the most effective way to do multi-node scheduling? Should we allocate 
> based on partitions? In the above logic, if we have thousands of nodes in one 
> partition, we will repeatedly access all nodes of the partition thousands of 
> times.
> I would suggest making the entry points for node-heartbeat, async-scheduling 
> (single node), and async-scheduling (multi-node) different.
> Node-heartbeat and async-scheduling (single node) can still be similar and 
> share most of the code.
> async-scheduling (multi-node): should iterate partition first, using pseudo 
> code like: 
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception

2020-11-23 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236015#comment-17236015
 ] 

zhuqi edited comment on YARN-10132 at 11/23/20, 9:55 AM:
-

cc [~BilwaST]  [~subru] :

I have attached a patch for this; could you review it for merging?

Thanks.


was (Author: zhuqi):
cc [~BilwaST] :

I have attached a patch for this; could you review it for merging?

Thanks.

> For Federation,yarn applicationattempt fail command throws an exception
> ---
>
> Key: YARN-10132
> URL: https://issues.apache.org/jira/browse/YARN-10132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10132.001.patch, YARN-10132.002.patch, 
> YARN-10132.003.patch, YARN-10132.004.patch
>
>
> yarn applicationattempt fail command is failing with exception  
> “org.apache.commons.lang.NotImplementedException: Code is not implemented”.
> {noformat}
>  ./yarn applicationattempt -fail appattempt_1581497870689_0001_01
> Failing attempt appattempt_1581497870689_0001_01 of application 
> application_1581497870689_0001
> 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt 
> appattempt_1581497870689_0001_01
> Exception in thread "main" org.apache.commons.lang.NotImplementedException: 
> Code is not implemented
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455)
> at 

[jira] [Comment Edited] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented

2020-11-23 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237256#comment-17237256
 ] 

zhuqi edited comment on YARN-10111 at 11/23/20, 9:54 AM:
-

cc [~BilwaST] [~ayushtkn] [~tangzhankun]

Now I have added a draft patch that uses a consistent queue name across all sub 
clusters.

We should discuss how to get the capacity-related values: every sub cluster has 
its own total resources, and the getQueueInfo API can only return values for a 
single RM, so the capacity-related values would first require the resource info 
of all sub clusters.

Do you have any advice?

Thanks. 

 


was (Author: zhuqi):
cc [~BilwaST]

Now I have added a draft patch that uses a consistent queue name across all sub 
clusters.

We should discuss how to get the capacity-related values: every sub cluster has 
its own total resources, and the getQueueInfo API can only return values for a 
single RM, so the capacity-related values would first require the resource info 
of all sub clusters.

Do you have any advice?

Thanks. 

 

> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented
> --
>
> Key: YARN-10111
> URL: https://issues.apache.org/jira/browse/YARN-10111
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10111.001.patch
>
>
> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception

2020-11-23 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236015#comment-17236015
 ] 

zhuqi edited comment on YARN-10132 at 11/23/20, 9:53 AM:
-

cc [~BilwaST] :

I have attached a patch for this; could you review it for merging?

Thanks.


was (Author: zhuqi):
cc [~bilwa_st] :

I have attached a patch for this; could you review it for merging?

Thanks.

> For Federation,yarn applicationattempt fail command throws an exception
> ---
>
> Key: YARN-10132
> URL: https://issues.apache.org/jira/browse/YARN-10132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10132.001.patch, YARN-10132.002.patch, 
> YARN-10132.003.patch, YARN-10132.004.patch
>
>
> yarn applicationattempt fail command is failing with exception  
> “org.apache.commons.lang.NotImplementedException: Code is not implemented”.
> {noformat}
>  ./yarn applicationattempt -fail appattempt_1581497870689_0001_01
> Failing attempt appattempt_1581497870689_0001_01 of application 
> application_1581497870689_0001
> 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt 
> appattempt_1581497870689_0001_01
> Exception in thread "main" org.apache.commons.lang.NotImplementedException: 
> Code is not implemented
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455)
> at 

[jira] [Comment Edited] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented

2020-11-23 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237256#comment-17237256
 ] 

zhuqi edited comment on YARN-10111 at 11/23/20, 9:52 AM:
-

cc [~BilwaST]

Now I have added a draft patch that uses a consistent queue name across all sub 
clusters.

We should discuss how to get the capacity-related values: every sub cluster has 
its own total resources, and the getQueueInfo API can only return values for a 
single RM, so the capacity-related values would first require the resource info 
of all sub clusters.

Do you have any advice?

Thanks. 

 


was (Author: zhuqi):
cc [~bilwa_st]

Now I have added a draft patch that uses a consistent queue name across all sub 
clusters.

We should discuss how to get the capacity-related values: every sub cluster has 
its own total resources, and the getQueueInfo API can only return values for a 
single RM, so the capacity-related values would first require the resource info 
of all sub clusters.

Do you have any advice?

Thanks. 

 

> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented
> --
>
> Key: YARN-10111
> URL: https://issues.apache.org/jira/browse/YARN-10111
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10111.001.patch
>
>
> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented

2020-11-23 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237256#comment-17237256
 ] 

zhuqi commented on YARN-10111:
--

cc [~bilwa_st]

Now I have added a draft patch that uses a consistent queue name across all sub 
clusters.

We should discuss how to get the capacity-related values: every sub cluster has 
its own total resources, and the getQueueInfo API can only return values for a 
single RM, so the capacity-related values would first require the resource info 
of all sub clusters. A rough sketch of one possible aggregation follows below.

Do you have any advice?

Thanks. 
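
To make the open question concrete, a minimal sketch of one possible aggregation, with placeholder helpers standing in for the federation facilities; the real patch may merge differently or return per-sub-cluster results instead:

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetQueueInfoRequest;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;

abstract class FederatedQueueCapacity {

  // Placeholders for real federation facilities; not existing APIs.
  abstract List<SubClusterId> activeSubClusterIds() throws YarnException;
  abstract ApplicationClientProtocol rmProxyFor(SubClusterId subClusterId);
  abstract long totalMemoryOf(SubClusterId subClusterId);

  /** Queue capacity merged across sub-clusters, weighted by RM size. */
  float mergedQueueCapacity(String queuePath)
      throws YarnException, IOException {
    long federationTotalMemory = 0;
    double absoluteQueueMemory = 0;
    for (SubClusterId subClusterId : activeSubClusterIds()) {
      QueueInfo queue = rmProxyFor(subClusterId)
          .getQueueInfo(GetQueueInfoRequest
              .newInstance(queuePath, false, false, false))
          .getQueueInfo();
      long rmTotalMemory = totalMemoryOf(subClusterId);
      federationTotalMemory += rmTotalMemory;
      // A per-RM capacity is relative to that RM's own resources, so convert
      // it to an absolute amount before merging across sub-clusters.
      absoluteQueueMemory += queue.getCapacity() * rmTotalMemory;
    }
    return federationTotalMemory == 0
        ? 0f : (float) (absoluteQueueMemory / federationTotalMemory);
  }
}
{code}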

 

> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented
> --
>
> Key: YARN-10111
> URL: https://issues.apache.org/jira/browse/YARN-10111
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Blocker
> Attachments: YARN-10111.001.patch
>
>
> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented

2020-11-23 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10111:
-
Attachment: (was: YARN-10111.001.patch)

> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented
> --
>
> Key: YARN-10111
> URL: https://issues.apache.org/jira/browse/YARN-10111
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Blocker
>
> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented

2020-11-23 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10111:
-
Attachment: YARN-10111.001.patch

> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented
> --
>
> Key: YARN-10111
> URL: https://issues.apache.org/jira/browse/YARN-10111
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Blocker
>
> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented

2020-11-23 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10111:


Assignee: zhuqi

> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented
> --
>
> Key: YARN-10111
> URL: https://issues.apache.org/jira/browse/YARN-10111
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Blocker
>
> In Federation cluster Distributed Shell Application submission fails as 
> YarnClient#getQueueInfo is not implemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services

2020-11-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236588#comment-17236588
 ] 

zhuqi commented on YARN-10311:
--

[~BilwaST] This is a valuable improvement; is it on track to be merged?

> Yarn Service should support obtaining tokens from multiple name services
> 
>
> Key: YARN-10311
> URL: https://issues.apache.org/jira/browse/YARN-10311
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Major
> Attachments: YARN-10311.001.patch, YARN-10311.002.patch
>
>
> Currently YARN services support tokens for a single name service. We can add a 
> new conf called
> "yarn.service.hdfs-servers" to support this



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception

2020-11-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236015#comment-17236015
 ] 

zhuqi commented on YARN-10132:
--

cc [~bilwa_st] :

I have attached a patch for this; could you review it for merging?

Thanks.

> For Federation,yarn applicationattempt fail command throws an exception
> ---
>
> Key: YARN-10132
> URL: https://issues.apache.org/jira/browse/YARN-10132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10132.001.patch, YARN-10132.002.patch
>
>
> yarn applicationattempt fail command is failing with exception  
> “org.apache.commons.lang.NotImplementedException: Code is not implemented”.
> {noformat}
>  ./yarn applicationattempt -fail appattempt_1581497870689_0001_01
> Failing attempt appattempt_1581497870689_0001_01 of application 
> application_1581497870689_0001
> 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt 
> appattempt_1581497870689_0001_01
> Exception in thread "main" org.apache.commons.lang.NotImplementedException: 
> Code is not implemented
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:119)
> {noformat}
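
For context, a minimal sketch of how FederationClientInterceptor#failApplicationAttempt could route the call to the application's home sub-cluster instead of throwing NotImplementedException; the two abstract helpers are placeholders for the real federation facilities and may differ from the actual patch:

{code:java}
import java.io.IOException;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.FailApplicationAttemptRequest;
import org.apache.hadoop.yarn.api.protocolrecords.FailApplicationAttemptResponse;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;

abstract class FederatedFailAttempt {

  // Placeholders for the interceptor's real federation facilities.
  abstract SubClusterId homeSubClusterOf(ApplicationId appId)
      throws YarnException;
  abstract ApplicationClientProtocol rmProxyFor(SubClusterId subClusterId);

  FailApplicationAttemptResponse failApplicationAttempt(
      FailApplicationAttemptRequest request)
      throws YarnException, IOException {
    if (request == null || request.getApplicationAttemptId() == null) {
      throw new YarnException(
          "Missing failApplicationAttempt request or ApplicationAttemptId.");
    }
    ApplicationId appId = request.getApplicationAttemptId().getApplicationId();
    // Resolve the sub-cluster that owns this application, then forward the
    // request unchanged to that sub-cluster's RM.
    SubClusterId home = homeSubClusterOf(appId);
    return rmProxyFor(home).failApplicationAttempt(request);
  }
}
{code}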



--

[jira] [Assigned] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception

2020-11-20 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10132:


Assignee: zhuqi

> For Federation,yarn applicationattempt fail command throws an exception
> ---
>
> Key: YARN-10132
> URL: https://issues.apache.org/jira/browse/YARN-10132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Sushanta Sen
>Assignee: zhuqi
>Priority: Major
>
> yarn applicationattempt fail command is failing with exception  
> “org.apache.commons.lang.NotImplementedException: Code is not implemented”.
> {noformat}
>  ./yarn applicationattempt -fail appattempt_1581497870689_0001_01
> Failing attempt appattempt_1581497870689_0001_01 of application 
> application_1581497870689_0001
> 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt 
> appattempt_1581497870689_0001_01
> Exception in thread "main" org.apache.commons.lang.NotImplementedException: 
> Code is not implemented
> at 
> org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980)
> at 
> org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85)
> at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:119)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-11-05 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227197#comment-17227197
 ] 

zhuqi edited comment on YARN-10463 at 11/6/20, 7:18 AM:


Hi [~BilwaST] 
Rebased and fixed all checkstyle issues in the latest patch.

Thanks for your review.


was (Author: zhuqi):
Hi [~BilwaST]
Fixed all checkstyle issues in the latest patch.

Thanks.

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch, YARN-10463.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-11-05 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227197#comment-17227197
 ] 

zhuqi commented on YARN-10463:
--

Hi [~BilwaST]
Fixed all checkstyle issues in the latest patch.

Thanks.

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch, YARN-10463.004.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-11-05 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227129#comment-17227129
 ] 

zhuqi commented on YARN-10463:
--

Hi [~BilwaST]

In the new patch I have fixed the above problem, as well as the bug where 
getApplicationAttemptReportLatency was not initialized.

Thanks.

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch, 
> YARN-10463.003.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10470) When building new web ui with root user, the bower install should support it.

2020-10-22 Thread zhuqi (Jira)
zhuqi created YARN-10470:


 Summary: When building new web ui with root user, the bower 
install should support it.
 Key: YARN-10470
 URL: https://issues.apache.org/jira/browse/YARN-10470
 Project: Hadoop YARN
  Issue Type: Bug
  Components: webapp
Reporter: zhuqi
Assignee: zhuqi
 Attachments: image-2020-10-22-22-39-48-709.png

!image-2020-10-22-22-39-48-709.png|width=1120,height=342!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-10-21 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218198#comment-17218198
 ] 

zhuqi edited comment on YARN-10463 at 10/22/20, 5:50 AM:
-

Hi [~BilwaST]

I have addressed your comments in the new patch, but how can we fill in the 
appAttemptId for the RMApp in RMClientService? The test will fail in 
RMClientService:
{code:java}
// Attempt lookup in the RM; throws when the attempt was never registered.
RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); 
if (appAttempt == null) {
 throw new ApplicationAttemptNotFoundException(
 "ApplicationAttempt with id '" + appAttemptId +
 "' doesn't exist in RM.");
}
{code}
Thanks.

 


was (Author: zhuqi):
Hi [~BilwaST]

I have addressed your comments in the new patch, but how can we fill in the 
appAttemptId for the RMClientService? The test will fail in RMClientService:
{code:java}
// Attempt lookup in the RM; throws when the attempt was never registered.
RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); 
if (appAttempt == null) {
 throw new ApplicationAttemptNotFoundException(
 "ApplicationAttempt with id '" + appAttemptId +
 "' doesn't exist in RM.");
}
{code}
Thanks.

 

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-10-21 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218198#comment-17218198
 ] 

zhuqi commented on YARN-10463:
--

Hi [~BilwaST]

I have addressed your comments in the new patch, but how can we fill in the 
appAttemptId for the RMClientService? The test will fail in RMClientService:
{code:java}
// Attempt lookup in the RM; throws when the attempt was never registered.
RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); 
if (appAttempt == null) {
 throw new ApplicationAttemptNotFoundException(
 "ApplicationAttempt with id '" + appAttemptId +
 "' doesn't exist in RM.");
}
{code}
Thanks.

 

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch, YARN-10463.002.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10144) Federation: Add missing FederationClientInterceptor APIs

2020-10-16 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-10144:


Assignee: zhuqi

> Federation: Add missing FederationClientInterceptor APIs
> 
>
> Key: YARN-10144
> URL: https://issues.apache.org/jira/browse/YARN-10144
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Reporter: D M Murali Krishna Reddy
>Assignee: zhuqi
>Priority: Major
>
> In FederationClientInterceptor, many API's are not Implemented.
>  * getClusterNodes
>  * getQueueInfo
>  * getQueueUserAcls
>  * moveApplicationAcrossQueues
>  * getNewReservation
>  * submitReservation
>  * listReservations
>  * updateReservation
>  * deleteReservation
>  * getNodeToLabels
>  * getLabelsToNodes
>  * getClusterNodeLabels
>  * getApplicationAttemptReport
>  * getApplicationAttempts
>  * getContainerReport
>  * getContainers
>  * getDelegationToken
>  * renewDelegationToken
>  * cancelDelegationToken
>  * failApplicationAttempt
>  * updateApplicationPriority
>  * signalToContainer
>  * updateApplicationTimeouts
>  * getResourceProfiles
>  * getResourceProfile
>  * getResourceTypeInfo
>  * getAttributesToNodes
>  * getClusterNodeAttributes
>  * getNodesToAttributes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-10-16 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215259#comment-17215259
 ] 

zhuqi commented on YARN-10463:
--

Hi [~BilwaST] 

Thanks for your quick reply and review.

I understand it now, but one more question:

When we call getQueueInfo with one queue's absolute path and two or more sub 
clusters have the same absolute path, does that mean we will return a map with 
the sub-cluster ID as the key and the queue info as the value?

 

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-10-16 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215245#comment-17215245
 ] 

zhuqi commented on YARN-10463:
--

cc [~adam.antal] [~Tao Yang]  [~dmmkr] [~BilwaST]  

I have added getApplicationAttemptReport for Federation.

I can also take the remaining missing FederationClientInterceptor APIs, but I 
have one question:

Should we pass a sub-cluster ID and return only that sub-cluster's results, or 
should we return the results of all sub-clusters to the client?

For example, for getClusterNodes, should we list the nodes of all sub-clusters 
one by one, or just pass the sub-cluster ID we want to query? (A rough sketch 
of the aggregate option follows below.)

Thanks.
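
The aggregate option for getClusterNodes could look roughly like the sketch below; the helpers are placeholders, and whether to aggregate or require an explicit sub-cluster id is exactly the open question:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetClusterNodesRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetClusterNodesResponse;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;

abstract class FederatedClusterNodes {

  // Placeholders for the interceptor's real federation facilities.
  abstract List<SubClusterId> activeSubClusterIds() throws YarnException;
  abstract ApplicationClientProtocol rmProxyFor(SubClusterId subClusterId);

  /** Concatenates the node reports of every active sub-cluster RM. */
  GetClusterNodesResponse getClusterNodes(GetClusterNodesRequest request)
      throws YarnException, IOException {
    List<NodeReport> allNodes = new ArrayList<>();
    for (SubClusterId subClusterId : activeSubClusterIds()) {
      ApplicationClientProtocol rm = rmProxyFor(subClusterId);
      allNodes.addAll(rm.getClusterNodes(request).getNodeReports());
    }
    return GetClusterNodesResponse.newInstance(allNodes);
  }
}
{code}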

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10463) For Federation, we should support getApplicationAttemptReport.

2020-10-16 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10463:
-
Summary: For Federation, we should support getApplicationAttemptReport.  
(was: For Federation, we should support getApplicationAttemptReport。)

> For Federation, we should support getApplicationAttemptReport.
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10463.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10463) For Federation, we should support getApplicationAttemptReport。

2020-10-16 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10463:
-
Parent: YARN-10144
Issue Type: Sub-task  (was: Task)

> For Federation, we should support getApplicationAttemptReport。
> --
>
> Key: YARN-10463
> URL: https://issues.apache.org/jira/browse/YARN-10463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10463) For Federation, we should support getApplicationAttemptReport。

2020-10-16 Thread zhuqi (Jira)
zhuqi created YARN-10463:


 Summary: For Federation, we should support 
getApplicationAttemptReport。
 Key: YARN-10463
 URL: https://issues.apache.org/jira/browse/YARN-10463
 Project: Hadoop YARN
  Issue Type: Task
Reporter: zhuqi
Assignee: zhuqi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-12 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212437#comment-17212437
 ] 

zhuqi commented on YARN-10448:
--

CC [~adam.antal] 

Thanks for your patient review and commit.

The unit test failures are unrelated to this change, and I have fixed the 
checkstyle warning in the new patch.

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch, YARN-10448.004.patch, 
> image-2020-10-11-22-01-37-227.png, image-2020-10-11-22-02-17-166.png
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set a default user in the SLS code if the JSON file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in the SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-11 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211922#comment-17211922
 ] 

zhuqi edited comment on YARN-10448 at 10/11/20, 2:03 PM:
-

CC [~adam.antal] 

I have moved the "default" String to a private static final constant.

The unit test works well with the "syn_generic.json" file in the resources 
folder when the "user_name": "foobar" entry is removed.

The attached picture shows the passing sample test.

Thanks. 


was (Author: zhuqi):
CC [~adam.antal] 

I have moved the "default" String to a private static final constant.

The unit test works well with the "syn_generic.json" file in the resources 
folder when the "user_name": "foobar" entry is removed.

Thanks. 

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, 
> image-2020-10-11-22-02-17-166.png
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set a default user in the SLS code if the JSON file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in the SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-11 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10448:
-
Attachment: image-2020-10-11-22-02-17-166.png

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, 
> image-2020-10-11-22-02-17-166.png
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set a default user in the SLS code if the JSON file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in the SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-11 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211934#comment-17211934
 ] 

zhuqi commented on YARN-10448:
--

!image-2020-10-11-22-01-37-227.png|width=871,height=179!

!image-2020-10-11-22-02-17-166.png|width=341,height=42!

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, 
> image-2020-10-11-22-02-17-166.png
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set a default user in the SLS code if the JSON file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in the SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-11 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10448:
-
Attachment: image-2020-10-11-22-01-37-227.png

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set a default user in the SLS code if the JSON file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in the SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-10-11 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211922#comment-17211922
 ] 

zhuqi commented on YARN-10448:
--

CC [~adam.antal] 

I have moved the "default" String to a private static final constant.

The unit test works well with the "syn_generic.json" file in the resources 
folder when the "user_name": "foobar" entry is removed.

Thanks. 

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch, 
> YARN-10448.003.patch
>
>
> When using the synthetic generator json file example from the doc ( 
> https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format
>  ), it throws the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {noformat}
> So the solution is either:
> 1) to make {{user_name}} a mandatory field, or
> 2) to set a default user in the SLS code if the JSON file does not define it.
> IMO, solution 2 might be better, because in most cases (if not all) 
> {{user_name}} has no impact on scheduler performance, thus it is reasonable 
> to make it an optional field, which is also consistent with the {{job.user}} 
> field in the SLS JSON file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format

2020-09-29 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203789#comment-17203789
 ] 

zhuqi commented on YARN-10448:
--

CC [~cheersyang] [~Tao Yang]   

Attached the 002 patch: set a default user in {{startAMFromSynthGenerator()}} 
if {{job.getUser()}} is null. A rough sketch of the idea is below.

Thanks.
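
For illustration only, a minimal sketch of such a fallback, assuming a 
hypothetical {{DEFAULT_USER}} constant; this is not the committed patch:

{code:java}
// Hypothetical sketch: fall back to a constant default user when the
// SYNTH trace does not define "user_name" and job.getUser() returns null.
private static final String DEFAULT_USER = "default";

private String resolveJobUser(SynthJob job) {
  String user = job.getUser();
  return user == null ? DEFAULT_USER : user;
}
{code}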

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch, YARN-10448.002.patch
>
>
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format

2020-09-29 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10448:
-
Summary: SLS should set default user to handle SYNTH format  (was: When use 
the sls (SYNTH JSON input file format) example the user be null will cause 
failed.)

> SLS should set default user to handle SYNTH format
> --
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch
>
>
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9693) When AMRMProxyService is enabled RMCommunicator will register with failure

2020-09-29 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203663#comment-17203663
 ] 

zhuqi commented on YARN-9693:
-

CC [~cane]  [~panlijie] 

Does this patch fix the problem? I am hitting the same error too.

Thanks.

> When AMRMProxyService is enabled RMCommunicator will register with failure
> --
>
> Key: YARN-9693
> URL: https://issues.apache.org/jira/browse/YARN-9693
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: federation
>Affects Versions: 3.1.2
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: YARN-9693.001.patch
>
>
> When we enable the AMRMProxy service, RMCommunicator registration fails as 
> shown below:
> {code:java}
> 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] 
> org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler 
> thread interrupted
> 2019-07-23 17:12:44,794 ERROR [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid 
> AMRMToken from appattempt_1563872237585_0001_02
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: 
> Invalid AMRMToken from appattempt_1563872237585_0001_02
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>   at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source)
>   at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170)
>   ... 14 more
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid AMRMToken 

[jira] [Commented] (YARN-10448) When use the sls (SYNTH JSON input file format) example the user be null will cause failed.

2020-09-24 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201457#comment-17201457
 ] 

zhuqi commented on YARN-10448:
--

No new test is needed.

> When use the sls (SYNTH JSON input file format) example the user be null will 
> cause failed.
> ---
>
> Key: YARN-10448
> URL: https://issues.apache.org/jira/browse/YARN-10448
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 3.2.1, 3.4.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: YARN-10448.001.patch
>
>
> java.lang.IllegalArgumentException: Null user
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
> at 
> org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
> at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
> at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10448) When use the sls (SYNTH JSON input file format) example the user be null will cause failed.

2020-09-24 Thread zhuqi (Jira)
zhuqi created YARN-10448:


 Summary: When use the sls (SYNTH JSON input file format) example 
the user be null will cause failed.
 Key: YARN-10448
 URL: https://issues.apache.org/jira/browse/YARN-10448
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler-load-simulator
Reporter: zhuqi
Assignee: zhuqi


java.lang.IllegalArgumentException: Null user
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269)
at 
org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191)
at 
org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161)
at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10400) Build the new version of hadoop on Mac os system with bug

2020-09-07 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191573#comment-17191573
 ] 

zhuqi commented on YARN-10400:
--

cc [~jiwq] 

It's a good choice, thanks.

> Build the new version of hadoop on Mac os system with bug
> -
>
> Key: YARN-10400
> URL: https://issues.apache.org/jira/browse/YARN-10400
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Priority: Major
> Attachments: image-2020-08-18-00-23-48-730.png
>
>
> !image-2020-08-18-00-23-48-730.png|width=1141,height=449!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10400) Build the new version of hadoop on Mac os system with bug

2020-08-17 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-10400:
-
Description: !image-2020-08-18-00-23-48-730.png|width=1141,height=449!  
(was: !image-2020-08-18-00-23-48-730.png!)

> Build the new version of hadoop on Mac os system with bug
> -
>
> Key: YARN-10400
> URL: https://issues.apache.org/jira/browse/YARN-10400
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: zhuqi
>Priority: Major
> Attachments: image-2020-08-18-00-23-48-730.png
>
>
> !image-2020-08-18-00-23-48-730.png|width=1141,height=449!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10400) Build the new version of hadoop on Mac os system with bug

2020-08-17 Thread zhuqi (Jira)
zhuqi created YARN-10400:


 Summary: Build the new version of hadoop on Mac os system with bug
 Key: YARN-10400
 URL: https://issues.apache.org/jira/browse/YARN-10400
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: zhuqi
 Attachments: image-2020-08-18-00-23-48-730.png

!image-2020-08-18-00-23-48-730.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2019-09-29 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reopened YARN-2368:
-

We can not only make yarn.resourcemanager.zk-jutemaxbuffer-bytes configurable, 
but also control whether we retry or just log some application info to 
identify which application caused the ZK buffer to blow up. That way we can 
make sure the GC problem does not happen when we retry too much and time out 
the ZK connection, and we can also find the root application that caused the 
buffer overflow. A rough sketch of the logging idea follows.
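
As an illustration of the logging idea (the guard method, its placement, and 
the log format are assumptions for discussion, not committed ZKRMStateStore 
code):

{code:java}
// Hypothetical sketch: before storing attempt state, compare the payload
// size against the configured limit and log the offending application
// instead of retrying until the ZK session times out.
private boolean fitsInZnode(ApplicationAttemptId attemptId, byte[] data,
    int maxJuteBufferBytes) {
  if (data.length > maxJuteBufferBytes) {
    LOG.warn("State of " + attemptId + " is " + data.length
        + " bytes, larger than the " + maxJuteBufferBytes
        + " byte limit (jute.maxbuffer); skipping retries for this update.");
    return false;
  }
  return true;
}
{code}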

> ResourceManager failed when ZKRMStateStore tries to update znode data larger 
> than 1MB
> -
>
> Key: YARN-2368
> URL: https://issues.apache.org/jira/browse/YARN-2368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-2368.patch
>
>
> Both ResourceManagers throw STATE_STORE_OP_FAILED events and eventually 
> fail. The ZooKeeper log shows that ZKRMStateStore tries to update a znode 
> larger than 1MB, which is the default limit of the ZooKeeper server and 
> client ('jute.maxbuffer').
> The ResourceManager (IP addr: 10.153.80.8) log shows the following:
> {code}
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Meanwhile, the ZooKeeper log shows the following:
> {code}
> 2014-07-25 22:10:09,728 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
> Accepted socket connection from /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
> attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating 
> client: 0x247684586e70006
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 
> 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
> packet /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth 
> success /10.153.80.8:58890
> 

[jira] [Assigned] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB

2019-09-29 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi reassigned YARN-2368:
---

Assignee: zhuqi

> ResourceManager failed when ZKRMStateStore tries to update znode data larger 
> than 1MB
> -
>
> Key: YARN-2368
> URL: https://issues.apache.org/jira/browse/YARN-2368
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Leitao Guo
>Assignee: zhuqi
>Priority: Critical
> Attachments: YARN-2368.patch
>
>
> Both ResourceManagers throw STATE_STORE_OP_FAILED events and eventually 
> fail. The ZooKeeper log shows that ZKRMStateStore tries to update a znode 
> larger than 1MB, which is the default limit of the ZooKeeper server and 
> client ('jute.maxbuffer').
> The ResourceManager (IP addr: 10.153.80.8) log shows the following:
> {code}
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session connected
> 2014-07-25 22:33:11,078 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> ZKRMStateStore Session restored
> 2014-07-25 22:33:11,214 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for 
> /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Meanwhile, the ZooKeeper log shows the following:
> {code}
> 2014-07-25 22:10:09,728 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
> Accepted socket connection from /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client 
> attempting to renew session 0x247684586e70006 at /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating 
> client: 0x247684586e70006
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session 
> 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth 
> packet /10.153.80.8:58890
> 2014-07-25 22:10:09,730 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth 
> success /10.153.80.8:58890
> 2014-07-25 22:10:09,742 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception 
> causing close of session 0x247684586e70006 due to java.io.IOException: Len 
> error 1530
> 747
> 2014-07-25 22:10:09,743 [myid:1] - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed 
> socket connection for client /10.153.80.8:58890 which 

[jira] [Commented] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.

2019-09-05 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923146#comment-16923146
 ] 

zhuqi commented on YARN-8995:
-

Hi [~Tao Yang] 

I have improved it now. 

Thanks a lot.

> Log events info in AsyncDispatcher when event queue size cumulatively reaches 
> a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, 
> image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.

2019-09-05 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Summary: Log events info in AsyncDispatcher when event queue size 
cumulatively reaches a certain number every time.  (was: Log the event type of 
the too big AsyncDispatcher event queue size, and add the information to the 
metrics. )

> Log events info in AsyncDispatcher when event queue size cumulatively reaches 
> a certain number every time.
> --
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, 
> image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1699#comment-1699
 ] 

zhuqi commented on YARN-8995:
-

Hi [~Tao Yang]. 

!image-2019-09-04-15-20-02-914.png!

The screenshot shows the metric I have changed; it is no longer in thousands, 
but I forgot to change this in the last two patches. Sorry for my mistake.

Thanks.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Attachment: image-2019-09-04-15-20-02-914.png

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-04 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921977#comment-16921977
 ] 

zhuqi commented on YARN-8995:
-

Hi [~Tao Yang] 

I have now fixed the checkstyle issues.

Thanks.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, 
> YARN-8995.014.patch
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-09-02 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Attachment: (was: YARN-8995.013.patch)

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-31 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920268#comment-16920268
 ] 

zhuqi commented on YARN-8995:
-

Hi   [~Tao Yang]

The fixed patch is now available.

Thanks very much for your patience.

 

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, 
> YARN-8995.011.patch, YARN-8995.012.patch
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-31 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Attachment: (was: YARN-8995.010.patch)

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-25 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Attachment: (was: YARN-8995.009.patch)

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.009.patch
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-25 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-8995:

Attachment: (was: YARN-8995.010.patch)

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, there are unexpected situations that cause some 
> event queues to block the performance of the cluster, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262. I think it's necessary to 
> log the event types when the event queue grows too big, add the information 
> to the metrics, and make the queue-size threshold a configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-25 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915282#comment-16915282
 ] 

zhuqi commented on YARN-8995:
-

Hi [~cheersyang]/[~Tao Yang],

I have submitted the new patch.

Please let me know if there is any further advice before it is merged.

Thanks.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch, YARN-8995.010.patch
>
>
> In our growing cluster, unexpected situations sometimes cause an event queue 
> to grow large enough to degrade cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . It is therefore necessary 
> to log the event types in an oversized event queue and to add that 
> information to the metrics, with the queue-size threshold exposed as a 
> configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-21 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912381#comment-16912381
 ] 

zhuqi commented on YARN-8995:
-

Hi [~cheersyang],

Thanks for your review.

I use an in-thousands value here because I want to force users to set the 
threshold in thousands, so that it lines up with the existing queue-size log 
output if they do not use the default of 5000.

The new description looks good to me; I will update my description accordingly.

Thanks.
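
The "in thousands" constraint lines up with the dispatcher already logging 
"Size of event-queue is N" only when N is a multiple of 1000. A 
self-contained sketch of that rule follows; the helper class below is an 
assumption for illustration, not code from the patch.

{code:java}
public final class ThresholdUtil {

  /** Force the configured threshold to a multiple of 1000, minimum 1000. */
  static int normalizeThreshold(int configured) {
    return Math.max(1000, (configured / 1000) * 1000);
  }

  /** Dump details only on the 1000-boundaries the dispatcher already logs. */
  static boolean shouldDumpDetails(int qSize, int threshold) {
    return qSize >= threshold && qSize % 1000 == 0;
  }

  public static void main(String[] args) {
    int threshold = normalizeThreshold(5300);                // -> 5000
    System.out.println(shouldDumpDetails(6000, threshold));  // true
    System.out.println(shouldDumpDetails(6500, threshold));  // false
  }
}
{code}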

 

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, unexpected situations sometimes cause an event queue 
> to grow large enough to degrade cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . It is therefore necessary 
> to log the event types in an oversized event queue and to add that 
> information to the metrics, with the queue-size threshold exposed as a 
> configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-08-19 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910496#comment-16910496
 ] 

zhuqi commented on YARN-8995:
-

Hi, [~Tao Yang]
Thanks a lot. I am looking forward to contributing more.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, unexpected situations sometimes cause an event queue 
> to grow large enough to degrade cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . It is therefore necessary 
> to log the event types in an oversized event queue and to add that 
> information to the metrics, with the queue-size threshold exposed as a 
> configurable parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed

2019-06-21 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870034#comment-16870034
 ] 

zhuqi commented on YARN-9634:
-

Hi, [~Tao Yang].

Thanks for the comment. I have changed my description. As [~cheersyang] 
commented, my focus is on whether we can distribute these dirs across multiple 
federation namespaces to get better NameNode performance for our HDFS 
federation. This would affect some other logic, though, so the details need to 
be considered.

 

> Make yarn submit dir and log aggregation dir more evenly distributed
> 
>
> Key: YARN-9634
> URL: https://issues.apache.org/jira/browse/YARN-9634
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> When the cluster is large, the dir where users submit jobs, the dir where 
> container logs are aggregated, and other such data will fill the HDFS 
> directories, which have a default storage limit; retention can be tuned via 
> "yarn.log-aggregation.retain-seconds". However, the 
> FSNamesystemLock#writeLock and the RPC operations triggered by these 
> directory operations load the namespace that hosts the dirs. To improve this 
> we have placed these dirs in a single HDFS federation namespace, but as the 
> cluster grows huge, that single namespace itself limits RPC performance. To 
> address this, we can distribute these dirs across multiple namespaces, with 
> a selectable policy such as a hash policy or a round-robin policy.
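
To make the proposal concrete, here is a minimal sketch of what a pluggable 
namespace-selection policy might look like, assuming a fixed list of 
configured namespace URIs; the class, the policy names, and the 
application-id hashing are illustrative assumptions, since the JIRA does not 
define a concrete API.

{code:java}
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class NamespaceChooser {
  public enum Policy { HASH, ROUND_ROBIN }

  private final List<String> namespaces; // e.g. hdfs://ns1, hdfs://ns2, ...
  private final Policy policy;
  private final AtomicLong counter = new AtomicLong();

  public NamespaceChooser(List<String> namespaces, Policy policy) {
    this.namespaces = namespaces;
    this.policy = policy;
  }

  /** Pick the namespace that hosts the staging/log dirs for an application. */
  public String chooseFor(String applicationId) {
    int n = namespaces.size();
    int idx;
    if (policy == Policy.HASH) {
      // Hashing keeps a given application's dirs on a stable namespace.
      idx = Math.floorMod(applicationId.hashCode(), n);
    } else {
      // Round-robin spreads load evenly; the counter resets per process.
      idx = (int) (counter.getAndIncrement() % n);
    }
    return namespaces.get(idx);
  }

  public static void main(String[] args) {
    NamespaceChooser c = new NamespaceChooser(
        List.of("hdfs://ns1", "hdfs://ns2", "hdfs://ns3"), Policy.HASH);
    System.out.println(c.chooseFor("application_1561000000000_0001"));
  }
}
{code}

A hash policy keeps an application's dirs on a stable namespace, which matters 
when the dirs are read back later (e.g. during log aggregation), while 
round-robin spreads write load most evenly.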



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed

2019-06-21 Thread zhuqi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated YARN-9634:

Description: When the cluster is large, the dir where users submit jobs, the 
dir where container logs are aggregated, and other such data will fill the 
HDFS directories, which have a default storage limit; retention can be tuned 
via "yarn.log-aggregation.retain-seconds". However, the 
FSNamesystemLock#writeLock and the RPC operations triggered by these directory 
operations load the namespace that hosts the dirs. To improve this we have 
placed these dirs in a single HDFS federation namespace, but as the cluster 
grows huge, that single namespace itself limits RPC performance. To address 
this, we can distribute these dirs across multiple namespaces, with a 
selectable policy such as a hash policy or a round-robin policy.  (was: When 
the cluster is large, the dir where users submit jobs, the dir where container 
logs are aggregated, and other such data will fill the HDFS directory, which 
has a default storage limit. To address this, we can distribute these dirs 
more evenly, with a selectable policy such as a hash policy or a round-robin 
policy.)

> Make yarn submit dir and log aggregation dir more evenly distributed
> 
>
> Key: YARN-9634
> URL: https://issues.apache.org/jira/browse/YARN-9634
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> When the cluster is large, the dir where users submit jobs, the dir where 
> container logs are aggregated, and other such data will fill the HDFS 
> directories, which have a default storage limit; retention can be tuned via 
> "yarn.log-aggregation.retain-seconds". However, the 
> FSNamesystemLock#writeLock and the RPC operations triggered by these 
> directory operations load the namespace that hosts the dirs. To improve this 
> we have placed these dirs in a single HDFS federation namespace, but as the 
> cluster grows huge, that single namespace itself limits RPC performance. To 
> address this, we can distribute these dirs across multiple namespaces, with 
> a selectable policy such as a hash policy or a round-robin policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-21 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870027#comment-16870027
 ] 

zhuqi commented on YARN-8995:
-

The test now passes without problems. cc [~Tao Yang].

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, unexpected situations sometimes cause an event queue 
> to grow large enough to degrade cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . It is therefore necessary 
> to log the event types in an oversized event queue and to add that 
> information to the metrics, with the queue-size threshold exposed as a 
> configurable parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.

2019-06-21 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869594#comment-16869594
 ] 

zhuqi commented on YARN-8995:
-

Hi, [~Tao Yang],

I have submitted the new patch and fixed the checkstyle warnings.

Thanks.

> Log the event type of the too big AsyncDispatcher event queue size, and add 
> the information to the metrics. 
> 
>
> Key: YARN-8995
> URL: https://issues.apache.org/jira/browse/YARN-8995
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: metrics, nodemanager, resourcemanager
>Affects Versions: 3.2.0, 3.3.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: TestStreamPerf.java, YARN-8995.001.patch, 
> YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, 
> YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, 
> YARN-8995.008.patch
>
>
> In our growing cluster, unexpected situations sometimes cause an event queue 
> to grow large enough to degrade cluster performance, such as the bug in 
> https://issues.apache.org/jira/browse/YARN-5262 . It is therefore necessary 
> to log the event types in an oversized event queue and to add that 
> information to the metrics, with the queue-size threshold exposed as a 
> configurable parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed

2019-06-20 Thread zhuqi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869151#comment-16869151
 ] 

zhuqi commented on YARN-9634:
-

Hi [~cheersyang],

Yes, I mean the space quota. Even in the normal case where no quota is used, 
in our large cluster the fixed namespace that hosts the submit dir and the log 
aggregation dir suffers degraded RPC performance. I think we can add 
configuration with a round-robin or hash policy to distribute these dirs 
across the configured namespaces of the HDFS federation.

> Make yarn submit dir and log aggregation dir more evenly distributed
> 
>
> Key: YARN-9634
> URL: https://issues.apache.org/jira/browse/YARN-9634
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> When the cluster is large, the dir where users submit jobs, the dir where 
> container logs are aggregated, and other such data will fill the HDFS 
> directory, which has a default storage limit. To address this, we can 
> distribute these dirs more evenly, with a selectable policy such as a hash 
> policy or a round-robin policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


