[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi updated YARN-10504:
-------------------------
    Attachment: (was: YARN-10504.001.patch)

> Implement weight mode in Capacity Scheduler
> -------------------------------------------
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-10504.001.patch
>
> To allow queues to be created flexibly in Capacity Scheduler, a weight mode
> should be introduced. The existing {{capacity}} property should be reused
> with a different syntax, e.g.:
> root.users.capacity = (1.0) or ~1.0 or ^1.0 or @1.0
> root.users.capacity = 1.0w
> root.users.capacity = w:1.0
> Weight support should not impact the existing functionality.
>
> The new functionality should:
> * accept and validate the new weight values
> * enforce a single mode on the whole queue tree
> * (re)calculate the relative (percentage-based) capacities from the weights
> at startup and every time the queue structure changes

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
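If the weight syntax lands on the "1.0w" form, the "accept and validate the new weight values" and "(re)calculate the relative capacities" items could look roughly like the sketch below. The class and method names are hypothetical illustrations, not the actual CapacityScheduler implementation:

```java
// Sketch of parsing the proposed "1.0w" weight syntax for the capacity
// property. WeightCapacityParser and its methods are hypothetical names,
// not Hadoop APIs.
public class WeightCapacityParser {

  /** Returns true if the value uses the proposed "1.0w" weight suffix. */
  public static boolean isWeightMode(String value) {
    return value != null && value.trim().endsWith("w");
  }

  /**
   * Parses "1.0w" into the float weight 1.0. Rejects negative or malformed
   * weights, matching the "accept and validate" requirement above.
   */
  public static float parseWeight(String value) {
    String v = value.trim();
    if (!v.endsWith("w")) {
      throw new IllegalArgumentException("Not a weight value: " + value);
    }
    float w = Float.parseFloat(v.substring(0, v.length() - 1));
    if (w < 0f) {
      throw new IllegalArgumentException("Weight must be non-negative: " + value);
    }
    return w;
  }

  /** Converts sibling weights to relative (percentage-based) capacities. */
  public static float[] toRelativeCapacities(float[] weights) {
    float sum = 0f;
    for (float w : weights) {
      sum += w;
    }
    float[] out = new float[weights.length];
    for (int i = 0; i < weights.length; i++) {
      out[i] = sum == 0f ? 0f : weights[i] / sum * 100f;
    }
    return out;
  }
}
```

For example, two siblings with weights 1.0 and 3.0 would resolve to 25% and 75% of their parent, and would be recomputed whenever a sibling is added or removed.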
[jira] [Updated] (YARN-10504) Implement weight mode in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi updated YARN-10504:
-------------------------
    Attachment: YARN-10504.001.patch

> Implement weight mode in Capacity Scheduler
> -------------------------------------------
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-10504.001.patch
[jira] [Assigned] (YARN-10504) Implement weight mode in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi reassigned YARN-10504:
----------------------------
    Assignee: zhuqi (was: Benjamin Teke)

> Implement weight mode in Capacity Scheduler
> -------------------------------------------
>
> Key: YARN-10504
> URL: https://issues.apache.org/jira/browse/YARN-10504
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Benjamin Teke
> Assignee: zhuqi
> Priority: Major
[jira] [Updated] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi updated YARN-10380:
-------------------------
    Attachment: (was: YARN-10380.002.patch)

> Import logic of multi-node allocation in CapacityScheduler
> ----------------------------------------------------------
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.3.0
> Reporter: Wangda Tan
> Assignee: zhuqi
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> *1) Entry point:*
> When we do multi-node allocation, we're using the same logic as async
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
> for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>     if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>       continue;
>     }
>     cs.allocateContainersToNode(node.getNodeID(), false);
>   }
> } {code}
> Is this the most effective way to do multi-node scheduling? Should we
> allocate based on partitions? In the above logic, if we have thousands of
> nodes in one partition, we will repeatedly access all nodes of the
> partition thousands of times.
> I would suggest making the entry points for node-heartbeat,
> async-scheduling (single node), and async-scheduling (multi-node)
> different.
> Node-heartbeat and async-scheduling (single node) can still be similar and
> share most of the code.
> async-scheduling (multi-node) should iterate partitions first, using
> pseudo code like:
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
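The partition-first entry point sketched in the pseudo code above could take the following shape. All types and method names here are simplified stand-ins, not the real CapacityScheduler internals:

```java
import java.util.List;
import java.util.Map;

// Sketch of the proposed partition-first multi-node entry point.
// Partition names map to their node lists; the Allocator interface is a
// simplified stand-in for the scheduler's candidate-set allocation.
public class MultiNodeScheduling {

  interface Allocator {
    // Allocates on a candidate set drawn from a single partition.
    void allocateContainersOnMultiNodes(List<String> candidateNodes);
  }

  /**
   * Instead of looping over every node in the cluster per scheduling pass,
   * iterate each partition once and hand its candidate set to the
   * allocator, avoiding repeated per-node traversals of the partition.
   */
  public static int scheduleByPartition(
      Map<String, List<String>> nodesByPartition, Allocator allocator) {
    int passes = 0;
    for (Map.Entry<String, List<String>> e : nodesByPartition.entrySet()) {
      allocator.allocateContainersOnMultiNodes(e.getValue());
      passes++;  // one allocation pass per partition, not per node
    }
    return passes;
  }
}
```

With this shape, a partition of thousands of nodes is visited once per scheduling pass rather than once per node, which is the core of the suggestion in the description.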
[jira] [Updated] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi updated YARN-10380:
-------------------------
    Attachment: (was: YARN-10380.001.patch)

> Import logic of multi-node allocation in CapacityScheduler
> ----------------------------------------------------------
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.3.0
> Reporter: Wangda Tan
> Assignee: zhuqi
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
[jira] [Created] (YARN-10525) Support FS Convert to CS with weight mode enabled in CS.
zhuqi created YARN-10525:
-------------------------

Summary: Support FS Convert to CS with weight mode enabled in CS.
Key: YARN-10525
URL: https://issues.apache.org/jira/browse/YARN-10525
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: zhuqi
[jira] [Created] (YARN-10524) Support multi resource type based weight mode in CS.
zhuqi created YARN-10524:
-------------------------

Summary: Support multi resource type based weight mode in CS.
Key: YARN-10524
URL: https://issues.apache.org/jira/browse/YARN-10524
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: zhuqi
Assignee: zhuqi
[jira] [Created] (YARN-10522) Document for Flexible Auto Queue Creation in Capacity Scheduler.
zhuqi created YARN-10522:
-------------------------

Summary: Document for Flexible Auto Queue Creation in Capacity Scheduler.
Key: YARN-10522
URL: https://issues.apache.org/jira/browse/YARN-10522
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: zhuqi

The documentation should be updated to cover this feature.
[jira] [Created] (YARN-10521) To support the mixed mode on different levels(optional) or disabled all.
zhuqi created YARN-10521:
-------------------------

Summary: To support the mixed mode on different levels(optional) or disabled all.
Key: YARN-10521
URL: https://issues.apache.org/jira/browse/YARN-10521
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: zhuqi
Assignee: zhuqi

*Mixed* percentage / weight / absolute resource values should *not* be allowed *at the same time* _at any level of the hierarchy_. Restricting this across all levels is not strictly necessary, since it is theoretically possible to support mixed modes on different levels, but whether that is worth the complexity is up for debate.

Mixing static and auto-created queues under the same parent will not be supported if the queues are defined by percentages.
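Enforcing a single capacity mode across the hierarchy boils down to one validation pass over the configured queues. Below is a minimal sketch with hypothetical names; the real validation would live inside CapacityScheduler's configuration parsing, and the syntax checks assume the "1.0w" weight and "[...]" absolute-resource forms:

```java
import java.util.Map;

// Sketch of rejecting mixed percentage/weight/absolute capacity modes.
// CapacityMode, modeOf, and validateSingleMode are hypothetical names.
public class CapacityModeValidator {

  enum CapacityMode { PERCENTAGE, WEIGHT, ABSOLUTE }

  /** Classifies a configured capacity value by its syntax. */
  static CapacityMode modeOf(String value) {
    String v = value.trim();
    if (v.startsWith("[")) {
      return CapacityMode.ABSOLUTE;   // e.g. [memory=4096,vcores=4]
    }
    if (v.endsWith("w")) {
      return CapacityMode.WEIGHT;     // e.g. 1.0w
    }
    return CapacityMode.PERCENTAGE;   // e.g. 50
  }

  /**
   * Throws if the queue-to-capacity map mixes modes anywhere in the tree,
   * mirroring the "not allowed at the same time" rule above.
   */
  public static CapacityMode validateSingleMode(Map<String, String> capacities) {
    CapacityMode seen = null;
    for (Map.Entry<String, String> e : capacities.entrySet()) {
      CapacityMode m = modeOf(e.getValue());
      if (seen == null) {
        seen = m;
      } else if (seen != m) {
        throw new IllegalArgumentException(
            "Mixed capacity modes: " + seen + " vs " + m + " at " + e.getKey());
      }
    }
    return seen;
  }
}
```

Supporting mixed modes per level (the optional part of the title) would only require scoping this check to each parent's children instead of the whole map.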
[jira] [Commented] (YARN-9443) Fast RM Failover using Ratis (Raft protocol)
[ https://issues.apache.org/jira/browse/YARN-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244973#comment-17244973 ]

zhuqi commented on YARN-9443:
-----------------------------
[~prabhujoseph] [~leftnoteasy] This is a great improvement; I am looking forward to the design. I can help finish some of the sub-tasks when I am free, and I hope it can be applied to our production cluster with thousands of nodes; it would be very helpful. Thanks a lot.

> Fast RM Failover using Ratis (Raft protocol)
> --------------------------------------------
>
> Key: YARN-9443
> URL: https://issues.apache.org/jira/browse/YARN-9443
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: resourcemanager
> Affects Versions: 3.2.0
> Reporter: Prabhu Joseph
> Assignee: Prabhu Joseph
> Priority: Major
>
> During failover, the standby RM lags because it has to recover from the
> ZooKeeper / FileSystem state store. RM HA using Ratis (Raft protocol) can
> achieve fast failover because all RMs are already in sync. This is used by
> Ozone - HDDS-505.
>
> cc [~nandakumar131]
[jira] [Updated] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.
[ https://issues.apache.org/jira/browse/YARN-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi updated YARN-10514:
-------------------------
    Description:
Whether we schedule with a multi-node lookup policy for async scheduling or with heartbeat-based scheduling, we run into scheduling fragmentation. With CPU-intensive, GPU-intensive, or memory-intensive jobs, the cluster wastes a lot of resources, so this issue moves the scheduler to support dominant-resource-based scheduling, to help our cluster get better resource utilization and to load-balance resource distribution across NodeManagers.

  was:
Whether we schedule with a multi-node lookup policy for async scheduling or with heartbeat-based scheduling, we run into scheduling fragmentation. With CPU-intensive, GPU-intensive, or memory-intensive jobs, the cluster wastes a lot of resources, so this issue moves the scheduler to support dominant-resource-based scheduling, to help our cluster get better resource utilization.

> Introduce a dominant resource based schedule policy to increase the resource
> utilization, avoid heavy cluster resource fragments.
> ----------------------------------------------------------------------------
>
> Key: YARN-10514
> URL: https://issues.apache.org/jira/browse/YARN-10514
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.3.0, 3.4.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-10514.001.patch
>
> Whether we schedule with a multi-node lookup policy for async scheduling or
> with heartbeat-based scheduling, we run into scheduling fragmentation. With
> CPU-intensive, GPU-intensive, or memory-intensive jobs, the cluster wastes
> a lot of resources, so this issue moves the scheduler to support
> dominant-resource-based scheduling, to help our cluster get better resource
> utilization and to load-balance resource distribution across NodeManagers.
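One way to read "dominant resource based schedule" is to rank candidate nodes by free capacity in the request's dominant resource, so resource-heavy requests land on nodes that can actually fit them instead of leaving fragments. The sketch below uses simplified, assumed types rather than YARN's Resource/SchedulerNode classes:

```java
import java.util.Comparator;
import java.util.List;

// Sketch of dominant-resource-aware node ordering to reduce fragmentation.
// The Node class and orderForRequest are illustrative, not YARN APIs.
public class DominantResourcePolicy {

  static final class Node {
    final String id;
    final long freeMemMb;
    final int freeVcores;
    Node(String id, long freeMemMb, int freeVcores) {
      this.id = id; this.freeMemMb = freeMemMb; this.freeVcores = freeVcores;
    }
  }

  /**
   * Orders nodes so the node with the most free dominant resource
   * (relative to the request's shape) is tried first, leaving fewer
   * unusable fragments behind on other nodes.
   */
  public static List<Node> orderForRequest(
      List<Node> nodes, long reqMemMb, int reqVcores,
      long clusterMemMb, int clusterVcores) {
    // Dominant resource of the request: the one with the larger cluster share.
    double memShare = (double) reqMemMb / clusterMemMb;
    double cpuShare = (double) reqVcores / clusterVcores;
    boolean memDominant = memShare >= cpuShare;
    Comparator<Node> byDominantFree = memDominant
        ? Comparator.comparingLong((Node n) -> n.freeMemMb).reversed()
        : Comparator.comparingInt((Node n) -> n.freeVcores).reversed();
    nodes.sort(byDominantFree);
    return nodes;
  }
}
```

A memory-heavy request would thus prefer the node with the most free memory, while a vcore-heavy request prefers the node with the most free vcores; the same comparator could plug into a multi-node lookup policy or a heartbeat-time node check.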
[jira] [Comment Edited] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.
[ https://issues.apache.org/jira/browse/YARN-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243071#comment-17243071 ]

zhuqi edited comment on YARN-10514 at 12/3/20, 10:17 AM:
--------------------------------------------------------
[~leftnoteasy] [~tangzhankun] [~prabhujoseph] [~sunil.gov...@gmail.com] [~jiwq] Do you have any advice on this proposal? I submitted a draft patch to support the async multi-node scheduling mode in CS, but I think we should also handle resource fragmentation in the heartbeat scheduling mode in CS/FS, to get better resource utilization.

was (Author: zhuqi):
[~leftnoteasy] [~tangzhankun] [~prabhujoseph] [~sunil.gov...@gmail.com] [~jiwq] Do you have any advice on this proposal? I submitted a draft patch to support the async multi-node scheduling mode, but I think we should also handle resource fragmentation in the heartbeat scheduling mode, to get better resource utilization.

> Introduce a dominant resource based schedule policy to increase the resource
> utilization, avoid heavy cluster resource fragments.
> ----------------------------------------------------------------------------
>
> Key: YARN-10514
> URL: https://issues.apache.org/jira/browse/YARN-10514
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.3.0, 3.4.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-10514.001.patch
>
> Whether we schedule with a multi-node lookup policy for async scheduling or
> with heartbeat-based scheduling, we run into scheduling fragmentation. With
> CPU-intensive, GPU-intensive, or memory-intensive jobs, the cluster wastes
> a lot of resources, so this issue moves the scheduler to support
> dominant-resource-based scheduling, to help our cluster get better resource
> utilization.
[jira] [Commented] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.
[ https://issues.apache.org/jira/browse/YARN-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243071#comment-17243071 ]

zhuqi commented on YARN-10514:
------------------------------
[~leftnoteasy] [~tangzhankun] [~prabhujoseph] [~sunil.gov...@gmail.com] [~jiwq] Do you have any advice on this proposal? I submitted a draft patch to support the async multi-node scheduling mode, but I think we should also handle resource fragmentation in the heartbeat scheduling mode, to get better resource utilization.

> Introduce a dominant resource based schedule policy to increase the resource
> utilization, avoid heavy cluster resource fragments.
> ----------------------------------------------------------------------------
>
> Key: YARN-10514
> URL: https://issues.apache.org/jira/browse/YARN-10514
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 3.3.0, 3.4.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: YARN-10514.001.patch
>
> Whether we schedule with a multi-node lookup policy for async scheduling or
> with heartbeat-based scheduling, we run into scheduling fragmentation. With
> CPU-intensive, GPU-intensive, or memory-intensive jobs, the cluster wastes
> a lot of resources, so this issue moves the scheduler to support
> dominant-resource-based scheduling, to help our cluster get better resource
> utilization.
[jira] [Created] (YARN-10514) Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.
zhuqi created YARN-10514:
-------------------------

Summary: Introduce a dominant resource based schedule policy to increase the resource utilization, avoid heavy cluster resource fragments.
Key: YARN-10514
URL: https://issues.apache.org/jira/browse/YARN-10514
Project: Hadoop YARN
Issue Type: Improvement
Affects Versions: 3.3.0, 3.4.0
Reporter: zhuqi
Assignee: zhuqi

Whether we schedule with a multi-node lookup policy for async scheduling or with heartbeat-based scheduling, we run into scheduling fragmentation. With CPU-intensive, GPU-intensive, or memory-intensive jobs, the cluster wastes a lot of resources, so this issue moves the scheduler to support dominant-resource-based scheduling, to help our cluster get better resource utilization.
[jira] [Comment Edited] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242887#comment-17242887 ]

zhuqi edited comment on YARN-10496 at 12/3/20, 3:53 AM:
-------------------------------------------------------
Thanks [~wangda] for putting together this proposal. As a long-time FS user, I think option #1 would be the way to go; I agree with what [~epayne] said. We should discuss how to define max capacity in CS; in FS, max is usually expressed as an absolute resource. If we can restrict the max capacity to two (or three) choices:

1: Use absolute resources.
2: Use a percentage of the immediate parent. (Such as a percentage of weight * 1.5, etc.)
3 (optional): Use a percentage of the cluster.

This will help FS users migrate, and CS users will adapt to it as well. Thanks a lot.

was (Author: zhuqi):
Thanks [~wangda] for putting together this proposal. As a long-time FS user, I think option #1 would be the way to go; I agree with what [~epayne] said. We should discuss how to define max capacity in CS; in FS, max is usually expressed as an absolute resource. If we can restrict the max capacity to two (or three) choices:

1: Use absolute resources.
2: Use a percentage of the immediate parent.
3 (optional): Use a percentage of the cluster.

This will help FS users migrate, and CS users will adapt to it as well. Thanks a lot.

> [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
> ----------------------------------------------------------------------
>
> Key: YARN-10496
> URL: https://issues.apache.org/jira/browse/YARN-10496
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: capacity scheduler
> Reporter: Wangda Tan
> Priority: Major
>
> CapacityScheduler today doesn’t support auto queue creation that is
> flexible enough. The current constraints:
> * Only leaf queues can be auto-created
> * A parent can only have either static queues or dynamic ones. This causes
> multiple constraints. For example:
> ** It isn’t possible to have a VIP user like Alice with a static queue
> root.user.alice with 50% capacity while the other user queues (under
> root.user) are created dynamically and share the remaining 50% of
> resources.
> ** This implies that there is no possibility to have both dynamically
> created and static queues at the same time under root
> * In comparison, FairScheduler allows the following scenarios, while
> Capacity Scheduler doesn’t:
> ** A new queue needs to be created under an existing parent, while the
> parent already has static queues
> ** Nested queue mapping policy, like in the following example:
> | |
> ** Here two levels of queues may need to be created
> If an application belongs to user _alice_ (who has the primary_group of
> _engineering_), the scheduler checks whether _root.engineering_ exists; if
> it doesn’t, it’ll be created. Then the scheduler checks whether
> _root.engineering.alice_ exists, and creates it if it doesn’t.
>
> When we try to move users from FairScheduler to CapacityScheduler, we face
> feature gaps which block users from migrating from FS to CS.
[jira] [Commented] (YARN-10496) [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242887#comment-17242887 ]

zhuqi commented on YARN-10496:
------------------------------
Thanks [~wangda] for putting together this proposal. As a long-time FS user, I think option #1 would be the way to go; I agree with what [~epayne] said. We should discuss how to define max capacity in CS; in FS, max is usually expressed as an absolute resource. If we can restrict the max capacity to two (or three) choices:

1: Use absolute resources.
2: Use a percentage of the immediate parent.
3 (optional): Use a percentage of the cluster.

This will help FS users migrate, and CS users will adapt to it as well. Thanks a lot.

> [Umbrella] Support Flexible Auto Queue Creation in Capacity Scheduler
> ----------------------------------------------------------------------
>
> Key: YARN-10496
> URL: https://issues.apache.org/jira/browse/YARN-10496
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: capacity scheduler
> Reporter: Wangda Tan
> Priority: Major
>
> CapacityScheduler today doesn’t support auto queue creation that is
> flexible enough. The current constraints:
> * Only leaf queues can be auto-created
> * A parent can only have either static queues or dynamic ones. This causes
> multiple constraints. For example:
> ** It isn’t possible to have a VIP user like Alice with a static queue
> root.user.alice with 50% capacity while the other user queues (under
> root.user) are created dynamically and share the remaining 50% of
> resources.
> ** This implies that there is no possibility to have both dynamically
> created and static queues at the same time under root
> * In comparison, FairScheduler allows the following scenarios, while
> Capacity Scheduler doesn’t:
> ** A new queue needs to be created under an existing parent, while the
> parent already has static queues
> ** Nested queue mapping policy, like in the following example:
> | |
> ** Here two levels of queues may need to be created
> If an application belongs to user _alice_ (who has the primary_group of
> _engineering_), the scheduler checks whether _root.engineering_ exists; if
> it doesn’t, it’ll be created. Then the scheduler checks whether
> _root.engineering.alice_ exists, and creates it if it doesn’t.
>
> When we try to move users from FairScheduler to CapacityScheduler, we face
> feature gaps which block users from migrating from FS to CS.
[jira] [Comment Edited] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail
[ https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240742#comment-17240742 ]

zhuqi edited comment on YARN-10169 at 12/2/20, 11:45 AM:
--------------------------------------------------------
[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] I have submitted a patch to fix it; could anyone review it? Thanks.

was (Author: zhuqi):
[~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] @ I have submitted a patch to fix it, if anyone can review it. Thanks.

> Mixed absolute resource value and percentage-based resource value in
> CapacityScheduler should fail
> --------------------------------------------------------------------
>
> Key: YARN-10169
> URL: https://issues.apache.org/jira/browse/YARN-10169
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Wangda Tan
> Assignee: zhuqi
> Priority: Blocker
> Attachments: YARN-10169.001.patch, YARN-10169.002.patch,
> YARN-10169.003.patch
>
> To me this is a bug: if a queue has capacity set to a float and
> maximum-capacity set to an absolute value, the existing logic allows the
> behavior. For example:
> {code:java}
> queue.capacity = 0.8
> queue.maximum-capacity = [mem=x, vcore=y] {code}
> We should throw an exception when configured like this.
[jira] [Comment Edited] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242179#comment-17242179 ]

zhuqi edited comment on YARN-9618 at 12/2/20, 9:05 AM:
------------------------------------------------------
[~bibinchundatt] [~leftnoteasy] This is a big improvement for NodeManager scalability. I submitted a draft patch to trigger the event on the app itself, avoiding flooding the async dispatcher. Do you have any advice? Thanks.

was (Author: zhuqi):
[~bibinchundatt] [~leftnoteasy] This is a big improvement for NodeManager scalability. I submitted a draft patch to trigger the event on the app itself. Do you have any advice? Thanks.

> NodeListManager event improvement
> ----------------------------------
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: zhuqi
> Priority: Critical
> Attachments: YARN-9618.001.patch
>
> The current NodeListManager implementation blocks the async dispatcher with
> its events and can crash the RM and slow down event processing:
> # On a cluster restart with 1K running apps, each node-usable event creates
> 1K events; overall this could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked until the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event
> handler.
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242179#comment-17242179 ]

zhuqi commented on YARN-9618:
-----------------------------
[~bibinchundatt] [~leftnoteasy] This is a big improvement for NodeManager scalability. I submitted a draft patch to trigger the event on the app itself. Do you have any advice? Thanks.

> NodeListManager event improvement
> ----------------------------------
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: zhuqi
> Priority: Critical
> Attachments: YARN-9618.001.patch
>
> The current NodeListManager implementation blocks the async dispatcher with
> its events and can crash the RM and slow down event processing:
> # On a cluster restart with 1K running apps, each node-usable event creates
> 1K events; overall this could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked until the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event
> handler.
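The second solution bullet above (calling the RMApp event handler directly instead of enqueueing per-app events on the shared dispatcher) can be illustrated with a toy fan-out sketch; the interfaces below are stand-ins, not YARN's Dispatcher/EventHandler API:

```java
import java.util.List;

// Toy sketch of the proposed fix: on a node-usability change, call each
// running app's handler directly instead of enqueueing one event per app
// on the shared async dispatcher. Names are illustrative, not YARN APIs.
public class NodeUsableNotifier {

  /** Minimal stand-in for an RMApp that reacts to node-usability changes. */
  interface AppHandler {
    void onNodeUsableChanged(String nodeId, boolean usable);
  }

  /**
   * Direct fan-out: no per-app event objects hit the central queue, so a
   * usability change touching N nodes with M running apps no longer floods
   * the dispatcher with N*M queued events.
   */
  public static int notifyApps(
      List<AppHandler> runningApps, String nodeId, boolean usable) {
    int delivered = 0;
    for (AppHandler app : runningApps) {
      app.onNodeUsableChanged(nodeId, usable);  // synchronous, direct call
      delivered++;
    }
    return delivered;
  }
}
```

The trade-off is that the fan-out now runs on the caller's thread, which is why the first solution bullet pairs it with a dedicated async handler, similar to the scheduler's.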
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242013#comment-17242013 ]

zhuqi commented on YARN-10380:
------------------------------
[~jiwq] Thanks a lot for your review; I have updated the latest PR. [~ztang], do you have any other advice before committing? Thanks.

> Import logic of multi-node allocation in CapacityScheduler
> ----------------------------------------------------------
>
> Key: YARN-10380
> URL: https://issues.apache.org/jira/browse/YARN-10380
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.3.0
> Reporter: Wangda Tan
> Assignee: zhuqi
> Priority: Critical
> Labels: pull-request-available
> Attachments: YARN-10380.001.patch, YARN-10380.002.patch
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> *1) Entry point:*
> When we do multi-node allocation, we're using the same logic as async
> scheduling:
> {code:java}
> // Allocate containers of node [start, end)
> for (FiCaSchedulerNode node : nodes) {
>   if (current++ >= start) {
>     if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) {
>       continue;
>     }
>     cs.allocateContainersToNode(node.getNodeID(), false);
>   }
> } {code}
> Is this the most effective way to do multi-node scheduling? Should we
> allocate based on partitions? In the above logic, if we have thousands of
> nodes in one partition, we will repeatedly access all nodes of the
> partition thousands of times.
> I would suggest making the entry points for node-heartbeat,
> async-scheduling (single node), and async-scheduling (multi-node)
> different.
> Node-heartbeat and async-scheduling (single node) can still be similar and
> share most of the code.
> async-scheduling (multi-node) should iterate partitions first, using
> pseudo code like:
> {code:java}
> for (partition : all partitions) {
>   allocateContainersOnMultiNodes(getCandidate(partition))
> } {code}
[jira] [Assigned] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhuqi reassigned YARN-9618:
---------------------------
    Assignee: zhuqi

> NodeListManager event improvement
> ----------------------------------
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bibin Chundatt
> Assignee: zhuqi
> Priority: Critical
>
> The current NodeListManager implementation blocks the async dispatcher with
> its events and can crash the RM and slow down event processing:
> # On a cluster restart with 1K running apps, each node-usable event creates
> 1K events; overall this could be 5K*1K events for a 5K-node cluster.
> # Event processing is blocked until the new events are added to the queue.
> Solution:
> # Add another async event handler, similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event
> handler.
[jira] [Commented] (YARN-10500) TestDelegationTokenRenewer fails intermittently
[ https://issues.apache.org/jira/browse/YARN-10500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241575#comment-17241575 ] zhuqi commented on YARN-10500: -- [~aajisaka] I hit it too when submitting a patch based on trunk. > TestDelegationTokenRenewer fails intermittently > --- > > Key: YARN-10500 > URL: https://issues.apache.org/jira/browse/YARN-10500 > Project: Hadoop YARN > Issue Type: Bug > Components: test >Reporter: Akira Ajisaka >Priority: Major > Labels: flaky-test > > TestDelegationTokenRenewer sometimes times out. > https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt > {noformat} > [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer > [ERROR] Tests run: 23, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 83.675 s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer > [ERROR] > testTokenThreadTimeout(org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer) > Time elapsed: 30.065 s <<< ERROR! 
> org.junit.runners.model.TestTimedOutException: test timed out after 3 > milliseconds > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:394) > at > org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testTokenThreadTimeout(TestDelegationTokenRenewer.java:1769) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241453#comment-17241453 ] zhuqi commented on YARN-10380: -- [~ztang], I have manually tested with the async test, and also added a unit test in the TestCapacitySchedulerAsyncScheduling class in the latest PR. Thanks a lot. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch, YARN-10380.002.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic as async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is this the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) > different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail
[ https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240742#comment-17240742 ] zhuqi edited comment on YARN-10169 at 12/1/20, 6:11 AM: [~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] I have submitted a patch to fix it; please review it if you can. Thanks. was (Author: zhuqi): [~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] I have submitted a patch to fix it, if anyone can review it. Thanks. > Mixed absolute resource value and percentage-based resource value in > CapacityScheduler should fail > -- > > Key: YARN-10169 > URL: https://issues.apache.org/jira/browse/YARN-10169 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Blocker > Attachments: YARN-10169.001.patch, YARN-10169.002.patch, > YARN-10169.003.patch > > > To me this is a bug: if a queue has capacity set to a float and > maximum-capacity set to an absolute value, the existing logic allows the behavior. > For example: > {code:java} > queue.capacity = 0.8 > queue.maximum-capacity = [mem=x, vcore=y] {code} > We should throw an exception when it is configured like this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
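[Editor's note] The validation YARN-10169 asks for (reject a queue that mixes a percentage capacity with an absolute-resource maximum-capacity) can be sketched like this. The mode detection and method names are assumptions for illustration, not the actual CapacityScheduler configuration code:

```java
// Sketch of the requested check: once a queue declares capacity in one mode
// (percentage vs. absolute resource), its other capacity settings must use
// the same mode, otherwise configuration should fail fast.
public class CapacityModeValidator {

  public enum Mode { PERCENTAGE, ABSOLUTE }

  // "[mem=..,vcore=..]" style values are absolute; plain floats like "0.8"
  // are percentage-based. This parsing rule is illustrative.
  public static Mode modeOf(String value) {
    return value.trim().startsWith("[") ? Mode.ABSOLUTE : Mode.PERCENTAGE;
  }

  public static void validate(String capacity, String maxCapacity) {
    if (modeOf(capacity) != modeOf(maxCapacity)) {
      throw new IllegalArgumentException(
          "capacity and maximum-capacity must use the same mode: got "
              + capacity + " and " + maxCapacity);
    }
  }

  public static void main(String[] args) {
    validate("0.8", "0.9"); // same mode: accepted
    try {
      validate("0.8", "[memory=4096,vcores=4]"); // mixed modes: must fail
      System.out.println("no exception (bug)");
    } catch (IllegalArgumentException expected) {
      System.out.println("rejected mixed modes");
    }
  }
}
```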
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241257#comment-17241257 ] zhuqi commented on YARN-10380: -- [~ztang] Thanks a lot for your patient review. I have fixed the checkstyle issues in the updated PR. [~wangda], do we need to add new unit tests here? > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch, YARN-10380.002.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic as async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is this the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) > different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail
[ https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240765#comment-17240765 ] zhuqi commented on YARN-10169: -- Fix testComplexValidateAbsoluteResourceConfig not trigger mixed case in patch 002. > Mixed absolute resource value and percentage-based resource value in > CapacityScheduler should fail > -- > > Key: YARN-10169 > URL: https://issues.apache.org/jira/browse/YARN-10169 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Blocker > Attachments: YARN-10169.001.patch, YARN-10169.002.patch > > > To me this is a bug: if there's a queue has capacity set to float, and > maximum-capacity set to absolute value. Existing logic allows the behavior. > For example: > {code:java} > queue.capacity = 0.8 > queue.maximum-capacity = [mem=x, vcore=y] {code} > We should throw exception when configured like this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail
[ https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240742#comment-17240742 ] zhuqi commented on YARN-10169: -- [~leftnoteasy] [~sunil.gov...@gmail.com] [~BilwaST] [~prabhujoseph] I have submitted a patch to fix it, if anyone can review it. Thanks. > Mixed absolute resource value and percentage-based resource value in > CapacityScheduler should fail > -- > > Key: YARN-10169 > URL: https://issues.apache.org/jira/browse/YARN-10169 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Blocker > Attachments: YARN-10169.001.patch > > > To me this is a bug: if there's a queue has capacity set to float, and > maximum-capacity set to absolute value. Existing logic allows the behavior. > For example: > {code:java} > queue.capacity = 0.8 > queue.maximum-capacity = [mem=x, vcore=y] {code} > We should throw exception when configured like this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10169) Mixed absolute resource value and percentage-based resource value in CapacityScheduler should fail
[ https://issues.apache.org/jira/browse/YARN-10169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-10169: Assignee: zhuqi (was: Tanu Ajmera) > Mixed absolute resource value and percentage-based resource value in > CapacityScheduler should fail > -- > > Key: YARN-10169 > URL: https://issues.apache.org/jira/browse/YARN-10169 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Blocker > > To me this is a bug: if there's a queue has capacity set to float, and > maximum-capacity set to absolute value. Existing logic allows the behavior. > For example: > {code:java} > queue.capacity = 0.8 > queue.maximum-capacity = [mem=x, vcore=y] {code} > We should throw exception when configured like this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239997#comment-17239997 ] zhuqi commented on YARN-10503: -- cc [~wangda] [~sunilg] [~tangzhankun] Please share any advice you have about it. > Support queue capacity in terms of absolute resources with more resourceTypes. > -- > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.3.1, 3.4.1 > > > Currently the absolute resource types are memory and vcores. > {code:java} > /** > * Different resource types supported. > */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very important for cluster scaling when there are absolute demands > on different resource types. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239997#comment-17239997 ] zhuqi edited comment on YARN-10503 at 11/28/20, 4:05 PM: - cc [~wangda] [~sunilg] [~tangzhankun] Please share any advice you have about it. Thanks. was (Author: zhuqi): cc [~wangda] [~sunilg] [~tangzhankun] If you any advice about it. > Support queue capacity in terms of absolute resources with more resourceTypes. > -- > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.3.1, 3.4.1 > > > Currently the absolute resource types are memory and vcores. > {code:java} > /** > * Different resource types supported. > */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very important for cluster scaling when there are absolute demands > on different resource types. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10503) Support queue capacity in terms of absolute resources with more resourceTypes.
zhuqi created YARN-10503: Summary: Support queue capacity in terms of absolute resources with more resourceTypes. Key: YARN-10503 URL: https://issues.apache.org/jira/browse/YARN-10503 Project: Hadoop YARN Issue Type: Improvement Reporter: zhuqi Assignee: zhuqi Fix For: 3.3.1, 3.4.1 Currently the absolute resource types are memory and vcores. {code:java} /** * Different resource types supported. */ public enum AbsoluteResourceType { MEMORY, VCORES; }{code} But in our GPU production clusters, we need to support more resourceTypes. It's very important for cluster scaling when there are absolute demands on different resource types. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
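[Editor's note] The generalization YARN-10503 proposes (validating absolute capacities against the cluster's registered resource types instead of a fixed {{MEMORY, VCORES}} enum) might look roughly like the following. The class, the spec format, and the registered-type set are assumptions for illustration; in YARN the registered types would come from the resource-types configuration rather than a hard-coded set:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: accept an absolute capacity spec such as
// "memory-mb=4096,vcores=4,yarn.io/gpu=2" only if every named resource
// type is registered with the cluster. Illustrative names only.
public class AbsoluteResourceTypes {

  // Cluster-registered resource types, supplied by the caller in this toy.
  private final Set<String> registered;

  public AbsoluteResourceTypes(Set<String> registered) {
    this.registered = registered;
  }

  // Returns true only when every "type=value" entry names a registered type.
  public boolean isValid(String spec) {
    for (String part : spec.split(",")) {
      String type = part.split("=")[0].trim();
      if (!registered.contains(type)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    AbsoluteResourceTypes types = new AbsoluteResourceTypes(
        new HashSet<>(Arrays.asList("memory-mb", "vcores", "yarn.io/gpu")));
    System.out.println(types.isValid("memory-mb=4096,vcores=4,yarn.io/gpu=2"));
    System.out.println(types.isValid("memory-mb=4096,fpga=1"));
  }
}
```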
[jira] [Comment Edited] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling
[ https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239571#comment-17239571 ] zhuqi edited comment on YARN-6224 at 11/27/20, 9:41 AM: cc [~wangda] [~bibinchundatt] [~sunilg] [~tangzhankun] It would be a good policy to choose nodes based on the dominant node resource, as more and more resource types, such as GPU/FPGA, appear in our production clusters. # We should pass a ResourceInfo that includes DominantInfo to the multi-node sort policy. # We should sort the nodes according to the dominant resource passed in. was (Author: zhuqi): cc [~wangda] [~bibinchundatt] [~sunilg] [~tangzhankun] It's a good policy to choose node based dominant node resource, when with more and more resource type in our production cluster, such as GPU/FPGA etc. > Should consider utilization of each ResourceType on node while scheduling > - > > Key: YARN-6224 > URL: https://issues.apache.org/jira/browse/YARN-6224 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Jie >Assignee: zhuqi >Priority: Major > > In situations like YARN-6101, if we consider the utilization of every resource type (vcore, > memory) on a node rather than just answering whether we can allocate or not, > we are more likely to achieve better resource utilization as a whole. > With global scheduling, given a set of candidate nodes, we can find the most > promising node to assign to a request by considering node resource utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
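[Editor's note] The dominant-resource node sorting suggested in the comment above can be sketched as a toy comparator: a node's dominant utilization is the maximum used/total ratio across all its resource types, and less-loaded nodes sort first. The {{Node}} type and field layout here are assumptions for illustration, not the actual SchedulerNode API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of sorting candidate nodes by dominant resource utilization.
public class DominantResourceSorter {

  public static class Node {
    final String name;
    // resource type -> {used, total}; illustrative representation only.
    final Map<String, double[]> resources;

    public Node(String name, Map<String, double[]> resources) {
      this.name = name;
      this.resources = resources;
    }

    // Dominant utilization: the highest used/total ratio over all types.
    double dominantUtilization() {
      double max = 0;
      for (double[] r : resources.values()) {
        max = Math.max(max, r[0] / r[1]);
      }
      return max;
    }
  }

  // Sort ascending by dominant utilization and return the node names.
  public static List<String> sortByDominant(List<Node> nodes) {
    nodes.sort(Comparator.comparingDouble(Node::dominantUtilization));
    List<String> names = new ArrayList<>();
    for (Node n : nodes) {
      names.add(n.name);
    }
    return names;
  }

  public static void main(String[] args) {
    // a: memory 8/16 = 0.5, gpu 3/4 = 0.75 -> dominant 0.75
    // b: memory 10/16 = 0.625, gpu 1/4 = 0.25 -> dominant 0.625, so b first
    Node a = new Node("a", Map.of("memory", new double[]{8, 16},
        "gpu", new double[]{3, 4}));
    Node b = new Node("b", Map.of("memory", new double[]{10, 16},
        "gpu", new double[]{1, 4}));
    System.out.println(sortByDominant(new ArrayList<>(List.of(a, b))));
  }
}
```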
[jira] [Assigned] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling
[ https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-6224: --- Assignee: zhuqi > Should consider utilization of each ResourceType on node while scheduling > - > > Key: YARN-6224 > URL: https://issues.apache.org/jira/browse/YARN-6224 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Jie >Assignee: zhuqi >Priority: Major > > In situation like YARN-6101, if we consider all type of resource(vcore, > memory) utilization on node rather than just answer we can allocate or not, > we are more likely to have better resource utilization as a whole. > It is possible that we have a set of candidate nodes, then find the most > promising node to assign to one request considering node resource utilization > with global scheduling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling
[ https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239571#comment-17239571 ] zhuqi edited comment on YARN-6224 at 11/27/20, 8:30 AM: cc [~wangda] [~bibinchundatt] [~sunilg] [~tangzhankun] It's a good policy to choose node based dominant node resource, when with more and more resource type in our production cluster, such as GPU/FPGA etc. was (Author: zhuqi): cc [~wangda] [~bibinchundatt] [~sunilg] It's a good policy to choose node based dominant node resource, when with more and more resource type in our production cluster, such as GPU/FPGA etc. > Should consider utilization of each ResourceType on node while scheduling > - > > Key: YARN-6224 > URL: https://issues.apache.org/jira/browse/YARN-6224 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Jie >Priority: Major > > In situation like YARN-6101, if we consider all type of resource(vcore, > memory) utilization on node rather than just answer we can allocate or not, > we are more likely to have better resource utilization as a whole. > It is possible that we have a set of candidate nodes, then find the most > promising node to assign to one request considering node resource utilization > with global scheduling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6224) Should consider utilization of each ResourceType on node while scheduling
[ https://issues.apache.org/jira/browse/YARN-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239571#comment-17239571 ] zhuqi commented on YARN-6224: - cc [~wangda] [~bibinchundatt] [~sunilg] It would be a good policy to choose nodes based on the dominant node resource, as more and more resource types, such as GPU/FPGA, appear in our production clusters. > Should consider utilization of each ResourceType on node while scheduling > - > > Key: YARN-6224 > URL: https://issues.apache.org/jira/browse/YARN-6224 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Jie >Priority: Major > > In situations like YARN-6101, if we consider the utilization of every resource type (vcore, > memory) on a node rather than just answering whether we can allocate or not, > we are more likely to achieve better resource utilization as a whole. > With global scheduling, given a set of candidate nodes, we can find the most > promising node to assign to a request by considering node resource utilization. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
[ https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239526#comment-17239526 ] zhuqi commented on YARN-8557: - [~cheersyang] [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang] I have added the patch; please review it: 1. Make the HB lag configurable. 2. Support excluding unhealthy, decommissioned, and decommissioning nodes. Thanks. > Exclude lagged/unhealthy/decommissioned nodes in async allocating thread > > > Key: YARN-8557 > URL: https://issues.apache.org/jira/browse/YARN-8557 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.4.0 >Reporter: Weiwei Yang >Assignee: zhuqi >Priority: Major > Attachments: YARN-8557.001.patch > > > Currently only HB-lagged nodes are handled, with a hard-coded 2x HB lag, which > we should make configurable. Moreover, we need to exclude unhealthy > and decommissioned nodes too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
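[Editor's note] The exclusion logic described in the comment above can be sketched as a small predicate: skip a node in the async allocating thread when its heartbeat lags beyond a configurable multiple of the heartbeat interval, or when it is unhealthy, decommissioned, or decommissioning. The class, state names, and parameters are illustrative assumptions, not the actual RM node states:

```java
// Sketch of the proposed node filter for the async allocating thread.
public class NodeFilter {

  public enum State { RUNNING, UNHEALTHY, DECOMMISSIONED, DECOMMISSIONING }

  // Configurable lag multiple, replacing the hard-coded "2 times of HB lag".
  private final long heartbeatIntervalMs;
  private final int maxLagMultiple;

  public NodeFilter(long heartbeatIntervalMs, int maxLagMultiple) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.maxLagMultiple = maxLagMultiple;
  }

  public boolean shouldSkip(State state, long lastHeartbeatMs, long nowMs) {
    if (state != State.RUNNING) {
      // Unhealthy / decommissioned / decommissioning nodes are excluded.
      return true;
    }
    // Exclude nodes whose heartbeat lags beyond the configured multiple.
    return nowMs - lastHeartbeatMs > maxLagMultiple * heartbeatIntervalMs;
  }

  public static void main(String[] args) {
    NodeFilter f = new NodeFilter(1000, 2); // 1s HB interval, allow 2x lag
    System.out.println(f.shouldSkip(State.RUNNING, 0, 1500));  // within lag
    System.out.println(f.shouldSkip(State.RUNNING, 0, 2500));  // lagged
    System.out.println(f.shouldSkip(State.UNHEALTHY, 0, 100)); // bad state
  }
}
```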
[jira] [Assigned] (YARN-8557) Exclude lagged/unhealthy/decommissioned nodes in async allocating thread
[ https://issues.apache.org/jira/browse/YARN-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-8557: --- Assignee: zhuqi > Exclude lagged/unhealthy/decommissioned nodes in async allocating thread > > > Key: YARN-8557 > URL: https://issues.apache.org/jira/browse/YARN-8557 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Weiwei Yang >Assignee: zhuqi >Priority: Major > > Currently only HB-lagged nodes are handled, with a hard-coded 2x HB lag, which > we should make configurable. Moreover, we need to exclude unhealthy > and decommissioned nodes too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239499#comment-17239499 ] zhuqi commented on YARN-10380: -- It seems the patch did not trigger Jenkins; I added a pull request to trigger it. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch, YARN-10380.002.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic as async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is this the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) > different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239076#comment-17239076 ] zhuqi edited comment on YARN-10380 at 11/27/20, 2:25 AM: - [~wangda] [~tangzhankun] [~sunilg] [~BilwaST] [~Tao Yang] I have attached the draft patch, if you can review it. Thanks. was (Author: zhuqi): [~wangda] [~sunilg] [~BilwaST] [~Tao Yang] I have attached the draft patch, if you can review it. Thanks. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch, YARN-10380.002.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic of async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In above logic, if we have thousands of node in one > partition, we will repeatly access all nodes of the partition thousands of > times. > I would suggest looking at making entry-point for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) to be > different. > Node-heartbeat and async-scheduling (single node) can be still similar and > share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9449) Non-exclusive labels can create reservation loop on cluster without unlabeled node
[ https://issues.apache.org/jira/browse/YARN-9449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239269#comment-17239269 ] zhuqi commented on YARN-9449: - [~aditya9277] Applying the patch from https://issues.apache.org/jira/browse/YARN-9903 may be helpful. > Non-exclusive labels can create reservation loop on cluster without unlabeled > node > -- > > Key: YARN-9449 > URL: https://issues.apache.org/jira/browse/YARN-9449 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.5 >Reporter: Brandon Scheller >Priority: Major > > https://issues.apache.org/jira/browse/YARN-5342 added a counter to YARN so > that unscheduled resource requests would be attempted on > unlabeled nodes first. > This counter is reset only when an attempt to schedule happens on an > unlabeled node. > On Hadoop clusters with only labeled nodes, this counter can never be reset, > and therefore it blocks skipping that node. > Because the node will not be skipped, it creates the loop shown below in the > YARN RM logs. > This can block scheduling of a Spark executor, for example, and cause the Spark > application to get stuck. 
> > {{_2019-02-18 23:54:22,591 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (ResourceManager Event Processor): container_1550533628872_0003_01_23 > Container Transitioned from NEW to RESERVED 2019-02-18 23:54:22,591 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator > (ResourceManager Event Processor): Reserved container > application=application_1550533628872_0003 resource= > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 > cluster= 2019-02-18 23:54:22,592 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue > (ResourceManager Event Processor): assignedContainer queue=root > usedCapacity=0.0 absoluteUsedCapacity=0.0 used= > cluster= 2019-02-18 23:54:23,592 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler > (ResourceManager Event Processor): Trying to fulfill reservation for > application application_1550533628872_0003 on node: > ip-10-0-0-122.ec2.internal:8041 2019-02-18 23:54:23,592 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp > (ResourceManager Event Processor): Application > application_1550533628872_0003 unreserved on node host: > ip-10-0-0-122.ec2.internal:8041 #containers=1 available= vCores:7> used=, currently has 0 at priority 1; > currentReservation on node-label=LABELED 2019-02-18 > 23:54:23,593 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (ResourceManager Event Processor): container_1550533628872_0003_01_24 > Container Transitioned from NEW to RESERVED 2019-02-18 23:54:23,593 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator > (ResourceManager Event Processor): Reserved container > application=application_1550533628872_0003 resource= > 
queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 > cluster= 2019-02-18 23:54:23,593 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue > (ResourceManager Event Processor): assignedContainer queue=root > usedCapacity=0.0 absoluteUsedCapacity=0.0 used= > cluster= 2019-02-18 23:54:24,593 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler > (ResourceManager Event Processor): Trying to fulfill reservation for > application application_1550533628872_0003 on node: > ip-10-0-0-122.ec2.internal:8041 2019-02-18 23:54:24,593 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp > (ResourceManager Event Processor): Application > application_1550533628872_0003 unreserved on node host: > ip-10-0-0-122.ec2.internal:8041 #containers=1 available= vCores:7> used=, currently has 0 at priority 1; > currentReservation on node-label=LABELED 2019-02-18 > 23:54:24,594 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl > (ResourceManager Event Processor): container_1550533628872_0003_01_25 > Container Transitioned from NEW to RESERVED 2019-02-18 23:54:24,594 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator > (ResourceManager Event Processor): Reserved container > application=application_1550533628872_0003 resource= > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6ffe0dc3 >
[jira] [Comment Edited] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239076#comment-17239076 ] zhuqi edited comment on YARN-10380 at 11/26/20, 10:55 AM: -- [~wangda] [~sunilg] [~BilwaST] [~Tao Yang] I have attached the draft patch; please review it when you have a chance. Thanks. was (Author: zhuqi): [~wangda] [~sunilg] [~BilwaST] I have attached the draft patch; let me know if you have any advice. Thanks. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic as async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. 
> async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
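The partition-first loop in the pseudo code above can be modeled with plain collections. This is a simplified, self-contained sketch, not actual CapacityScheduler code: the class and method names (PartitionFirstAllocator, candidatesByPartition, schedulePass) are hypothetical stand-ins, and "allocation" is reduced to recording visit order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the suggested multi-node entry point; NOT the
// actual YARN API, just the loop shape the issue proposes.
public class PartitionFirstAllocator {

  // In this model a node belongs to exactly one partition (node label).
  public static class Node {
    public final String name;
    public final String partition;

    public Node(String name, String partition) {
      this.name = name;
      this.partition = partition;
    }
  }

  // Group candidate nodes by partition once, so a scheduling pass walks each
  // partition's candidate set a single time instead of re-scanning all nodes
  // per node, which is the inefficiency the description points out.
  public static Map<String, List<Node>> candidatesByPartition(List<Node> nodes) {
    Map<String, List<Node>> byPartition = new HashMap<>();
    for (Node n : nodes) {
      byPartition.computeIfAbsent(n.partition, k -> new ArrayList<>()).add(n);
    }
    return byPartition;
  }

  // One multi-node pass, i.e. "for (partition : all partitions) {
  //   allocateContainersOnMultiNodes(getCandidate(partition)) }".
  public static List<String> schedulePass(List<Node> nodes) {
    List<String> visited = new ArrayList<>();
    for (Map.Entry<String, List<Node>> e : candidatesByPartition(nodes).entrySet()) {
      for (Node n : e.getValue()) {
        visited.add(e.getKey() + ":" + n.name);
      }
    }
    return visited;
  }
}
```

With this shape, each partition's candidate set is computed once per pass, so the cost is proportional to the node count rather than nodes times nodes within a partition.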
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239076#comment-17239076 ] zhuqi commented on YARN-10380: -- [~wangda] [~sunilg] [~BilwaST] I have attached the draft patch; let me know if you have any advice. Thanks. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic as async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-10380: Assignee: zhuqi > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: zhuqi >Priority: Critical > Attachments: YARN-10380.001.patch > > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic as async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In the above logic, if we have thousands of nodes in one > partition, we will repeatedly access all nodes of the partition thousands of > times. > I would suggest making the entry points for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) different. > Node-heartbeat and async-scheduling (single node) can still be similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception
[ https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236015#comment-17236015 ] zhuqi edited comment on YARN-10132 at 11/23/20, 9:55 AM: - cc [~BilwaST] [~subru] : I have attached a patch for this, if you can review it for the merge. Thanks. was (Author: zhuqi): cc [~BilwaST] : I have attached a patch for this, if you can review it for the merge. Thanks. > For Federation,yarn applicationattempt fail command throws an exception > --- > > Key: YARN-10132 > URL: https://issues.apache.org/jira/browse/YARN-10132 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Major > Attachments: YARN-10132.001.patch, YARN-10132.002.patch, > YARN-10132.003.patch, YARN-10132.004.patch > > > yarn applicationattempt fail command is failing with exception > “org.apache.commons.lang.NotImplementedException: Code is not implemented”. > {noformat} > ./yarn applicationattempt -fail appattempt_1581497870689_0001_01 > Failing attempt appattempt_1581497870689_0001_01 of application > application_1581497870689_0001 > 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt > appattempt_1581497870689_0001_01 > Exception in thread "main" org.apache.commons.lang.NotImplementedException: > Code is not implemented > at > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980) > at > org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455) > at
[jira] [Comment Edited] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented
[ https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237256#comment-17237256 ] zhuqi edited comment on YARN-10111 at 11/23/20, 9:54 AM: - cc [~BilwaST] [~ayushtkn] [~tangzhankun] I have now added a draft patch that keeps the queue name consistent across all sub-clusters. We should discuss how to obtain the capacity-related values: each sub-cluster has its own total resources, and the getQueueInfo API can only return values from a single RM, so computing the capacity-related values requires aggregating the resource information of all sub-clusters first. Any advice? Thanks. was (Author: zhuqi): cc [~BilwaST] I have now added a draft patch that keeps the queue name consistent across all sub-clusters. We should discuss how to obtain the capacity-related values: each sub-cluster has its own total resources, and the getQueueInfo API can only return values from a single RM, so computing the capacity-related values requires aggregating the resource information of all sub-clusters first. Any advice? Thanks. > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented > -- > > Key: YARN-10111 > URL: https://issues.apache.org/jira/browse/YARN-10111 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Blocker > Attachments: YARN-10111.001.patch > > > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
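One way to derive the federation-level capacity values the comment asks about is to sum absolute resources across sub-clusters first and only then compute a fraction. The sketch below is a hypothetical helper with made-up names, not part of the Router or FederationClientInterceptor API; memory-only accounting is an assumption for brevity.

```java
import java.util.Map;

// Hypothetical helper, not Hadoop code: derives a queue's federation-wide
// capacity from per-sub-cluster absolute resources.
public class FederatedQueueCapacity {

  // clusterTotalMb: each sub-cluster's total memory in MB.
  // queueGuaranteedMb: the memory guaranteed to the queue in each sub-cluster.
  // A capacity percentage reported by a single RM is meaningless at the
  // federation level, so we aggregate absolute resources before dividing.
  public static double aggregatedCapacity(Map<String, Long> clusterTotalMb,
                                          Map<String, Long> queueGuaranteedMb) {
    long total = 0L;
    long guaranteed = 0L;
    for (Map.Entry<String, Long> e : clusterTotalMb.entrySet()) {
      total += e.getValue();
      guaranteed += queueGuaranteedMb.getOrDefault(e.getKey(), 0L);
    }
    // Guard against an empty federation view.
    return total == 0L ? 0.0 : (double) guaranteed / total;
  }
}
```

A queue guaranteed 50 MB in each of two sub-clusters totalling 400 MB would report a federation-wide capacity of 0.25 rather than the per-RM percentages.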
[jira] [Comment Edited] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception
[ https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236015#comment-17236015 ] zhuqi edited comment on YARN-10132 at 11/23/20, 9:53 AM: - cc [~BilwaST] : I have attached a patch for this, if you can review it for the merge. Thanks. was (Author: zhuqi): cc [~bilwa_st] : I have attached a patch for this, if you can review it for the merge. Thanks. > For Federation,yarn applicationattempt fail command throws an exception > --- > > Key: YARN-10132 > URL: https://issues.apache.org/jira/browse/YARN-10132 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Major > Attachments: YARN-10132.001.patch, YARN-10132.002.patch, > YARN-10132.003.patch, YARN-10132.004.patch > > > yarn applicationattempt fail command is failing with exception > “org.apache.commons.lang.NotImplementedException: Code is not implemented”. > {noformat} > ./yarn applicationattempt -fail appattempt_1581497870689_0001_01 > Failing attempt appattempt_1581497870689_0001_01 of application > application_1581497870689_0001 > 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt > appattempt_1581497870689_0001_01 > Exception in thread "main" org.apache.commons.lang.NotImplementedException: > Code is not implemented > at > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980) > at > org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455) > at
[jira] [Comment Edited] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented
[ https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237256#comment-17237256 ] zhuqi edited comment on YARN-10111 at 11/23/20, 9:52 AM: - cc [~BilwaST] I have now added a draft patch that keeps the queue name consistent across all sub-clusters. We should discuss how to obtain the capacity-related values: each sub-cluster has its own total resources, and the getQueueInfo API can only return values from a single RM, so computing the capacity-related values requires aggregating the resource information of all sub-clusters first. Any advice? Thanks. was (Author: zhuqi): cc [~bilwa_st] I have now added a draft patch that keeps the queue name consistent across all sub-clusters. We should discuss how to obtain the capacity-related values: each sub-cluster has its own total resources, and the getQueueInfo API can only return values from a single RM, so computing the capacity-related values requires aggregating the resource information of all sub-clusters first. Any advice? Thanks. > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented > -- > > Key: YARN-10111 > URL: https://issues.apache.org/jira/browse/YARN-10111 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Blocker > Attachments: YARN-10111.001.patch > > > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented
[ https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17237256#comment-17237256 ] zhuqi commented on YARN-10111: -- cc [~bilwa_st] I have now added a draft patch that keeps the queue name consistent across all sub-clusters. We should discuss how to obtain the capacity-related values: each sub-cluster has its own total resources, and the getQueueInfo API can only return values from a single RM, so computing the capacity-related values requires aggregating the resource information of all sub-clusters first. Any advice? Thanks. > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented > -- > > Key: YARN-10111 > URL: https://issues.apache.org/jira/browse/YARN-10111 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Blocker > Attachments: YARN-10111.001.patch > > > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented
[ https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10111: - Attachment: (was: YARN-10111.001.patch) > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented > -- > > Key: YARN-10111 > URL: https://issues.apache.org/jira/browse/YARN-10111 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Blocker > > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented
[ https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10111: - Attachment: YARN-10111.001.patch > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented > -- > > Key: YARN-10111 > URL: https://issues.apache.org/jira/browse/YARN-10111 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Blocker > > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10111) In Federation cluster Distributed Shell Application submission fails as YarnClient#getQueueInfo is not implemented
[ https://issues.apache.org/jira/browse/YARN-10111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-10111: Assignee: zhuqi > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented > -- > > Key: YARN-10111 > URL: https://issues.apache.org/jira/browse/YARN-10111 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Blocker > > In Federation cluster Distributed Shell Application submission fails as > YarnClient#getQueueInfo is not implemented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10311) Yarn Service should support obtaining tokens from multiple name services
[ https://issues.apache.org/jira/browse/YARN-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236588#comment-17236588 ] zhuqi commented on YARN-10311: -- [~BilwaST] This is a valuable improvement; is it still on track to be merged? > Yarn Service should support obtaining tokens from multiple name services > > > Key: YARN-10311 > URL: https://issues.apache.org/jira/browse/YARN-10311 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10311.001.patch, YARN-10311.002.patch > > > Currently yarn services support tokens for a single name service only. We can add a new > conf called "yarn.service.hdfs-servers" to support this -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
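The proposed property could be configured as sketched below. Only the property name "yarn.service.hdfs-servers" comes from the issue description; the name service URIs are example values, and the exact value syntax is an assumption until the patch lands.

```xml
<!-- Sketch only: would let a YARN service obtain delegation tokens from
     several HDFS name services instead of only fs.defaultFS.
     ns1/ns2 are hypothetical name service IDs. -->
<property>
  <name>yarn.service.hdfs-servers</name>
  <value>hdfs://ns1,hdfs://ns2</value>
</property>
```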
[jira] [Commented] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception
[ https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236015#comment-17236015 ] zhuqi commented on YARN-10132: -- cc [~bilwa_st] : I have attached a patch for this, if you can review it for the merge. Thanks. > For Federation,yarn applicationattempt fail command throws an exception > --- > > Key: YARN-10132 > URL: https://issues.apache.org/jira/browse/YARN-10132 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Major > Attachments: YARN-10132.001.patch, YARN-10132.002.patch > > > yarn applicationattempt fail command is failing with exception > “org.apache.commons.lang.NotImplementedException: Code is not implemented”. > {noformat} > ./yarn applicationattempt -fail appattempt_1581497870689_0001_01 > Failing attempt appattempt_1581497870689_0001_01 of application > application_1581497870689_0001 > 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt > appattempt_1581497870689_0001_01 > Exception in thread "main" org.apache.commons.lang.NotImplementedException: > Code is not implemented > at > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980) > at > org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:119) > {noformat} --
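The missing Router implementation would presumably look up the application's home sub-cluster and forward the fail request there, rather than throwing NotImplementedException. The following is a simplified, self-contained model of that dispatch; the class name and string-based IDs are hypothetical stand-ins, not the real FederationClientInterceptor types.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not actual Hadoop code: the real fix would resolve
// the application's home sub-cluster from the federation state store and
// forward the failApplicationAttempt RPC to that RM.
public class RouterFailAttemptModel {

  // application id -> home sub-cluster id, as the state store would record it.
  private final Map<String, String> appToHomeSubCluster = new HashMap<>();

  public void registerApp(String appId, String subClusterId) {
    appToHomeSubCluster.put(appId, subClusterId);
  }

  // Resolve the attempt's application and return the sub-cluster the fail
  // request should be forwarded to.
  public String failApplicationAttempt(String attemptId) {
    // appattempt_<cluster ts>_<app seq>_<attempt> -> application_<cluster ts>_<app seq>
    String appId = attemptId.replaceFirst("^appattempt_", "application_")
                            .replaceFirst("_\\d+$", "");
    String home = appToHomeSubCluster.get(appId);
    if (home == null) {
      throw new IllegalStateException("Unknown application " + appId);
    }
    return home;
  }
}
```

In the real interceptor the returned sub-cluster would be used to pick the right ApplicationClientProtocol proxy before forwarding the call.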
[jira] [Assigned] (YARN-10132) For Federation,yarn applicationattempt fail command throws an exception
[ https://issues.apache.org/jira/browse/YARN-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-10132: Assignee: zhuqi > For Federation,yarn applicationattempt fail command throws an exception > --- > > Key: YARN-10132 > URL: https://issues.apache.org/jira/browse/YARN-10132 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Sushanta Sen >Assignee: zhuqi >Priority: Major > > yarn applicationattempt fail command is failing with exception > “org.apache.commons.lang.NotImplementedException: Code is not implemented”. > {noformat} > ./yarn applicationattempt -fail appattempt_1581497870689_0001_01 > Failing attempt appattempt_1581497870689_0001_01 of application > application_1581497870689_0001 > 2020-02-12 20:48:48,530 INFO impl.YarnClientImpl: Failing application attempt > appattempt_1581497870689_0001_01 > Exception in thread "main" org.apache.commons.lang.NotImplementedException: > Code is not implemented > at > org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor.failApplicationAttempt(FederationClientInterceptor.java:980) > at > org.apache.hadoop.yarn.server.router.clientrm.RouterClientRMService.failApplicationAttempt(RouterClientRMService.java:388) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.failApplicationAttempt(ApplicationClientProtocolPBServiceImpl.java:210) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:581) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:863) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2793) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateRuntimeException(RPCUtil.java:85) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:122) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.failApplicationAttempt(ApplicationClientProtocolPBClientImpl.java:223) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.failApplicationAttempt(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.failApplicationAttempt(YarnClientImpl.java:447) > at > 
org.apache.hadoop.yarn.client.cli.ApplicationCLI.failApplicationAttempt(ApplicationCLI.java:985) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:455) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:119) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
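The NotImplementedException above is thrown because FederationClientInterceptor.failApplicationAttempt has no body yet. A minimal sketch of the delegation pattern such a Router-side method could follow — resolve the application's home subcluster, then forward the call — using illustrative stand-in types, not the real Hadoop Federation classes (the map below plays the role of the federation state store):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the delegation pattern a Router-side interceptor could
// use instead of throwing NotImplementedException. FailAttemptRouting and
// HOME_SUBCLUSTER are illustrative stand-ins, not the real Hadoop
// Federation classes; the map plays the role of the federation state store.
class FailAttemptRouting {

  // application id -> home subcluster id (stand-in for the state store)
  static final Map<String, String> HOME_SUBCLUSTER = new HashMap<>();

  /** Resolve the home subcluster the fail-attempt call is forwarded to. */
  static String routeFailAttempt(String appAttemptId) {
    // appattempt_<clusterTs>_<seq>_<attempt> -> application_<clusterTs>_<seq>
    String[] parts = appAttemptId.split("_");
    String appId = "application_" + parts[1] + "_" + parts[2];
    String home = HOME_SUBCLUSTER.get(appId);
    if (home == null) {
      throw new IllegalArgumentException(
          "Application " + appId + " does not exist in any subcluster");
    }
    // The real interceptor would invoke failApplicationAttempt on the
    // ApplicationClientProtocol proxy of this subcluster here.
    return home;
  }
}
```

With the home subcluster known, the Router can forward the call to one ResourceManager instead of failing the whole CLI command with "Code is not implemented".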
[jira] [Comment Edited] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227197#comment-17227197 ] zhuqi edited comment on YARN-10463 at 11/6/20, 7:18 AM: Hi [~BilwaST] Rebased and fixed all checkstyle issues in the latest patch. Thanks for your review. was (Author: zhuqi): Hi [~BilwaST] Fixed all checkstyle issues in the latest patch. Thanks. > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch, YARN-10463.002.patch, > YARN-10463.003.patch, YARN-10463.004.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227197#comment-17227197 ] zhuqi commented on YARN-10463: -- Hi [~BilwaST] Fixed all checkstyle in latest patch. Thanks. > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch, YARN-10463.002.patch, > YARN-10463.003.patch, YARN-10463.004.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227129#comment-17227129 ] zhuqi commented on YARN-10463: -- Hi [~BilwaST] I have fixed the above problem, as well as the bug where getApplicationAttemptReportLatency was not initialized, in the new patch. Thanks. > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch, YARN-10463.002.patch, > YARN-10463.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10470) The bower install should support building the new web UI as the root user.
zhuqi created YARN-10470: Summary: The bower install should support building the new web UI as the root user. Key: YARN-10470 URL: https://issues.apache.org/jira/browse/YARN-10470 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: zhuqi Assignee: zhuqi Attachments: image-2020-10-22-22-39-48-709.png !image-2020-10-22-22-39-48-709.png|width=1120,height=342! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218198#comment-17218198 ] zhuqi edited comment on YARN-10463 at 10/22/20, 5:50 AM: - Hi [~BilwaST] I have addressed your comments in the new patch, but how can we fill the appAttemptId into the RMApp in ClientRMService? The test fails in ClientRMService: {code:java} // code placeholder RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); if (appAttempt == null) { throw new ApplicationAttemptNotFoundException( "ApplicationAttempt with id '" + appAttemptId + "' doesn't exist in RM."); } {code} Thanks. was (Author: zhuqi): Hi [~BilwaST] I have addressed your comments in the new patch, but how can we fill the appAttemptId into ClientRMService? The test fails in ClientRMService: {code:java} // code placeholder RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); if (appAttempt == null) { throw new ApplicationAttemptNotFoundException( "ApplicationAttempt with id '" + appAttemptId + "' doesn't exist in RM."); } {code} Thanks. > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch, YARN-10463.002.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218198#comment-17218198 ] zhuqi commented on YARN-10463: -- Hi [~BilwaST] I have addressed your comments in the new patch, but how can we fill the appAttemptId into ClientRMService? The test fails in ClientRMService: {code:java} // code placeholder RMAppAttempt appAttempt = application.getAppAttempts().get(appAttemptId); if (appAttempt == null) { throw new ApplicationAttemptNotFoundException( "ApplicationAttempt with id '" + appAttemptId + "' doesn't exist in RM."); } {code} Thanks. > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch, YARN-10463.002.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
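The check quoted above only succeeds if the attempt was registered on the application beforehand, so the test setup has to fill the attempt map before calling the report API. A tiny model of that lookup, with MockApp as a hypothetical stand-in for the real RMApp:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the lookup quoted above: the RM only finds an attempt
// that was registered on the application beforehand, so a test has to fill
// the attempt map before calling the report API. MockApp is a hypothetical
// stand-in, not the real RMApp interface.
class AttemptLookup {

  static class MockApp {
    final Map<String, String> attempts = new HashMap<>();
  }

  static String getAttemptReport(MockApp app, String attemptId) {
    String attempt = app.attempts.get(attemptId);
    if (attempt == null) {
      // Mirrors the ApplicationAttemptNotFoundException branch above.
      throw new IllegalStateException("ApplicationAttempt with id '"
          + attemptId + "' doesn't exist in RM.");
    }
    return attempt;
  }
}
```

Populating the attempt map before the lookup is the part the failing test is missing.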
[jira] [Assigned] (YARN-10144) Federation: Add missing FederationClientInterceptor APIs
[ https://issues.apache.org/jira/browse/YARN-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-10144: Assignee: zhuqi > Federation: Add missing FederationClientInterceptor APIs > > > Key: YARN-10144 > URL: https://issues.apache.org/jira/browse/YARN-10144 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Reporter: D M Murali Krishna Reddy >Assignee: zhuqi >Priority: Major > > In FederationClientInterceptor, many API's are not Implemented. > * getClusterNodes > * getQueueInfo > * getQueueUserAcls > * moveApplicationAcrossQueues > * getNewReservation > * submitReservation > * listReservations > * updateReservation > * deleteReservation > * getNodeToLabels > * getLabelsToNodes > * getClusterNodeLabels > * getApplicationAttemptReport > * getApplicationAttempts > * getContainerReport > * getContainers > * getDelegationToken > * renewDelegationToken > * cancelDelegationToken > * failApplicationAttempt > * updateApplicationPriority > * signalToContainer > * updateApplicationTimeouts > * getResourceProfiles > * getResourceProfile > * getResourceTypeInfo > * getAttributesToNodes > * getClusterNodeAttributes > * getNodesToAttributes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215259#comment-17215259 ] zhuqi commented on YARN-10463: -- Hi [~BilwaST] Thanks for your quick reply and review. I understand it now, but I have another question: when we call getQueueInfo with one queue's absolute path, and more than two subclusters have the same absolute path, does that mean we will return a map with the subcluster ID as key and the queue info as values? > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215245#comment-17215245 ] zhuqi commented on YARN-10463: -- cc [~adam.antal] [~Tao Yang] [~dmmkr] [~BilwaST] I added getApplicationAttemptReport for Federation, and I can take the remaining missing FederationClientInterceptor APIs, but I have one question: should we pass a subcluster id and list only that subcluster's results, or return the results of all subclusters to the client? For example, for getClusterNodes, should we list the nodes of all subclusters one by one, or just pass the subcluster id we want to query? Thanks. > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
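The two alternatives raised here for getClusterNodes — aggregate every subcluster's nodes, or restrict the query to one subcluster id — can be sketched as follows. The map is a stand-in for the per-subcluster results a Router would gather, not a real YARN API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of the two alternatives discussed for getClusterNodes:
// aggregate the node lists of every subcluster, or restrict the listing to
// a single subcluster id. The map is a stand-in for the per-subcluster
// results a Router would gather, not a real YARN API.
class NodeListing {

  static List<String> listNodes(Map<String, List<String>> bySubCluster,
                                String subClusterId /* null = all */) {
    if (subClusterId != null) {
      return bySubCluster.getOrDefault(subClusterId, new ArrayList<>());
    }
    List<String> all = new ArrayList<>();
    for (List<String> nodes : bySubCluster.values()) {
      all.addAll(nodes); // merge every subcluster's nodes into one response
    }
    return all;
  }
}
```

Either behavior fits behind the same signature; the open question is only which default the Router should expose to clients.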
[jira] [Updated] (YARN-10463) For Federation, we should support getApplicationAttemptReport.
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10463: - Summary: For Federation, we should support getApplicationAttemptReport. (was: For Federation, we should support getApplicationAttemptReport。) > For Federation, we should support getApplicationAttemptReport. > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10463.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10463) For Federation, we should support getApplicationAttemptReport。
[ https://issues.apache.org/jira/browse/YARN-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10463: - Parent: YARN-10144 Issue Type: Sub-task (was: Task) > For Federation, we should support getApplicationAttemptReport。 > -- > > Key: YARN-10463 > URL: https://issues.apache.org/jira/browse/YARN-10463 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10463) For Federation, we should support getApplicationAttemptReport。
zhuqi created YARN-10463: Summary: For Federation, we should support getApplicationAttemptReport。 Key: YARN-10463 URL: https://issues.apache.org/jira/browse/YARN-10463 Project: Hadoop YARN Issue Type: Task Reporter: zhuqi Assignee: zhuqi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212437#comment-17212437 ] zhuqi commented on YARN-10448: -- CC [~adam.antal] Thanks for your patient review and commit. The unit test failures are not related to this change, and I have fixed the checkstyle warning in the new patch. > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch, > YARN-10448.003.patch, YARN-10448.004.patch, > image-2020-10-11-22-01-37-227.png, image-2020-10-11-22-02-17-166.png > > > When using the synthetic generator json file example from the doc ( > https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format > ), it throws the following exception: > {noformat} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {noformat} > So the solution is either: > 1) to make {{user_name}} a mandatory field, or > 2) to set default user in SLS code if the json file does not define it. 
> IMO, solution 2 might be better, because in most cases (if not all) > {{user_name}} has no impact on scheduler performance, thus it is reasonable > to make it an optional field, which is also consistent with the {{job.user}} > field in SLS JSON file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
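Option 2 from the description — defaulting the user when the SYNTH trace omits {{user_name}} — reduces to a null guard before UserGroupInformation.createRemoteUser is ever called. A hedged sketch; the constant and helper names are illustrative, not the exact SLS patch:

```java
// Hedged sketch of option 2 above: fall back to a fixed default user when
// the SYNTH trace omits user_name, so UserGroupInformation.createRemoteUser
// never receives null. The constant and helper names are illustrative, not
// the exact SLS patch.
class SynthUserDefault {

  private static final String DEFAULT_USER = "default";

  /** Return the trace-supplied user, or the default when it is absent. */
  static String resolveUser(String jobUser) {
    return (jobUser == null || jobUser.isEmpty()) ? DEFAULT_USER : jobUser;
  }
}
```

Because the user name does not affect scheduler behavior in most simulations, a silent fallback keeps {{user_name}} optional, consistent with the {{job.user}} field of the SLS JSON format.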
[jira] [Comment Edited] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211922#comment-17211922 ] zhuqi edited comment on YARN-10448 at 10/11/20, 2:03 PM: - CC [~adam.antal] I have moved the "default" String to a private static final constant. The unit test works well with the "syn_generic.json" file in the resources folder , when we removed the "user_name": "foobar". The following picture passed the sample test. Thanks. was (Author: zhuqi): CC [~adam.antal] I have moved the "default" String to a private static final constant. The unit test works well with the "syn_generic.json" file in the resources folder , when we removed the "user_name": "foobar". Thanks. > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch, > YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, > image-2020-10-11-22-02-17-166.png > > > When using the synthetic generator json file example from the doc ( > https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format > ), it throws the following exception: > {noformat} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {noformat} > So the solution is either: > 1) to make {{user_name}} a mandatory field, or > 2) to set default user in SLS code if the json file does not define it. > IMO, solution 2 might be better, because in most cases (if not all) > {{user_name}} has no impact on scheduler performance, thus it is reasonable > to make it an optional field, which is also consistent with the {{job.user}} > field in SLS JSON file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10448: - Attachment: image-2020-10-11-22-02-17-166.png > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch, > YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, > image-2020-10-11-22-02-17-166.png > > > When using the synthetic generator json file example from the doc ( > https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format > ), it throws the following exception: > {noformat} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {noformat} > So the solution is either: > 1) to make {{user_name}} a mandatory field, or > 2) to set default user in SLS code if the json file does not define it. 
> IMO, solution 2 might be better, because in most cases (if not all) > {{user_name}} has no impact on scheduler performance, thus it is reasonable > to make it an optional field, which is also consistent with the {{job.user}} > field in SLS JSON file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211934#comment-17211934 ] zhuqi commented on YARN-10448: -- !image-2020-10-11-22-01-37-227.png|width=871,height=179! !image-2020-10-11-22-02-17-166.png|width=341,height=42! > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch, > YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png, > image-2020-10-11-22-02-17-166.png > > > When using the synthetic generator json file example from the doc ( > https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format > ), it throws the following exception: > {noformat} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {noformat} > So the solution is either: > 1) to make {{user_name}} a mandatory field, or > 2) to set default user in SLS code if the json file does not define it. 
> IMO, solution 2 might be better, because in most cases (if not all) > {{user_name}} has no impact on scheduler performance, thus it is reasonable > to make it an optional field, which is also consistent with the {{job.user}} > field in SLS JSON file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10448: - Attachment: image-2020-10-11-22-01-37-227.png > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch, > YARN-10448.003.patch, image-2020-10-11-22-01-37-227.png > > > When using the synthetic generator json file example from the doc ( > https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format > ), it throws the following exception: > {noformat} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {noformat} > So the solution is either: > 1) to make {{user_name}} a mandatory field, or > 2) to set default user in SLS code if the json file does not define it. > IMO, solution 2 might be better, because in most cases (if not all) > {{user_name}} has no impact on scheduler performance, thus it is reasonable > to make it an optional field, which is also consistent with the {{job.user}} > field in SLS JSON file. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211922#comment-17211922 ] zhuqi commented on YARN-10448: -- CC [~adam.antal] I have moved the "default" String to a private static final constant. The unit test works well with the "syn_generic.json" file in the resources folder , when we removed the "user_name": "foobar". Thanks. > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch, > YARN-10448.003.patch > > > When using the synthetic generator json file example from the doc ( > https://hadoop.apache.org/docs/current/hadoop-sls/SchedulerLoadSimulator.html#SYNTH_JSON_input_file_format > ), it throws the following exception: > {noformat} > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > {noformat} > So the solution is either: > 1) to make {{user_name}} a mandatory field, or > 2) to set default user in SLS code if the json file does not define it. 
> IMO, solution 2 might be better, because in most cases (if not all) > {{user_name}} has no impact on scheduler performance, thus it is reasonable > to make it an optional field, which is also consistent with the {{job.user}} > field in SLS JSON file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203789#comment-17203789 ] zhuqi commented on YARN-10448: -- CC [~cheersyang] [~Tao Yang] Attached the 002 patch: set the default user in {{startAMFromSynthGenerator()}} if {{job.getUser()}} is null. Thanks. > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch, YARN-10448.002.patch > > > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10448) SLS should set default user to handle SYNTH format
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10448: - Summary: SLS should set default user to handle SYNTH format (was: When use the sls (SYNTH JSON input file format) example the user be null will cause failed.) > SLS should set default user to handle SYNTH format > -- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch > > > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9693) When AMRMProxyService is enabled RMCommunicator will register with failure
[ https://issues.apache.org/jira/browse/YARN-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203663#comment-17203663 ] zhuqi commented on YARN-9693: - CC [~cane] [~panlijie] If the patch fix the problem, i meet the same error too. Thanks. > When AMRMProxyService is enabled RMCommunicator will register with failure > -- > > Key: YARN-9693 > URL: https://issues.apache.org/jira/browse/YARN-9693 > Project: Hadoop YARN > Issue Type: Improvement > Components: federation >Affects Versions: 3.1.2 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9693.001.patch > > > When we enable amrm proxy service, the RMCommunicator will register with > failure below: > {code:java} > 2019-07-23 17:12:44,794 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted > 2019-07-23 17:12:44,794 ERROR [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid > AMRMToken from appattempt_1563872237585_0001_02 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:186) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:123) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:280) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:986) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1300) > at > 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$6.run(MRAppMaster.java:1768) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1716) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1764) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1698) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: > Invalid AMRMToken from appattempt_1563872237585_0001_02 > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy93.registerApplicationMaster(Unknown Source) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:170) > ... 14 more > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > Invalid AMRMToken
[jira] [Commented] (YARN-10448) When use the sls (SYNTH JSON input file format) example the user be null will cause failed.
[ https://issues.apache.org/jira/browse/YARN-10448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201457#comment-17201457 ] zhuqi commented on YARN-10448: -- No new test is needed. > When use the sls (SYNTH JSON input file format) example the user be null will > cause failed. > --- > > Key: YARN-10448 > URL: https://issues.apache.org/jira/browse/YARN-10448 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.2.1, 3.4.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: YARN-10448.001.patch > > > java.lang.IllegalArgumentException: Null user > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) > at > org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[jira] [Created] (YARN-10448) When use the sls (SYNTH JSON input file format) example the user be null will cause failed.
zhuqi created YARN-10448: Summary: When use the sls (SYNTH JSON input file format) example the user be null will cause failed. Key: YARN-10448 URL: https://issues.apache.org/jira/browse/YARN-10448 Project: Hadoop YARN Issue Type: Bug Components: scheduler-load-simulator Reporter: zhuqi Assignee: zhuqi java.lang.IllegalArgumentException: Null user at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1269) at org.apache.hadoop.security.UserGroupInformation.createRemoteUser(UserGroupInformation.java:1256) at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.submitReservationWhenSpecified(AMSimulator.java:191) at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.firstStep(AMSimulator.java:161) at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:88) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
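The trace in YARN-10448 comes from UserGroupInformation.createRemoteUser rejecting a null user when the SYNTH trace omits one. A minimal sketch of the kind of guard a fix could place before that call (the class and constant names here are illustrative assumptions, not the actual YARN-10448 patch):

```java
// Sketch of a defensive fallback for a missing SLS job user.
// SlsUserDefaults and DEFAULT_USER are hypothetical names; the real patch
// may resolve the default differently.
class SlsUserDefaults {
  static final String DEFAULT_USER = "default";

  // Return the job's user, falling back to a default when the SYNTH
  // input omits it, so createRemoteUser() never receives null.
  static String userOrDefault(String user) {
    return (user == null || user.isEmpty()) ? DEFAULT_USER : user;
  }
}
```

With such a guard, a SYNTH input file without a user field would run under the default user instead of failing the AMSimulator thread.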
[jira] [Commented] (YARN-10400) Build the new version of hadoop on Mac os system with bug
[ https://issues.apache.org/jira/browse/YARN-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191573#comment-17191573 ] zhuqi commented on YARN-10400: -- cc [~jiwq] It's a good choice, thanks. > Build the new version of hadoop on Mac os system with bug > - > > Key: YARN-10400 > URL: https://issues.apache.org/jira/browse/YARN-10400 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: zhuqi >Priority: Major > Attachments: image-2020-08-18-00-23-48-730.png > > > !image-2020-08-18-00-23-48-730.png|width=1141,height=449! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10400) Build the new version of hadoop on Mac os system with bug
[ https://issues.apache.org/jira/browse/YARN-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-10400: - Description: !image-2020-08-18-00-23-48-730.png|width=1141,height=449! (was: !image-2020-08-18-00-23-48-730.png!) > Build the new version of hadoop on Mac os system with bug > - > > Key: YARN-10400 > URL: https://issues.apache.org/jira/browse/YARN-10400 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: zhuqi >Priority: Major > Attachments: image-2020-08-18-00-23-48-730.png > > > !image-2020-08-18-00-23-48-730.png|width=1141,height=449! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10400) Build the new version of hadoop on Mac os system with bug
zhuqi created YARN-10400: Summary: Build the new version of hadoop on Mac os system with bug Key: YARN-10400 URL: https://issues.apache.org/jira/browse/YARN-10400 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.3.0 Reporter: zhuqi Attachments: image-2020-08-18-00-23-48-730.png !image-2020-08-18-00-23-48-730.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reopened YARN-2368: - In addition to making yarn.resourcemanager.zk-jutemaxbuffer-bytes configurable, we could also control whether we retry or just log application information that identifies which application is blowing up the ZK buffer. That way we avoid the GC problems caused by retrying too much and timing out the ZK connection, and we can still find the root application that caused the buffer overflow. > ResourceManager failed when ZKRMStateStore tries to update znode data larger > than 1MB > - > > Key: YARN-2368 > URL: https://issues.apache.org/jira/browse/YARN-2368 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo >Assignee: zhuqi >Priority: Critical > Attachments: YARN-2368.patch > > > Both ResourceManagers throw out STATE_STORE_OP_FAILED events and failed > finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode > larger than 1MB, which is the default configuration of ZooKeeper server and > client in 'jute.maxbuffer'. > ResourceManager (ip addr: 10.153.80.8) log shows as the following: > {code} > 2014-07-25 22:33:11,078 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2014-07-25 22:33:11,078 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2014-07-25 22:33:11,214 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for > /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} > Meanwhile, ZooKeeps log shows as the following: > {code} > 2014-07-25 22:10:09,728 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - > Accepted 
socket connection from /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client > attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating > client: 0x247684586e70006 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session > 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth > packet /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth > success /10.153.80.8:58890 >
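The reopen comment on YARN-2368 proposes two things: a configurable buffer limit, and identifying the application whose state overflows it instead of retrying until the ZooKeeper session times out. A rough sketch of such a pre-write size check (the class, method, and the way the limit is supplied are illustrative assumptions, not the actual ZKRMStateStore code):

```java
// Sketch: check the serialized znode size against a configurable limit
// before writing, so an oversized application state is logged with its id
// rather than retried until the ZooKeeper session times out.
// ZnodeSizeGuard is a hypothetical helper, not part of Hadoop.
class ZnodeSizeGuard {
  private final int maxBytes; // would come from a configurable RM property

  ZnodeSizeGuard(int maxBytes) { this.maxBytes = maxBytes; }

  // Returns true when the write may proceed; otherwise the caller should
  // skip the retry loop and surface the offending application id.
  boolean canStore(String appId, byte[] data) {
    if (data.length > maxBytes) {
      System.err.println("Skipping ZK store for " + appId + ": "
          + data.length + " bytes exceeds limit " + maxBytes);
      return false;
    }
    return true;
  }
}
```

Failing fast like this would match the "Len error" the ZooKeeper server logs above, but with the application id attached instead of a connection loss.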
[jira] [Assigned] (YARN-2368) ResourceManager failed when ZKRMStateStore tries to update znode data larger than 1MB
[ https://issues.apache.org/jira/browse/YARN-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi reassigned YARN-2368: --- Assignee: zhuqi > ResourceManager failed when ZKRMStateStore tries to update znode data larger > than 1MB > - > > Key: YARN-2368 > URL: https://issues.apache.org/jira/browse/YARN-2368 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Leitao Guo >Assignee: zhuqi >Priority: Critical > Attachments: YARN-2368.patch > > > Both ResouceManagers throw out STATE_STORE_OP_FAILED events and failed > finally. ZooKeeper log shows that ZKRMStateStore tries to update a znode > larger than 1MB, which is the default configuration of ZooKeeper server and > client in 'jute.maxbuffer'. > ResourceManager (ip addr: 10.153.80.8) log shows as the following: > {code} > 2014-07-25 22:33:11,078 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session connected > 2014-07-25 22:33:11,078 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > ZKRMStateStore Session restored > 2014-07-25 22:33:11,214 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type > STATE_STORE_OP_FAILED. 
Cause: > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for > /rmstore/ZKRMStateRoot/RMAppRoot/application_1406264354826_1645/appattempt_1406264354826_1645_01 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:99) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1069) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:926) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$8.run(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:923) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:620) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:675) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > {code} > Meanwhile, ZooKeeps log shows as the following: > {code} > 2014-07-25 22:10:09,728 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - > Accepted 
socket connection from /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@832] - Client > attempting to renew session 0x247684586e70006 at /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@107] - Revalidating > client: 0x247684586e70006 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@595] - Established session > 0x247684586e70006 with negotiated timeout 1 for client /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@863] - got auth > packet /10.153.80.8:58890 > 2014-07-25 22:10:09,730 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@897] - auth > success /10.153.80.8:58890 > 2014-07-25 22:10:09,742 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception > causing close of session 0x247684586e70006 due to java.io.IOException: Len > error 1530 > 747 > 2014-07-25 22:10:09,743 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed > socket connection for client /10.153.80.8:58890 which
[jira] [Commented] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923146#comment-16923146 ] zhuqi commented on YARN-8995: - Hi [~Tao Yang] I have improved it now. Thanks a lot. > Log events info in AsyncDispatcher when event queue size cumulatively reaches > a certain number every time. > -- > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, > image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
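The idea in YARN-8995 is to log a summary of queued event types each time the AsyncDispatcher queue size crosses another multiple of a configurable threshold, rather than logging on every event. A simplified, self-contained sketch of that counting logic (names are illustrative; this is not the actual patch):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of threshold-based queue logging: each time the queue
// grows past another multiple of `threshold`, emit one summary line with
// the per-event-type counts, instead of logging on every enqueue.
class QueueSizeLogger {
  private final int threshold;
  private int lastLoggedMultiple = 0;
  private final Map<String, Integer> countsByType = new ConcurrentHashMap<>();

  QueueSizeLogger(int threshold) { this.threshold = threshold; }

  // Record one queued event; returns true when this call triggers a log line.
  boolean onEnqueue(String eventType, int queueSize) {
    countsByType.merge(eventType, 1, Integer::sum);
    int multiple = queueSize / threshold;
    if (multiple > lastLoggedMultiple) {
      lastLoggedMultiple = multiple;
      System.err.println("Event queue size " + queueSize
          + "; counts by type: " + countsByType);
      return true;
    }
    return false;
  }
}
```

Keyed on multiples of the threshold, the logging cost stays bounded even when the queue backs up by millions of events.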
[jira] [Updated] (YARN-8995) Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8995: Summary: Log events info in AsyncDispatcher when event queue size cumulatively reaches a certain number every time. (was: Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics. ) > Log events info in AsyncDispatcher when event queue size cumulatively reaches > a certain number every time. > -- > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, YARN-8995.015.patch, YARN-8995.016.patch, > image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1699#comment-1699 ] zhuqi commented on YARN-8995: - Hi [~Tao Yang]. !image-2019-09-04-15-20-02-914.png! The screenshot shows the metric after my change: it is no longer reported in thousands, but I forgot to include that change in the last two patches. Sorry for my mistake. Thanks. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png > > > In our growing cluster, there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parameter which can be > changed.
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8995: Attachment: image-2019-09-04-15-20-02-914.png > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch, image-2019-09-04-15-20-02-914.png > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921977#comment-16921977 ] zhuqi commented on YARN-8995: - Hi [~Tao Yang] Now i have fixed the checkstyle. Thanks. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch, > YARN-8995.014.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8995: Attachment: (was: YARN-8995.013.patch) > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch, YARN-8995.013.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920268#comment-16920268 ] zhuqi commented on YARN-8995: - Hi [~Tao Yang] The fixed patch is now available. Thank you very much for your patience. > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch, YARN-8995.010.patch, > YARN-8995.011.patch, YARN-8995.012.patch > > > In our growing cluster, there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parameter which can be > changed.
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8995: Attachment: (was: YARN-8995.010.patch) > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.2.0 > > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8995: Attachment: (was: YARN-8995.009.patch) > Log the event type of the too big AsyncDispatcher event queue size, and add > the information to the metrics. > > > Key: YARN-8995 > URL: https://issues.apache.org/jira/browse/YARN-8995 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics, nodemanager, resourcemanager >Affects Versions: 3.2.0, 3.3.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Fix For: 3.2.0 > > Attachments: TestStreamPerf.java, YARN-8995.001.patch, > YARN-8995.002.patch, YARN-8995.003.patch, YARN-8995.004.patch, > YARN-8995.005.patch, YARN-8995.006.patch, YARN-8995.007.patch, > YARN-8995.008.patch, YARN-8995.009.patch > > > In our growing cluster,there are unexpected situations that cause some event > queues to block the performance of the cluster, such as the bug of > https://issues.apache.org/jira/browse/YARN-5262 . I think it's necessary to > log the event type of the too big event queue size, and add the information > to the metrics, and the threshold of queue size is a parametor which can be > changed. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-8995: Attachment: (was: YARN-8995.010.patch)
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915282#comment-16915282 ] zhuqi commented on YARN-8995: - Hi [~cheersyang] / [~Tao Yang], I have submitted the new patch. Please share any further advice before it is merged. Thanks.
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912381#comment-16912381 ] zhuqi commented on YARN-8995: - Hi [~cheersyang], thanks for your review. I use an in-thousands value here because I want to require users to set the threshold in thousands, so that it matches the existing queue-size log output when they don't use the default of 5000. The new description looks good to me; I will update mine. Thanks.
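For context, the threshold discussed in the comment above would be set in yarn-site.xml. The property name below is a placeholder for illustration only; the actual name is whatever the patch under review finally settles on.

```xml
<!-- Hypothetical property name, for illustration only; the real name is
     decided in the YARN-8995 review. Value is in absolute events and,
     per the discussion above, expected to be a multiple of 1000. -->
<property>
  <name>yarn.dispatcher.event-queue-size.log-threshold</name>
  <value>5000</value>
</property>
```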
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910496#comment-16910496 ] zhuqi commented on YARN-8995: - Hi [~Tao Yang], thanks a lot. I am looking forward to contributing more.
[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed
[ https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870034#comment-16870034 ] zhuqi commented on YARN-9634: - Hi, [~Tao Yang], thanks for the comment. I have updated my description. As [~cheersyang] noted, my focus is whether we can distribute these dirs among multiple federation namespaces to get better NameNode performance in our HDFS federation. This will affect some other logic, though, so the details need to be considered. > Make yarn submit dir and log aggregation dir more evenly distributed > > > Key: YARN-9634 > URL: https://issues.apache.org/jira/browse/YARN-9634 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > > When the cluster is large, the job submission dir, the container > log-aggregation dir, and other information will fill the HDFS directory, > since HDFS directories have a default storage limit; this can be mitigated > via "yarn.log-aggregation.retain-seconds". But the FSNamesystemLock#writeLock > and RPC operations triggered by these dir operations affect the namespace > where the dirs are located. To improve this we have placed these dirs in a > single HDFS federation namespace, but as the cluster grows, that single > namespace also limits RPC performance. To address this, we can distribute > these dirs across multiple namespaces, with a selectable policy such as a > hash policy or a round-robin policy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed
[ https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated YARN-9634: Description: When the cluster is large, the job submission dir, the container log-aggregation dir, and other information will fill the HDFS directory, since HDFS directories have a default storage limit; this can be mitigated via "yarn.log-aggregation.retain-seconds". But the FSNamesystemLock#writeLock and RPC operations triggered by these dir operations affect the namespace where the dirs are located. To improve this we have placed these dirs in a single HDFS federation namespace, but as the cluster grows, that single namespace also limits RPC performance. To address this, we can distribute these dirs across multiple namespaces, with a selectable policy such as a hash policy or a round-robin policy. (was: When the cluster size is large, the dir which user submits the job, and the dir which container log aggregate, and other information will fill the HDFS directory, because the HDFS directory has a default storage limit. In response to this situation, we can change these dirs more distributed, with some policy to choose, such as hash policy and round robin policy.)
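The two selection policies named in the description above can be sketched as follows. This is a hypothetical illustration of the idea only, not code from any YARN-9634 patch; the class and method names, and the namespace URIs, are invented for the sketch.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed policies: choose one of several HDFS
// federation namespaces for a job's staging or log-aggregation dir, either
// by hashing the application id (stable mapping per application) or by
// round-robin (even spread across namespaces).
public class NamespacePicker {
    private final List<String> namespaces;
    private final AtomicLong counter = new AtomicLong();

    public NamespacePicker(List<String> namespaces) {
        if (namespaces.isEmpty()) {
            throw new IllegalArgumentException("need at least one namespace");
        }
        this.namespaces = namespaces;
    }

    // Hash policy: the same application id always maps to the same namespace,
    // so all of a job's dirs stay together.
    public String pickByHash(String applicationId) {
        int idx = Math.floorMod(applicationId.hashCode(), namespaces.size());
        return namespaces.get(idx);
    }

    // Round-robin policy: successive picks cycle through the namespaces,
    // spreading write/RPC load evenly over time.
    public String pickRoundRobin() {
        int idx = (int) (counter.getAndIncrement() % namespaces.size());
        return namespaces.get(idx);
    }
}
```

The hash policy keeps lookups stateless (any component can recompute where a job's dirs live), while round-robin needs the chosen namespace to be recorded somewhere, which is part of the "other logic" the comments say must be considered.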
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870027#comment-16870027 ] zhuqi commented on YARN-8995: - The tests now pass without problems. cc [~Tao Yang].
[jira] [Commented] (YARN-8995) Log the event type of the too big AsyncDispatcher event queue size, and add the information to the metrics.
[ https://issues.apache.org/jira/browse/YARN-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869594#comment-16869594 ] zhuqi commented on YARN-8995: - Hi, [~Tao Yang], I have submitted the new patch and fixed the checkstyle warnings. Thanks.
[jira] [Commented] (YARN-9634) Make yarn submit dir and log aggregation dir more evenly distributed
[ https://issues.apache.org/jira/browse/YARN-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869151#comment-16869151 ] zhuqi commented on YARN-9634: - Hi [~cheersyang], yes, I mean the space quota. Even in the normal situation without quotas, in our large cluster, binding the submit dir and log-aggregation dir to a fixed namespace affects that namespace's RPC performance. I think we can add configuration with a round-robin or hash policy to distribute these dirs across the configured namespaces in the HDFS federation.