[jira] [Commented] (YARN-8709) intra-queue preemption checker always fail since one under-served queue was deleted
[ https://issues.apache.org/jira/browse/YARN-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597101#comment-16597101 ] Weiwei Yang commented on YARN-8709: --- LGTM, changing to PA > intra-queue preemption checker always fail since one under-served queue was > deleted > --- > > Key: YARN-8709 > URL: https://issues.apache.org/jira/browse/YARN-8709 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler preemption >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8709.001.patch > > > After some queues were deleted, the preemption checker in SchedulingMonitor was > always skipped because of a YarnRuntimeException on every run. > Error logs: > {noformat} > ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: > Exception raised while executing preemption checker, skip this run..., > exception= > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't > happen, cannot find TempQueuePerPartition for queueName=1535075839208 > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {noformat} > I think there is something wrong with the partitionToUnderServedQueues field in > ProportionalCapacityPreemptionPolicy. Items of partitionToUnderServedQueues > can be added but never removed, except by rebuilding this policy. For example, > once under-served queue "a" is added into this structure, it will always be > there and never be removed; the intra-queue preemption checker will try to get > info for every queue in partitionToUnderServedQueues in > IntraQueueCandidatesSelector#selectCandidates and will throw a > YarnRuntimeException if one is not found. So after queue "a" is deleted from the > queue structure, the preemption checker will always fail.
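To illustrate the kind of cleanup described above, here is a hypothetical, self-contained sketch (this is not the attached YARN-8709.001.patch; all class and method names below are invented for illustration): stale under-served queue names would be pruned once their queues no longer exist in the scheduler, instead of letting a single stale entry fail every checker run.
{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: drop under-served queue names whose queues were
// deleted, so a single stale entry cannot fail every subsequent checker run.
public class UnderServedQueuePruning {

  static void pruneDeletedQueues(Map<String, Set<String>> partitionToUnderServedQueues,
      Set<String> existingQueueNames) {
    for (Set<String> queues : partitionToUnderServedQueues.values()) {
      // Remove entries for queues that no longer exist in the queue structure.
      queues.removeIf(q -> !existingQueueNames.contains(q));
    }
  }

  public static void main(String[] args) {
    Map<String, Set<String>> underServed = new HashMap<>();
    underServed.put("", new HashSet<>(Arrays.asList("a", "b")));
    // Queue "a" has been deleted; only "b" still exists.
    pruneDeletedQueues(underServed, new HashSet<>(Arrays.asList("b")));
    System.out.println(underServed); // prints {=[b]}
  }
}
{code}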
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute
[ https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597083#comment-16597083 ] Weiwei Yang edited comment on YARN-8102 at 8/30/18 5:54 AM: I agree with [~Naganarasimha] about simplifying the configs, can we use {noformat} file://{hadoop.tmp.dir}/yarn/system/node-attributes file://{hadoop.tmp.dir}/yarn/system/node-labels {noformat} as the default dir? But if we change the default path for node-labels, that will be an incompatible change right? Maybe at least we can have #2 done? was (Author: cheersyang): I agree with [~Naganarasimha] about simplifying the configs, can we use {noformat} [file://$|file://%24/] \{hadoop.tmp.dir}/yarn/system/node-attributes [file://$|file://%24/] \{hadoop.tmp.dir}/yarn/system/node-labels {noformat} as the default dir? But if we change the default path for node-labels, that will be an incompatible change right? Maybe at least we can have #2 done? > Retrospect on having enable and disable flag for Node Attribute > --- > > Key: YARN-8102 > URL: https://issues.apache.org/jira/browse/YARN-8102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > > Currently node attribute feature is by default enabled. We have to revisit on > the same. > Enabling by default means will try to create store for all cluster > installation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute
[ https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597083#comment-16597083 ] Weiwei Yang commented on YARN-8102: --- I agree with [~Naganarasimha] about simplifying the configs, can we use * file://${hadoop.tmp.dir}/yarn/system/node-attributes * file://${hadoop.tmp.dir}/yarn/system/node-labels as the default dir? But if we change the default path for node-labels, that will be an incompatible change right? Maybe at least we can have #2 done? > Retrospect on having enable and disable flag for Node Attribute > --- > > Key: YARN-8102 > URL: https://issues.apache.org/jira/browse/YARN-8102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > > Currently node attribute feature is by default enabled. We have to revisit on > the same. > Enabling by default means will try to create store for all cluster > installation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
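To make the suggested defaults concrete, here is a rough illustrative sketch of deriving them from hadoop.tmp.dir (illustration only; the actual YARN store-directory property keys are intentionally not named here, to avoid guessing them):
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;

public class DefaultStoreDirs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // hadoop.tmp.dir defaults to /tmp/hadoop-${user.name} in core-default.xml.
    String tmpDir = conf.get("hadoop.tmp.dir",
        "/tmp/hadoop-" + System.getProperty("user.name"));
    // Proposed defaults, derived from hadoop.tmp.dir rather than a new config knob.
    URI nodeAttributesDir = URI.create("file://" + tmpDir + "/yarn/system/node-attributes");
    URI nodeLabelsDir = URI.create("file://" + tmpDir + "/yarn/system/node-labels");
    System.out.println(nodeAttributesDir);
    System.out.println(nodeLabelsDir);
  }
}
{code}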
[jira] [Updated] (YARN-8666) [UI2] Remove application tab from Yarn Queue Page
[ https://issues.apache.org/jira/browse/YARN-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akhil PB updated YARN-8666: --- Summary: [UI2] Remove application tab from Yarn Queue Page (was: Remove application tab from Yarn Queue Page) > [UI2] Remove application tab from Yarn Queue Page > - > > Key: YARN-8666 > URL: https://issues.apache.org/jira/browse/YARN-8666 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Yesha Vora >Priority: Major > Attachments: Screen Shot 2018-08-14 at 3.43.18 PM.png > > > The Yarn UI2 Queue page shows an Application button. This button does not redirect to > any other page. In addition, the running-applications table is already > available on the same page. > Thus, there is no need to have an application button on the Queue page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8112) Fix min cardinality check for same source and target tags in intra-app constraints
[ https://issues.apache.org/jira/browse/YARN-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-8112: -- Parent Issue: YARN-8731 (was: YARN-7812) > Fix min cardinality check for same source and target tags in intra-app > constraints > -- > > Key: YARN-8112 > URL: https://issues.apache.org/jira/browse/YARN-8112 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Weiwei Yang >Assignee: Konstantinos Karanasos >Priority: Major > > The min cardinality constraint (min cardinality = _k_) ensures that a > container is placed on a node that already has k occurrences of the target > tag. For example, a constraint _zk=3,CARDINALITY,NODE,hb,2,10_ will place > each of the three zk containers on a node with at least 2 hb instances (and > no more than 10 for the max cardinality). > Affinity constraints are a special case of this, where min cardinality is 1. > Currently we do not support min cardinality when the source and the target of > the constraint are the same in an intra-app constraint. > Therefore, zk=3,CARDINALITY,NODE,zk,2,10 is not supported, and neither is > zk=3,IN,NODE,zk. > This Jira will address this problem by placing the first k containers on the > same node (or any other specified scope, e.g., rack), so that min cardinality > can be met when placing the subsequent containers with the same tag. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
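For readers less familiar with the shorthand above, here is a sketch of how such a cardinality constraint is expressed with the PlacementConstraints builder DSL (written from memory; the exact method names should be checked against the release in use):
{code:java}
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.build;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

public class MinCardinalityExample {
  public static void main(String[] args) {
    // zk=3,CARDINALITY,NODE,hb,2,10: each zk container goes on a node that
    // already has at least 2 and at most 10 containers tagged "hb".
    PlacementConstraint zkOnHbNodes = build(cardinality(NODE, 2, 10, "hb"));
    System.out.println(zkOnHbNodes);
  }
}
{code}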
[jira] [Resolved] (YARN-7858) Support special Node Attribute scopes in addition to NODE and RACK
[ https://issues.apache.org/jira/browse/YARN-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang resolved YARN-7858. --- Resolution: Duplicate > Support special Node Attribute scopes in addition to NODE and RACK > -- > > Key: YARN-7858 > URL: https://issues.apache.org/jira/browse/YARN-7858 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun Suresh >Assignee: Weiwei Yang >Priority: Major > > Currently, we have only two scopes defined: NODE and RACK against which we > check the cardinality of the placement. > This idea should be extended to support node-attribute scopes. For eg: > Placement of containers across *upgrade domains* and *failure domains*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8555) Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC handler options
[ https://issues.apache.org/jira/browse/YARN-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-8555: -- Parent Issue: YARN-8731 (was: YARN-7812) > Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC > handler options > -- > > Key: YARN-8555 > URL: https://issues.apache.org/jira/browse/YARN-8555 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Weiwei Yang >Priority: Minor > > Current test cases in these 2 classes only target 1 handler type, > {{scheduler}} or {{processor}}. Once YARN-8015 is done, we should modify > them to be parameterized in order to cover both cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
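A shape-only sketch of the suggested parameterization (a generic JUnit 4 skeleton, not the actual test classes; the configuration key mentioned in the comment is an assumption):
{code:java}
import java.util.Arrays;
import java.util.Collection;

import org.junit.Assert;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

// Run every test case once per placement-constraint handler type.
@RunWith(Parameterized.class)
public class TestSchedulingRequestContainerAllocationSketch {

  @Parameters(name = "handler={0}")
  public static Collection<Object[]> handlers() {
    return Arrays.asList(new Object[][] {{"scheduler"}, {"processor"}});
  }

  private final String handler;

  public TestSchedulingRequestContainerAllocationSketch(String handler) {
    this.handler = handler;
  }

  @Test
  public void testAllocationWithConstraints() {
    // The real tests would start an RM configured with the chosen handler here,
    // e.g. via a property such as yarn.resourcemanager.placement-constraints.handler.
    Assert.assertNotNull(handler);
  }
}
{code}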
[jira] [Updated] (YARN-8555) Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC handler options
[ https://issues.apache.org/jira/browse/YARN-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-8555: -- Labels: (was: newbie) > Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC > handler options > -- > > Key: YARN-8555 > URL: https://issues.apache.org/jira/browse/YARN-8555 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Weiwei Yang >Priority: Minor > > Current test cases in these 2 classes only target 1 handler type, > {{scheduler}} or {{processor}}. Once YARN-8015 is done, we should modify > them to be parameterized in order to cover both cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6621) Validate Placement Constraints
[ https://issues.apache.org/jira/browse/YARN-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-6621: -- Parent Issue: YARN-8731 (was: YARN-7812) > Validate Placement Constraints > -- > > Key: YARN-6621 > URL: https://issues.apache.org/jira/browse/YARN-6621 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Konstantinos Karanasos >Assignee: Konstantinos Karanasos >Priority: Major > > This library will be used to validate placement constraints. > It can serve multiple validation purposes: > 1) Check if the placement constraint has a valid form (e.g., a cardinality > constraint should not have an associated target expression, a DELAYED_OR > compound expression should only appear in specific places in a constraint > tree, etc.) > 2) Check if the constraints given by a user are conflicting (e.g., > cardinality more than 5 in a host and less than 3 in a rack). > 3) Check that the constraints are properly added in the Placement Constraint > Manager. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
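A toy illustration of conflict check 2) above (illustrative only): a node-scope minimum cardinality that already exceeds a rack-scope maximum can never be satisfied, because a node's allocations also count toward its rack.
{code:java}
public class CardinalityConflictCheck {
  // "At least minPerNode on a host" contradicts "at most maxPerRack on a rack"
  // whenever minPerNode > maxPerRack, since a host's allocations are a subset
  // of its rack's allocations.
  static boolean conflicting(int minPerNode, int maxPerRack) {
    return minPerNode > maxPerRack;
  }

  public static void main(String[] args) {
    // The example from the description: more than 5 per host (min 6)
    // versus less than 3 per rack (max 2).
    System.out.println(conflicting(6, 2)); // true
  }
}
{code}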
[jira] [Updated] (YARN-7800) Bind node constraint once a container is proposed to be placed on this node
[ https://issues.apache.org/jira/browse/YARN-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7800: -- Parent Issue: YARN-8731 (was: YARN-7812) > Bind node constraint once a container is proposed to be placed on this node > --- > > Key: YARN-7800 > URL: https://issues.apache.org/jira/browse/YARN-7800 > Project: Hadoop YARN > Issue Type: Sub-task > Components: RM >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Attachments: bind_node_constraint.pdf > > > We found that when there is a circular dependency between multiple scheduling > requests, allocation decisions made by the placement constraint algorithm might > conflict with related tags. For a more detailed description of the issue, please refer > to YARN-7783. > To solve this issue, a possible solution is to bind a *node constraint*. If the > algorithm wants to place any new container on this node, besides checking > if it satisfies the placement constraint, it also checks if it satisfies the > node constraint. For example > 1) "foo", anti-affinity with "foo" > +Implies node constraint:+ on each node, it cannot have more than 1 foo tag > 2) "bar", anti-affinity with "foo" > +Implies node constraint:+ on each node, it cannot have both "bar" and "foo" > tags > With such constraints, it works like > * req2 is placed on any node, e.g. n2, +a node constraint [1] is added to > n2 that says this node cannot have both "bar" and "foo" tags+ > * when the algorithm wants to place req1 on n2, it checks if its placement > constraint is satisfied. It should be, as there is no foo container on this > node yet. > * Then the algorithm checks if the node constraint is satisfied. It is not, > because it violates node constraint [1]. > This avoids doing additional re-attempts like what was done in YARN-7783. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
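A self-contained sketch of the bookkeeping described above (all types and names are invented for illustration; the actual proposal is in the attached bind_node_constraint.pdf): once req2 ("bar", anti-affine with "foo") is proposed on n2, the node remembers that "foo" may no longer be placed there.
{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration: per-node "forbidden tag" constraints recorded when a placement
// is proposed, and consulted before any later placement on the same node.
public class NodeConstraintBinding {
  private final Map<String, Set<String>> forbiddenTagsPerNode = new HashMap<>();

  // Bind the node constraint implied by placing a container that is
  // anti-affine with 'antiAffineTag' on 'node'.
  void bind(String node, String antiAffineTag) {
    forbiddenTagsPerNode.computeIfAbsent(node, n -> new HashSet<>()).add(antiAffineTag);
  }

  // Check the bound node constraints before proposing a container with 'tag'.
  boolean satisfiesNodeConstraint(String node, String tag) {
    return !forbiddenTagsPerNode.getOrDefault(node, Collections.emptySet()).contains(tag);
  }

  public static void main(String[] args) {
    NodeConstraintBinding binder = new NodeConstraintBinding();
    binder.bind("n2", "foo"); // req2 ("bar", anti-affine with "foo") proposed on n2
    // req1 ("foo") now violates the bound node constraint on n2 and is rejected.
    System.out.println(binder.satisfiesNodeConstraint("n2", "foo")); // false
  }
}
{code}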
[jira] [Updated] (YARN-7752) Handle AllocationTags for Opportunistic containers.
[ https://issues.apache.org/jira/browse/YARN-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7752: -- Parent Issue: YARN-8731 (was: YARN-7812) > Handle AllocationTags for Opportunistic containers. > --- > > Key: YARN-7752 > URL: https://issues.apache.org/jira/browse/YARN-7752 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun Suresh >Priority: Major > > JIRA to track how opportunistic containers are handled w.r.t > AllocationTagsManager creation and removal of tags. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7819) Allow PlacementProcessor to be used with the FairScheduler
[ https://issues.apache.org/jira/browse/YARN-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7819: -- Parent Issue: YARN-8731 (was: YARN-7812) > Allow PlacementProcessor to be used with the FairScheduler > -- > > Key: YARN-7819 > URL: https://issues.apache.org/jira/browse/YARN-7819 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun Suresh >Assignee: Arun Suresh >Priority: Major > Attachments: YARN-7819-YARN-6592.001.patch, > YARN-7819-YARN-7812.001.patch, YARN-7819.002.patch, YARN-7819.003.patch, > YARN-7819.004.patch > > > The FairScheduler needs to implement the > {{ResourceScheduler#attemptAllocationOnNode}} function for the processor to > support the FairScheduler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7746) Fix PlacementProcessor to support app priority
[ https://issues.apache.org/jira/browse/YARN-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7746: -- Parent Issue: YARN-8731 (was: YARN-7812) > Fix PlacementProcessor to support app priority > -- > > Key: YARN-7746 > URL: https://issues.apache.org/jira/browse/YARN-7746 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Arun Suresh >Assignee: Arun Suresh >Priority: Major > Attachments: YARN-7746.001.patch > > > The Threadpools used in the Processor should be modified to take a priority > blocking queue that respects application priority. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7812) Improvements to Rich Placement Constraints in YARN
[ https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7812: -- Issue Type: New Feature (was: Improvement) > Improvements to Rich Placement Constraints in YARN > -- > > Key: YARN-7812 > URL: https://issues.apache.org/jira/browse/YARN-7812 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Arun Suresh >Priority: Major > Fix For: 3.2.0 > > > This umbrella tracks the efforts for supporting following features > # Inter-app placement constraints > # Composite placement constraints, such as AND/OR expressions > # Support placement constraints in Capacity Scheduler -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7812) Improvements to Rich Placement Constraints in YARN
[ https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597066#comment-16597066 ] Weiwei Yang edited comment on YARN-7812 at 8/30/18 5:10 AM: Thanks [~asuresh], [~kkaranasos], [~leftnoteasy], [~sunilg] for completing this feature. Since the main functionality is done, I think we can close this umbrella and set the fixed version to 3.2.0. There are some remaining enhancements; let's use YARN-8731 to track them. Thanks! was (Author: cheersyang): Thanks [~asuresh], [~kkaranasos], [~leftnoteasy] for completing this feature. Since the main functionality is done, I think we can close this umbrella and set the fixed version to 3.2.0. There are some remaining enhancements; let's use YARN-8731 to track them. Thanks! > Improvements to Rich Placement Constraints in YARN > -- > > Key: YARN-7812 > URL: https://issues.apache.org/jira/browse/YARN-7812 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Arun Suresh >Priority: Major > Fix For: 3.2.0 > > > This umbrella tracks the efforts for supporting following features > # Inter-app placement constraints > # Composite placement constraints, such as AND/OR expressions > # Support placement constraints in Capacity Scheduler -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-7812) Improvements to Rich Placement Constraints in YARN
[ https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang resolved YARN-7812. --- Resolution: Fixed Fix Version/s: 3.2.0 Thanks [~asuresh], [~kkaranasos], [~leftnoteasy] for completing this feature. Since the main functionality is done, I think we can close this umbrella and set the fixed version to 3.2.0. There are some remaining enhancements; let's use YARN-8731 to track them. Thanks! > Improvements to Rich Placement Constraints in YARN > -- > > Key: YARN-7812 > URL: https://issues.apache.org/jira/browse/YARN-7812 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Arun Suresh >Priority: Major > Fix For: 3.2.0 > > > This umbrella tracks the efforts for supporting following features > # Inter-app placement constraints > # Composite placement constraints, such as AND/OR expressions > # Support placement constraints in Capacity Scheduler -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7812) Improvements to Rich Placement Constraints in YARN
[ https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-7812: -- Description: This umbrella tracks the efforts for supporting following features # Inter-app placement constraints # Composite placement constraints, such as AND/OR expressions # Support placement constraints in Capacity Scheduler > Improvements to Rich Placement Constraints in YARN > -- > > Key: YARN-7812 > URL: https://issues.apache.org/jira/browse/YARN-7812 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Arun Suresh >Priority: Major > > This umbrella tracks the efforts for supporting following features > # Inter-app placement constraints > # Composite placement constraints, such as AND/OR expressions > # Support placement constraints in Capacity Scheduler -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8731) Rich Placement constraints optimization and enhancements
Weiwei Yang created YARN-8731: - Summary: Rich Placement constraints optimization and enhancements Key: YARN-8731 URL: https://issues.apache.org/jira/browse/YARN-8731 Project: Hadoop YARN Issue Type: Improvement Reporter: Weiwei Yang We have supported the main functionality of rich placement constraints in v3.2.0; this umbrella is opened to track the remaining placement constraints optimization and enhancement items. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7865) Node attributes documentation
[ https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597060#comment-16597060 ] Weiwei Yang commented on YARN-7865: --- Hi [~Naganarasimha] Thanks for working on the documentation, it looks really good. Having just gone over the doc, I have the following comments/suggestions: (1) *Distributed node-to-Attributes mapping* There are two rows describing {{yarn.nodemanager.node-attributes.provider.fetch-timeout-ms}}; the first one should be replaced by {{yarn.nodemanager.node-attributes.provider.fetch-interval-ms}}, correct? The description seems to be describing the interval, not the timeout. (2) *Specifying node attributes for application* It would be nice if we added some explanation right after the java code sample. Something like: {noformat} The above SchedulingRequest requests 1 container on nodes that must satisfy the following constraints: 1) Node attribute rm.yarn.io/python doesn't exist on the node, or it exists but its value is not equal to 3 2) Node attribute rm.yarn.io/java must exist on the node and its value is equal to 1.8 {noformat} BTW, can we rename the variable {{schedulingRequest1}} to just {{schedulingRequest}} in this example? (3) It would also be good if we mention that node attribute constraints are HARD limits, something like {noformat} Node attribute constraints are hard limits, that is, the allocation can only be made if the node satisfies the node attribute constraint. In other words, the request keeps pending until it finds a valid node satisfying the constraint. There is no relax policy at present. {noformat} This can be added to the features section I guess. Feel free to rephrase if you like. Thanks > Node attributes documentation > - > > Key: YARN-7865 > URL: https://issues.apache.org/jira/browse/YARN-7865 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Weiwei Yang >Assignee: Naganarasimha G R >Priority: Major > Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, > YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch, > YARN-7865-YARN-3409.004.patch > > > We need proper docs to introduce how to enable node-attributes how to > configure providers, how to specify script paths, arguments in configuration, > what should be the proper permission of the script and who will run the > script. Also it would be good to add more info to the description of the > configuration properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
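For context, the java code sample being discussed is roughly of the following shape (a paraphrased sketch written from memory, not the exact snippet in the patch or the docs; builder and helper method names may differ slightly in the actual API):
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

public class NodeAttributeRequestSketch {
  public static SchedulingRequest buildRequest() {
    // One container on a node where attribute "python" is absent or != 3,
    // and attribute "java" is present with value 1.8.
    return SchedulingRequest.newBuilder()
        .allocationRequestId(10L)
        .priority(Priority.newInstance(1))
        .placementConstraintExpression(
            PlacementConstraints.and(
                PlacementConstraints.targetNotIn(PlacementConstraints.NODE,
                    PlacementConstraints.PlacementTargets.nodeAttribute("python", "3")),
                PlacementConstraints.targetIn(PlacementConstraints.NODE,
                    PlacementConstraints.PlacementTargets.nodeAttribute("java", "1.8")))
                .build())
        .resourceSizing(ResourceSizing.newInstance(1, Resource.newInstance(1024, 1)))
        .build();
  }
}
{code}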
[jira] [Commented] (YARN-4879) Enhance Allocate Protocol to Identify Requests Explicitly
[ https://issues.apache.org/jira/browse/YARN-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597011#comment-16597011 ] qiuliang commented on YARN-4879: Thanks for putting the doc together with all the details. I have a question: why is the number for Rack2 in Req2 3? Node1 and Node4 are on Rack1 and Node5 is on Rack2, so the number for Rack1 is 4 and the number for Rack2 is 2. Where am I wrong? > Enhance Allocate Protocol to Identify Requests Explicitly > - > > Key: YARN-4879 > URL: https://issues.apache.org/jira/browse/YARN-4879 > Project: Hadoop YARN > Issue Type: Improvement > Components: applications, resourcemanager >Reporter: Subru Krishnan >Assignee: Subru Krishnan >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: SimpleAllocateProtocolProposal-v1.pdf, > SimpleAllocateProtocolProposal-v2.pdf > > > For legacy reasons, the current allocate protocol expects expanded requests > which represent the cumulative request for any change in resource > constraints. This is not only very difficult to comprehend but makes it > impossible for the scheduler to associate container allocations to the > original requests. This problem is amplified by the fact that the expansion > is managed by the AMRMClient which makes it cumbersome for non-Java clients > as they all have to replicate the non-trivial logic. In this JIRA, we are > proposing enhancement to the Allocate Protocol to allow AMs to identify > requests explicitly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
[ https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596999#comment-16596999 ] Sunil Govindan commented on YARN-8680: -- Thanks [~pradeepambati]. I'll also try to help you review this. cc [~cheersyang], could you also please take a look? > YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate > - > > Key: YARN-8680 > URL: https://issues.apache.org/jira/browse/YARN-8680 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Pradeep Ambati >Assignee: Pradeep Ambati >Priority: Critical > Attachments: YARN-8680.00.patch, YARN-8680.01.patch > > > Similar to YARN-8242, implement an iterable abstraction for > LocalResourceTrackerState to load completed and in-progress resources when > needed rather than loading them all at once for a respective state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute
[ https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596967#comment-16596967 ] Naganarasimha G R commented on YARN-8102: - Hi [~bibinchundatt] & [~sunilg], I think we have had sufficient discussion on this. IMHO it's not required; it can be safely configured to the local host's tmp, which most clusters will have access to, or else we can choose a path relative to *_"hadoop.tmp.dir"_*, which is what was also used for the ATSv1 timeline store. I would prefer to avoid introducing any kind of new configuration here. If we agree, then we can go ahead, do the required changes above, and close this jira. cc [~cheersyang] > Retrospect on having enable and disable flag for Node Attribute > --- > > Key: YARN-8102 > URL: https://issues.apache.org/jira/browse/YARN-8102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Major > > Currently node attribute feature is by default enabled. We have to revisit on > the same. > Enabling by default means will try to create store for all cluster > installation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7865) Node attributes documentation
[ https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596963#comment-16596963 ] Naganarasimha G R commented on YARN-7865: - Thanks [~sunilg], for the review, Please find the attached patch addressing your comments. > Node attributes documentation > - > > Key: YARN-7865 > URL: https://issues.apache.org/jira/browse/YARN-7865 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Weiwei Yang >Assignee: Naganarasimha G R >Priority: Major > Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, > YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch, > YARN-7865-YARN-3409.004.patch > > > We need proper docs to introduce how to enable node-attributes how to > configure providers, how to specify script paths, arguments in configuration, > what should be the proper permission of the script and who will run the > script. Also it would be good to add more info to the description of the > configuration properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7865) Node attributes documentation
[ https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-7865: Attachment: YARN-7865-YARN-3409.004.patch > Node attributes documentation > - > > Key: YARN-7865 > URL: https://issues.apache.org/jira/browse/YARN-7865 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Weiwei Yang >Assignee: Naganarasimha G R >Priority: Major > Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, > YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch, > YARN-7865-YARN-3409.004.patch > > > We need proper docs to introduce how to enable node-attributes how to > configure providers, how to specify script paths, arguments in configuration, > what should be the proper permission of the script and who will run the > script. Also it would be good to add more info to the description of the > configuration properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7865) Node attributes documentation
[ https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-7865: Attachment: (was: NodeAttributes.html) > Node attributes documentation > - > > Key: YARN-7865 > URL: https://issues.apache.org/jira/browse/YARN-7865 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Weiwei Yang >Assignee: Naganarasimha G R >Priority: Major > Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, > YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch > > > We need proper docs to introduce how to enable node-attributes how to > configure providers, how to specify script paths, arguments in configuration, > what should be the proper permission of the script and who will run the > script. Also it would be good to add more info to the description of the > configuration properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7865) Node attributes documentation
[ https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-7865: Attachment: NodeAttributes.html > Node attributes documentation > - > > Key: YARN-7865 > URL: https://issues.apache.org/jira/browse/YARN-7865 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Weiwei Yang >Assignee: Naganarasimha G R >Priority: Major > Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, > YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch > > > We need proper docs to introduce how to enable node-attributes how to > configure providers, how to specify script paths, arguments in configuration, > what should be the proper permission of the script and who will run the > script. Also it would be good to add more info to the description of the > configuration properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596926#comment-16596926 ] Wangda Tan commented on YARN-8468: -- [~bsteinbach], Thanks, I think it makes sense to normalize/validate the max allocation. A few things: 1) Could you add basic tests to CS to make sure it works? 2) Inside validateIncreaseDecreaseRequest, it gets the queue's maximumAllocation twice, let's try to avoid this if possible. 3) And similarly, {{normalizeAndvalidateRequest}} calls the queue's getMaximumAllocation again and again. Let's try to call getMaximumAllocation once for every {{allocate}} call. And I'd request another set of eyes to check the details of the patch. cc: [~sunilg], [~cheersyang] > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch, YARN-8468.004.patch, > YARN-8468.005.patch, YARN-8468.006.patch, YARN-8468.007.patch, > YARN-8468.008.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited by queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default value for the maximum > container size for all queues, and the maximum resources per queue would be set with > a “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf), this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue as follows > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
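A sketch of the per-queue cap semantics under discussion (illustrative only, not taken from the attached patches): the effective per-queue maximum would be the component-wise minimum of the queue's configured cap and the scheduler-wide maximum.
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class QueueMaxAllocationSketch {
  // Effective maximum container size for a queue: its own cap, but never
  // larger than the scheduler-wide yarn.scheduler.maximum-allocation-* values.
  static Resource effectiveQueueMax(Resource queueCap, Resource schedulerMax) {
    if (queueCap == null) {
      return schedulerMax; // no per-queue cap configured
    }
    return Resources.componentwiseMin(queueCap, schedulerMax);
  }

  public static void main(String[] args) {
    Resource schedulerMax = Resource.newInstance(8192, 4);
    Resource adhocQueueCap = Resource.newInstance(2048, 2);
    // Prints a 2048 MB / 2 vCore cap for the ad hoc queue.
    System.out.println(effectiveQueueMax(adhocQueueCap, schedulerMax));
  }
}
{code}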
[jira] [Commented] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails
[ https://issues.apache.org/jira/browse/YARN-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596921#comment-16596921 ] Jason Lowe commented on YARN-8730: -- trunk and other releases ahead of 2.8 do not do this since they changed the annotation of the ResourceInfo class from XmlAccessType.FIELD to XmlAccessType.NONE in YARN-6232 when the same Resource field was added to ResourceInfo. branch-2.8 needs a similar setup so the "res" field is not advertised in the REST API. > TestRMWebServiceAppsNodelabel#testAppsRunning fails > --- > > Key: YARN-8730 > URL: https://issues.apache.org/jira/browse/YARN-8730 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Priority: Major > > TestRMWebServiceAppsNodelabel is failing in branch-2.8: > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel > testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel) > Time elapsed: 6.708 sec <<< FAILURE! > org.junit.ComparisonFailure: partition amused > expected:<{"[]memory":1024,"vCores...> but > was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...> > at org.junit.Assert.assertEquals(Assert.java:115) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
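To illustrate the accessor-type difference referenced above, a simplified stand-in (not the actual ResourceInfo class): with XmlAccessType.FIELD every field, including an internal "res" field, is marshalled automatically; with XmlAccessType.NONE only explicitly annotated members are exposed.
{code:java}
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Simplified stand-in for the DAO discussed above, not the real ResourceInfo.
@XmlRootElement
@XmlAccessorType(XmlAccessType.NONE)
public class ResourceInfoSketch {
  private Object res; // internal field: NOT advertised under NONE

  @XmlElement
  public long getMemory() { return 1024; }

  @XmlElement
  public int getvCores() { return 1; }
}
{code}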
[jira] [Commented] (YARN-7619) Max AM Resource value in Capacity Scheduler UI has to be refreshed for every user
[ https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596922#comment-16596922 ] Jason Lowe commented on YARN-7619: -- The branch-2.8 version of this patch unfortunately affected the 2.8 REST API when it modified the ResourceInfo DAO class. See YARN-8730 for details. > Max AM Resource value in Capacity Scheduler UI has to be refreshed for every > user > - > > Key: YARN-7619 > URL: https://issues.apache.org/jira/browse/YARN-7619 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Major > Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4 > > Attachments: Max AM Resources is Different for Each User.png, > YARN-7619.001.patch, YARN-7619.002.patch, YARN-7619.003.patch, > YARN-7619.004.branch-2.8.patch, YARN-7619.004.branch-3.0.patch, > YARN-7619.004.patch, YARN-7619.005.branch-2.8.patch, > YARN-7619.005.branch-3.0.patch, YARN-7619.005.patch > > > YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity > scheduler UI used to contain the queue-level AM limit instead of the > user-level AM limit. It fixed this by using the user-specific AM limit that > is calculated in {{LeafQueue#activateApplications}}, stored in each user's > {{LeafQueue#User}} object, and retrieved via > {{UserInfo#getResourceUsageInfo}}. > The problem is that this user-specific AM limit depends on the activity of > other users and other applications in a queue, and it is only calculated and > updated when a user's application is activated. So, when > {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale > value unless an application was recently activated for a particular user. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails
[ https://issues.apache.org/jira/browse/YARN-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-8730: - Affects Version/s: 2.8.4 > TestRMWebServiceAppsNodelabel#testAppsRunning fails > --- > > Key: YARN-8730 > URL: https://issues.apache.org/jira/browse/YARN-8730 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Priority: Major > > TestRMWebServiceAppsNodelabel is failing in branch-2.8: > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel > testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel) > Time elapsed: 6.708 sec <<< FAILURE! > org.junit.ComparisonFailure: partition amused > expected:<{"[]memory":1024,"vCores...> but > was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...> > at org.junit.Assert.assertEquals(Assert.java:115) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails
[ https://issues.apache.org/jira/browse/YARN-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596911#comment-16596911 ] Jason Lowe commented on YARN-8730: -- git bisect narrows this down to YARN-7619. A "res" field was added to the ResourceInfo DAO object, and since all fields are advertised by default it incorrectly appears in query output. The trunk version of ResourceInfo does not automatically advertise fields by default. > TestRMWebServiceAppsNodelabel#testAppsRunning fails > --- > > Key: YARN-8730 > URL: https://issues.apache.org/jira/browse/YARN-8730 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Jason Lowe >Priority: Major > > TestRMWebServiceAppsNodelabel is failing in branch-2.8: > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel > testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel) > Time elapsed: 6.708 sec <<< FAILURE! > org.junit.ComparisonFailure: partition amused > expected:<{"[]memory":1024,"vCores...> but > was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...> > at org.junit.Assert.assertEquals(Assert.java:115) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails
Jason Lowe created YARN-8730: Summary: TestRMWebServiceAppsNodelabel#testAppsRunning fails Key: YARN-8730 URL: https://issues.apache.org/jira/browse/YARN-8730 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jason Lowe TestRMWebServiceAppsNodelabel is failing in branch-2.8: {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel) Time elapsed: 6.708 sec <<< FAILURE! org.junit.ComparisonFailure: partition amused expected:<{"[]memory":1024,"vCores...> but was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...> at org.junit.Assert.assertEquals(Assert.java:115) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222) at org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8051) TestRMEmbeddedElector#testCallbackSynchronization is flakey
[ https://issues.apache.org/jira/browse/YARN-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596905#comment-16596905 ] Eric Payne commented on YARN-8051: -- {quote} -1 unit336m 4s hadoop-yarn-server-resourcemanager in the patch failed. -1 asflicense 0m 32s The patch generated 1 ASF License warnings. {quote} I think these are unrelated. The unit test that was modified does have a previously included ASF section, and I manually verified that the unit test succeeds for branch-2, branch-2.9, and branch-2.8. > TestRMEmbeddedElector#testCallbackSynchronization is flakey > --- > > Key: YARN-8051 > URL: https://issues.apache.org/jira/browse/YARN-8051 > Project: Hadoop YARN > Issue Type: Improvement > Components: test >Affects Versions: 2.10.0, 2.9.1, 2.8.4, 3.0.2, 3.2.0, 3.1.1 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Fix For: 3.2.0, 3.0.4, 3.1.2 > > Attachments: YARN-8051-branch-2.002.patch, YARN-8051.001.patch, > YARN-8051.002.patch > > > We've seen some rare flakey failures in > {{TestRMEmbeddedElector#testCallbackSynchronization}}: > {noformat} > org.mockito.exceptions.verification.WantedButNotInvoked: > Wanted but not invoked: > adminService.transitionToStandby(); > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215) > Actually, there were zero interactions with this mock. > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:146) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:109) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8051) TestRMEmbeddedElector#testCallbackSynchronization is flakey
[ https://issues.apache.org/jira/browse/YARN-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596901#comment-16596901 ] Eric Payne commented on YARN-8051: -- +1. Committing to branch-2, branch-2.9, and branch-2.8 > TestRMEmbeddedElector#testCallbackSynchronization is flakey > --- > > Key: YARN-8051 > URL: https://issues.apache.org/jira/browse/YARN-8051 > Project: Hadoop YARN > Issue Type: Improvement > Components: test >Affects Versions: 2.10.0, 2.9.1, 2.8.4, 3.0.2, 3.2.0, 3.1.1 >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Fix For: 3.2.0, 3.0.4, 3.1.2 > > Attachments: YARN-8051-branch-2.002.patch, YARN-8051.001.patch, > YARN-8051.002.patch > > > We've seen some rare flakey failures in > {{TestRMEmbeddedElector#testCallbackSynchronization}}: > {noformat} > org.mockito.exceptions.verification.WantedButNotInvoked: > Wanted but not invoked: > adminService.transitionToStandby(); > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215) > Actually, there were zero interactions with this mock. > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:146) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:109) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period
[ https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596887#comment-16596887 ] Eric Badger commented on YARN-8706: --- I think the default sleep delay between STOPSIGNAL and SIGKILL is irrelevant to this specific JIRA. It's certainly something that we can discuss and maybe it is reasonable to increase the value, but I don't think that it has anything to do with how we send the signals. That discussion is a separate issue. I am still in favor of my original proposal. > DelayedProcessKiller is executed for Docker containers even though docker > stop sends a KILL signal after the specified grace period > --- > > Key: YARN-8706 > URL: https://issues.apache.org/jira/browse/YARN-8706 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Labels: docker > > {{DockerStopCommand}} adds a grace period of 10 seconds. > 10 seconds is also the default grace time used by docker stop > [https://docs.docker.com/engine/reference/commandline/stop/] > Documentation of docker stop: > {quote}the main process inside the container will receive {{SIGTERM}}, and > after a grace period, {{SIGKILL}}. > {quote} > There is a {{DelayedProcessKiller}} in {{ContainerExecutor}} which executes > for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By > default this is set to {{250 milliseconds}} and so irrespective of the > container type, it will always get executed. > > For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} > after the grace period > - when sleepDelayBeforeSigKill > 10 seconds, then there is no point in > executing DelayedProcessKiller > - when sleepDelayBeforeSigKill < 1 second, then the grace period should be > the smallest value, which is 1 second, because anyway we are forcing a kill > after 250 ms > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
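A hypothetical sketch of the logic proposed in the issue description (illustrative names only, not the actual NodeManager implementation): derive docker stop's grace period from the NM sigkill delay, never passing less than one second, and skip the separate DelayedProcessKiller pass for docker containers since docker stop already escalates to SIGKILL.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DockerStopGraceSketch {
  // Grace period handed to docker stop, derived from the NM sigkill delay.
  static int dockerStopGraceSeconds(Configuration conf) {
    long sleepDelayMs = conf.getLong(
        YarnConfiguration.NM_SLEEP_DELAY_BEFORE_SIGKILL_MS,
        YarnConfiguration.DEFAULT_NM_SLEEP_DELAY_BEFORE_SIGKILL_MS);
    // docker stop takes whole seconds; never pass less than 1 second.
    return (int) Math.max(1, sleepDelayMs / 1000);
  }

  static boolean needDelayedProcessKiller(boolean isDockerContainer) {
    // docker stop already escalates to SIGKILL after the grace period,
    // so the extra kill pass is only needed for non-docker containers.
    return !isDockerContainer;
  }

  public static void main(String[] args) {
    System.out.println(dockerStopGraceSeconds(new YarnConfiguration())); // 1 with the 250 ms default
    System.out.println(needDelayedProcessKiller(true));                  // false
  }
}
{code}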
[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period
[ https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596883#comment-16596883 ] Eric Yang commented on YARN-8706: - {quote}Why is this specific to docker containers? Other types of containers maybe dealing with data and if the default grace period of 250 millis is too small, then it can be changed with the config NM_SLEEP_DELAY_BEFORE_SIGKILL_MS. Maybe this should be something that the application could specify as well, but that is a different discussion.{quote} YARN containers were mostly stateless and not reused, so the short termination wait time worked without causing problems for Hadoop-specific applications. With the introduction of Docker containers, it might take several seconds to gracefully shut down a database daemon. A 10-second default seems like a safer wait time if the docker container is persisted and reused. There aren't many data points yet to show that waiting longer is better, so the default setting may be revisited later. > DelayedProcessKiller is executed for Docker containers even though docker > stop sends a KILL signal after the specified grace period > --- > > Key: YARN-8706 > URL: https://issues.apache.org/jira/browse/YARN-8706 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Labels: docker > > {{DockerStopCommand}} adds a grace period of 10 seconds. > 10 seconds is also the default grace time use by docker stop > [https://docs.docker.com/engine/reference/commandline/stop/] > Documentation of the docker stop: > {quote}the main process inside the container will receive {{SIGTERM}}, and > after a grace period, {{SIGKILL}}. > {quote} > There is a {{DelayedProcessKiller}} in {{ContainerExcecutor}} which executes > for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By > default this is set to {{250 milliseconds}} and so irrespective of the > container type, it will always get executed. > > For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} > after the grace period > - when sleepDelayBeforeSigKill > 10 seconds, then there is no point of > executing DelayedProcessKiller > - when sleepDelayBeforeSigKill < 1 second, then the grace period should be > the smallest value, which is 1 second, because anyways we are forcing kill > after 250 ms > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8638) Allow linux container runtimes to be pluggable
[ https://issues.apache.org/jira/browse/YARN-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596868#comment-16596868 ] Kenny Chang commented on YARN-8638: --- +1 Noob, non-binding. > Allow linux container runtimes to be pluggable > -- > > Key: YARN-8638 > URL: https://issues.apache.org/jira/browse/YARN-8638 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.2.0 >Reporter: Craig Condit >Assignee: Craig Condit >Priority: Minor > Attachments: YARN-8638.001.patch, YARN-8638.002.patch, > YARN-8638.003.patch, YARN-8638.004.patch > > > YARN currently supports three different Linux container runtimes (default, > docker, and javasandbox). However, it would be relatively straightforward to > support arbitrary runtime implementations. This would enable easier > experimentation with new and emerging runtime technologies (runc, containerd, > etc.) without requiring a rebuild and redeployment of Hadoop. > This could be accomplished via a simple configuration change: > {code:xml} > > yarn.nodemanager.runtime.linux.allowed-runtimes > default,docker,experimental > > > > yarn.nodemanager.runtime.linux.experimental.class > com.somecompany.yarn.runtime.ExperimentalLinuxContainerRuntime > {code} > > In this example, {{yarn.nodemanager.runtime.linux.allowed-runtimes}} would > now allow arbitrary values. Additionally, > {{yarn.nodemanager.runtime.linux.\{RUNTIME_KEY}.class}} would indicate the > {{LinuxContainerRuntime}} implementation to instantiate. A no-argument > constructor should be sufficient, as {{LinuxContainerRuntime}} already > provides an {{initialize()}} method. > {{DockerLinuxContainerRuntime.isDockerContainerRequested(Map > env)}} and {{JavaSandboxLinuxContainerRuntime.isSandboxContainerRequested()}} > could be generalized to {{isRuntimeRequested(Map env)}} and > added to the {{LinuxContainerRuntime}} interface. This would allow > {{DelegatingLinuxContainerRuntime}} to select an appropriate runtime based on > whether that runtime claimed ownership of the current container execution. > For backwards compatibility, the existing values (default,docker,javasandbox) > would continue to be supported as-is. Under the current logic, the evaluation > order is javasandbox, docker, default (with default being chosen if no other > candidates are available). Under the new evaluation logic, pluggable runtimes > would be evaluated after docker and before default, in the order in which > they are defined in the allowed-runtimes list. This will change no behavior > on current clusters (as there would be no pluggable runtimes defined), and > preserves behavior with respect to ordering of existing runtimes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
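To make the proposed generalization concrete, here is a minimal sketch of how a delegating runtime could pick the first runtime that claims a container, assuming the isRuntimeRequested(Map env) method described above; the interfaces below are simplified stand-ins, not the actual YARN classes.
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the proposed interface method; the real
// LinuxContainerRuntime interface has more methods than shown here.
interface PluggableRuntime {
  boolean isRuntimeRequested(Map<String, String> env);
}

class DelegatingRuntimeSketch {
  // Runtimes registered in the order given by
  // yarn.nodemanager.runtime.linux.allowed-runtimes, with "default" last.
  private final Map<String, PluggableRuntime> runtimes = new LinkedHashMap<>();

  void register(String name, PluggableRuntime runtime) {
    runtimes.put(name, runtime);
  }

  PluggableRuntime pickRuntime(Map<String, String> env) {
    for (PluggableRuntime runtime : runtimes.values()) {
      // The first runtime that claims the container wins; the default
      // runtime acts as the final fallback by always returning true.
      if (runtime.isRuntimeRequested(env)) {
        return runtime;
      }
    }
    throw new IllegalStateException("No runtime claimed this container");
  }
}
{code}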
[jira] [Updated] (YARN-6972) Adding RM ClusterId in AppInfo
[ https://issues.apache.org/jira/browse/YARN-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanuj Nayak updated YARN-6972: -- Attachment: YARN-6972.016.patch > Adding RM ClusterId in AppInfo > -- > > Key: YARN-6972 > URL: https://issues.apache.org/jira/browse/YARN-6972 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Tanuj Nayak >Priority: Major > Attachments: YARN-6972.001.patch, YARN-6972.002.patch, > YARN-6972.003.patch, YARN-6972.004.patch, YARN-6972.005.patch, > YARN-6972.006.patch, YARN-6972.007.patch, YARN-6972.008.patch, > YARN-6972.009.patch, YARN-6972.010.patch, YARN-6972.011.patch, > YARN-6972.012.patch, YARN-6972.013.patch, YARN-6972.014.patch, > YARN-6972.015.patch, YARN-6972.016.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-6338) Typos in Docker docs: contains => containers
[ https://issues.apache.org/jira/browse/YARN-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Siegl reassigned YARN-6338: -- Assignee: Zoltan Siegl (was: Szilard Nemeth) > Typos in Docker docs: contains => containers > > > Key: YARN-6338 > URL: https://issues.apache.org/jira/browse/YARN-6338 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.9.0, 3.0.0-alpha4 >Reporter: Daniel Templeton >Assignee: Zoltan Siegl >Priority: Minor > Labels: docs > > "allowed to request privileged contains" should be "allowed to request > privileged containers" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
[ https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596230#comment-16596230 ] Pradeep Ambati edited comment on YARN-8680 at 8/29/18 6:38 PM: --- Hi [~sunilg], This jira is marked as critical for 3.2. I can definitely take it forward if it is not feasible to complete it in coming weeks. was (Author: pradeepambati): Hi [~sunilg], This jira is marked as critical for 3.2. I can definitely take it forward if it is not feasible to complete it in coming weeks. > YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate > - > > Key: YARN-8680 > URL: https://issues.apache.org/jira/browse/YARN-8680 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Pradeep Ambati >Assignee: Pradeep Ambati >Priority: Critical > Attachments: YARN-8680.00.patch, YARN-8680.01.patch > > > Similar to YARN-8242, implement iterable abstraction for > LocalResourceTrackerState to load completed and in progress resources when > needed rather than loading them all at a time for a respective state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
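For context on the abstraction being discussed, a minimal sketch of replacing an eagerly loaded list of recovered resources with a lazy iterator, using hypothetical types rather than the actual NM state-store classes; the attached patches work against the NM leveldb state store and will look different.
{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical record type standing in for a recovered local resource entry.
class RecoveredResource {
  final String localPath;
  RecoveredResource(String localPath) { this.localPath = localPath; }
}

// Instead of materializing every completed/in-progress resource up front,
// expose them lazily so recovery only decodes records as they are consumed.
class LocalResourceTrackerStateSketch {
  private final Iterator<byte[]> rawRecords;

  LocalResourceTrackerStateSketch(Iterator<byte[]> rawRecords) {
    this.rawRecords = rawRecords;
  }

  Iterator<RecoveredResource> completedResourcesIterator() {
    return new Iterator<RecoveredResource>() {
      @Override public boolean hasNext() { return rawRecords.hasNext(); }
      @Override public RecoveredResource next() {
        if (!hasNext()) { throw new NoSuchElementException(); }
        // Decode one record at a time instead of building a full list.
        return new RecoveredResource(new String(rawRecords.next()));
      }
    };
  }
}
{code}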
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596699#comment-16596699 ] Jason Lowe commented on YARN-8648: -- Thanks for the patch! Why was the postComplete call moved in reapContainer to before the container is removed via docker? Shouldn't docker first remove its cgroups for the container before we remove ours? Is there a reason to separate removing docker cgroups from removing the docker container? This seems like a natural extension to cleaning up after a container run by docker, and that's already covered by the reap command. The patch would remain a docker-only change but without needing to modify the container-executor interface. Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased. Similar for CGROUP_MOUNT. The snprintf result should be checked for truncation in addition to output errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we formulate an incomplete path targeted for deletion if that somehow occurs. Alternatively the code could use make_string or asprintf to allocate an appropriately sized buffer for each entry rather than trying to reuse a manually sized buffer. Is there any point in logging to the error file that a path we want to delete has already been deleted? This seems like it will just be noise, especially if systemd or something else is periodically cleaning some of these empty cgroups. Related to the previous comment, the rmdir result should be checked for ENOENT and treat that as success. Nit: I think lineptr should be freed in the cleanup label in case someone later adds a fatal error that jumps to cleanup. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period
[ https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596637#comment-16596637 ] Chandni Singh commented on YARN-8706: - {quote}I am not entirely sure about globally identical killing mechanism for all container type, is a sane approach to brute force container shutdown. {quote} I am not sure what you mean. NM does a graceful shutdown for all types of containers. It first sends a {{SIGTERM}} and then after a grace period, sends {{SIGKILL}}. The {{SIGTERM}} for docker is handled by docker stop, which has the following problems: 1. grace period can be specified only in seconds 2. clubs {{SIGKILL}} with stop. Docker first sends a {{STOPSIGNAL}} to the root process and then after the grace period, sends {{SIGKILL}} to the root process. This is not what NM wants with the stop and docker stop doesn't give any option to NOT send {{SIGKILL}} The proposed change by [~ebadger] will just send the {{STOPSIGNAL}} which solves our problem. {quote}10 seconds default is probably more sensible to give the container a chance to shutdown gracefully without causing corruption to data. {quote} Why is this specific to docker containers? Other types of containers maybe dealing with data and if the default grace period of 250 millis is too small, then it can be changed with the config {{NM_SLEEP_DELAY_BEFORE_SIGKILL_MS}}. Maybe this should be something that the application could specify as well, but that is a different discussion. > DelayedProcessKiller is executed for Docker containers even though docker > stop sends a KILL signal after the specified grace period > --- > > Key: YARN-8706 > URL: https://issues.apache.org/jira/browse/YARN-8706 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Labels: docker > > {{DockerStopCommand}} adds a grace period of 10 seconds. > 10 seconds is also the default grace time use by docker stop > [https://docs.docker.com/engine/reference/commandline/stop/] > Documentation of the docker stop: > {quote}the main process inside the container will receive {{SIGTERM}}, and > after a grace period, {{SIGKILL}}. > {quote} > There is a {{DelayedProcessKiller}} in {{ContainerExcecutor}} which executes > for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By > default this is set to {{250 milliseconds}} and so irrespective of the > container type, it will always get executed. > > For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} > after the grace period > - when sleepDelayBeforeSigKill > 10 seconds, then there is no point of > executing DelayedProcessKiller > - when sleepDelayBeforeSigKill < 1 second, then the grace period should be > the smallest value, which is 1 second, because anyways we are forcing kill > after 250 ms > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
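As a rough illustration of the proposal discussed in this thread (deliver only the termination signal to the container and leave the eventual SIGKILL to the NM's existing escalation), here is a minimal sketch that builds plain docker CLI invocations; the helper class is hypothetical and simplified, not the NM's actual DockerCommand plumbing, and it sends SIGTERM rather than an image-specific STOPSIGNAL.
{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing the signal flow under discussion: a graceful
// stop forwards only the termination signal; the force kill is issued later
// by the NM only if the container is still alive after the configured
// sleepDelayBeforeSigKill interval.
class DockerSignalSketch {

  List<String> buildGracefulStopCommand(String containerName) {
    // "docker kill --signal" delivers the signal without docker scheduling
    // its own SIGKILL, unlike "docker stop".
    return Arrays.asList("docker", "kill", "--signal=TERM", containerName);
  }

  List<String> buildForceKillCommand(String containerName) {
    return Arrays.asList("docker", "kill", "--signal=KILL", containerName);
  }
}
{code}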
[jira] [Comment Edited] (YARN-8642) Add support for tmpfs mounts with the Docker runtime
[ https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596591#comment-16596591 ] Eric Yang edited comment on YARN-8642 at 8/29/18 5:04 PM: -- {quote} This need will be dependent on what is running in the container. It would be nice to be able to reference UID and GID by variable, as you've outlined. Maybe resolving those variables within the mount related environment variables is a task the YARN Services AM could handle? Could we discuss in a follow on since this seems like a useful feature beyond just the tmpfs mounts?{quote} I did some experiment and found that docker will create unique sandbox_file in tmpfs per container. There is no need to pre-partition in the command that we supply to docker. Therefore, there is no security concern if multiple container uses /run as tmpfs. Data is not going to be shared among containers. Therefore, the patch is good. We don't need a follow up JIRA for this. Thank you for the commit [~shaneku...@gmail.com]. was (Author: eyang): {quote} This need will be dependent on what is running in the container. It would be nice to be able to reference UID and GID by variable, as you've outlined. Maybe resolving those variables within the mount related environment variables is a task the YARN Services AM could handle? Could we discuss in a follow on since this seems like a useful feature beyond just the tmpfs mounts?{quote} I did some experiment and found that docker will create unique sandbox_file in tmpfs per container. There is no need to pre-partition in the command that we supply to docker. Therefore, there is no security concern if multiple container uses /run as tmpfs. Data is not going to be shared among containers. Therefore, the patch is good. Thank you for the commit [~shaneku...@gmail.com]. > Add support for tmpfs mounts with the Docker runtime > > > Key: YARN-8642 > URL: https://issues.apache.org/jira/browse/YARN-8642 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Shane Kumpf >Assignee: Craig Condit >Priority: Major > Labels: Docker > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8642.001.patch, YARN-8642.002.patch > > > Add support to the existing Docker runtime to allow the user to request tmpfs > mounts for their containers. For example: > {code}/usr/bin/docker run --name=container_name --tmpfs /run image > /bootstrap/start-systemd > {code} > One use case is to allow systemd to run as PID 1 in a non-privileged > container, /run is expected to be a tmpfs mount in the container for that to > work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8642) Add support for tmpfs mounts with the Docker runtime
[ https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596591#comment-16596591 ] Eric Yang edited comment on YARN-8642 at 8/29/18 5:02 PM: -- {quote} This need will be dependent on what is running in the container. It would be nice to be able to reference UID and GID by variable, as you've outlined. Maybe resolving those variables within the mount related environment variables is a task the YARN Services AM could handle? Could we discuss in a follow on since this seems like a useful feature beyond just the tmpfs mounts?{quote} I did some experiment and found that docker will create unique sandbox_file in tmpfs per container. There is no need to pre-partition in the command that we supply to docker. Therefore, there is no security concern if multiple container uses /run as tmpfs. Data is not going to be shared among containers. Therefore, the patch is good. Thank you for the commit [~shaneku...@gmail.com]. was (Author: eyang): {quote} This need will be dependent on what is running in the container. It would be nice to be able to reference UID and GID by variable, as you've outlined. Maybe resolving those variables within the mount related environment variables is a task the YARN Services AM could handle? Could we discuss in a follow on since this seems like a useful feature beyond just the tmpfs mounts?{quote} I was thinking to partition it via container id or some mechanism that can be clean up automatically. YARN service or upper layer don't have the visibility to absolute path or convention used in container-executor. It might be better to keep this path logic contained in DockerLinuxContainerRuntime. Sorry, I replied too late, will open another JIRA to refine this. > Add support for tmpfs mounts with the Docker runtime > > > Key: YARN-8642 > URL: https://issues.apache.org/jira/browse/YARN-8642 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Shane Kumpf >Assignee: Craig Condit >Priority: Major > Labels: Docker > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8642.001.patch, YARN-8642.002.patch > > > Add support to the existing Docker runtime to allow the user to request tmpfs > mounts for their containers. For example: > {code}/usr/bin/docker run --name=container_name --tmpfs /run image > /bootstrap/start-systemd > {code} > One use case is to allow systemd to run as PID 1 in a non-privileged > container, /run is expected to be a tmpfs mount in the container for that to > work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8642) Add support for tmpfs mounts with the Docker runtime
[ https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596591#comment-16596591 ] Eric Yang commented on YARN-8642: - {quote} This need will be dependent on what is running in the container. It would be nice to be able to reference UID and GID by variable, as you've outlined. Maybe resolving those variables within the mount related environment variables is a task the YARN Services AM could handle? Could we discuss in a follow on since this seems like a useful feature beyond just the tmpfs mounts?{quote} I was thinking to partition it via container id or some mechanism that can be clean up automatically. YARN service or upper layer don't have the visibility to absolute path or convention used in container-executor. It might be better to keep this path logic contained in DockerLinuxContainerRuntime. Sorry, I replied too late, will open another JIRA to refine this. > Add support for tmpfs mounts with the Docker runtime > > > Key: YARN-8642 > URL: https://issues.apache.org/jira/browse/YARN-8642 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Shane Kumpf >Assignee: Craig Condit >Priority: Major > Labels: Docker > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8642.001.patch, YARN-8642.002.patch > > > Add support to the existing Docker runtime to allow the user to request tmpfs > mounts for their containers. For example: > {code}/usr/bin/docker run --name=container_name --tmpfs /run image > /bootstrap/start-systemd > {code} > One use case is to allow systemd to run as PID 1 in a non-privileged > container, /run is expected to be a tmpfs mount in the container for that to > work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application
[ https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596578#comment-16596578 ] Eric Yang commented on YARN-8569: - {quote}There were tons of debates regarding to yarn user should be treated as root or not before. We saw some issues of c-e causes yarn user can manipulate other user's directories, or directly escalate to root user. All of these issues become CVE.{quote} If I recall correctly, I reported and fixed container-executor security issues like YARN-7590 and YARN-8207. I think I have written proper security checks to make sure the caller over the network has the right Kerberos tgt that matches the end user's container directory, and also validated that the source of the data is coming from the node manager private directory. There is a permission validation when copying spec file information from the node manager private directory to the end user's container directory. This design is similar to transporting delegation tokens to the container working directory. I think there are good enough security validations to ensure no security hole has been added by this work. Let me know if you find security holes. {quote} >From YARN's design purpose, ideally all NM/RM logics should be as general as >possible, all service-related stuffs should be handled by service framework >like API server or ServiceMaster. I really don't like the idea of adding >service-specific API to NM API.{quote} The new API is not YARN service framework specific. ContainerExecutor provides basic APIs for starting, stopping, and cleaning up containers, but it is missing more sophisticated APIs such as synchronizing configuration among containers. The new API syncYarnSysFS is proposed to allow ContainerExecutor developers to write their own implementation for populating text information to /hadoop/yarn/sysfs. A custom AM can be written to use the new API and populate other text information. The newly added API in the node manager is generic and avoids the double serialization cost that exists in the container manager protobuf and RPC code. There is no extra serialization cost for the content during transport, so the new API is more efficient and lightweight. There is nothing specific to YARN service here, although YARN service is the first consumer of this API. {quote} 1) ServiceMaster ro mount a local directory (under the container's local dir) when launch docker container (example like: ./service-info -> /service/sys/fs/) 2) ServiceMaster request to re-localize new service spec json file to the ./service-info folder. {quote} What is "ro mount" in the first sentence? Is it remote mount or read-only mount? The ServiceMaster is not guaranteed to run on the same node as other containers. Hence, there is no practical way to mount the service master's directory into a container's local directory across nodes. > Create an interface to provide cluster information to application > - > > Key: YARN-8569 > URL: https://issues.apache.org/jira/browse/YARN-8569 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8569.001.patch, YARN-8569.002.patch > > > Some program requires container hostnames to be known for application to run. 
> For example, distributed tensorflow requires launch_command that looks like: > {code} > # On ps0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=0 > # On ps1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=1 > # On worker0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=0 > # On worker1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=1 > {code} > This is a bit cumbersome to orchestrate via Distributed Shell, or YARN > services launch_command. In addition, the dynamic parameters do not work > with YARN flex command. This is the classic pain point for application > developer attempt to automate system environment settings as parameter to end > user application. > It would be great if YARN Docker integration can provide a simple option to > expose hostnames of the yarn service via a mounted file. The file content > gets updated whe
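To make the data flow described above concrete, a minimal sketch of how an AM might hand cluster information to the NM for it to materialize under a container's /hadoop/yarn/sysfs mount. The SysFsPublisher interface and its method are hypothetical illustrations of the syncYarnSysFS idea, not the API added by the attached patches.
{code:java}
// Hypothetical AM-side view of the "sync cluster info to sysfs" idea: the AM
// hands the NM an opaque text blob (for example a JSON document listing
// component hostnames), and the NM writes it under the container's
// /hadoop/yarn/sysfs mount so the application can read it like a file.
interface SysFsPublisher {
  void syncYarnSysFS(String user, String applicationId, String specJson);
}

class ClusterInfoPublisherSketch {
  private final SysFsPublisher publisher;

  ClusterInfoPublisherSketch(SysFsPublisher publisher) {
    this.publisher = publisher;
  }

  void publishHostnames(String user, String appId, Iterable<String> hosts) {
    StringBuilder spec = new StringBuilder("{\"hosts\":[");
    String separator = "";
    for (String host : hosts) {
      spec.append(separator).append('"').append(host).append('"');
      separator = ",";
    }
    spec.append("]}");
    // Containers then read the hostname list from the mounted file instead
    // of receiving it through launch_command parameters.
    publisher.syncYarnSysFS(user, appId, spec.toString());
  }
}
{code}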
[jira] [Commented] (YARN-8644) Make RMAppImpl$FinalTransition more readable + add more test coverage
[ https://issues.apache.org/jira/browse/YARN-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596476#comment-16596476 ] Szilard Nemeth commented on YARN-8644: -- Hi [~zsiegl]! Thanks for your review comments, both findings are good catches. Please check my updated patch! Thanks! > Make RMAppImpl$FinalTransition more readable + add more test coverage > - > > Key: YARN-8644 > URL: https://issues.apache.org/jira/browse/YARN-8644 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-8644.001.patch, YARN-8644.002.patch, > YARN-8644.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8644) Make RMAppImpl$FinalTransition more readable + add more test coverage
[ https://issues.apache.org/jira/browse/YARN-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-8644: - Attachment: YARN-8644.003.patch > Make RMAppImpl$FinalTransition more readable + add more test coverage > - > > Key: YARN-8644 > URL: https://issues.apache.org/jira/browse/YARN-8644 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-8644.001.patch, YARN-8644.002.patch, > YARN-8644.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7865) Node attributes documentation
[ https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596412#comment-16596412 ] Sunil Govindan commented on YARN-7865: -- Thanks [~Naganarasimha]. Some quick comments. # The Node Attributes page needs to be linked in the left panel. Please add it in the hadoop-yarn site project. # Could we also mention what's not supported in REST. # Distributed node-to-Attributes mapping ==> Distributed Node Attributes mapping. # Unlike Node Labels, Node Attributes need not be explicitly enabled as it will be always existing and would have no impact in terms of performance or compatability even if not used. ==> Unlike Node Labels, Node Attributes need not to be explicitly enabled as it will always exists and would have no impact in terms of performance or compatibility even if feature is not used. > Node attributes documentation > - > > Key: YARN-7865 > URL: https://issues.apache.org/jira/browse/YARN-7865 > Project: Hadoop YARN > Issue Type: Sub-task > Components: documentation >Reporter: Weiwei Yang >Assignee: Naganarasimha G R >Priority: Major > Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, > YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch > > > We need proper docs to introduce how to enable node-attributes how to > configure providers, how to specify script paths, arguments in configuration, > what should be the proper permission of the script and who will run the > script. Also it would be good to add more info to the description of the > configuration properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-3409) Support Node Attribute functionality
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R updated YARN-3409: Attachment: Node-Attributes-Requirements-Design-doc_v2.pdf > Support Node Attribute functionality > > > Key: YARN-3409 > URL: https://issues.apache.org/jira/browse/YARN-3409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, client, RM >Reporter: Wangda Tan >Assignee: Naganarasimha G R >Priority: Major > Attachments: 3409-apiChanges_v2.pdf (4).pdf, > Constraint-Node-Labels-Requirements-Design-doc_v1.pdf, > Node-Attributes-Requirements-Design-doc_v2.pdf, YARN-3409.WIP.001.patch > > > Specify only one label for each node (IAW, partition a cluster) is a way to > determinate how resources of a special set of nodes could be shared by a > group of entities (like teams, departments, etc.). Partitions of a cluster > has following characteristics: > - Cluster divided to several disjoint sub clusters. > - ACL/priority can apply on partition (Only market team / marke team has > priority to use the partition). > - Percentage of capacities can apply on partition (Market team has 40% > minimum capacity and Dev team has 60% of minimum capacity of the partition). > Attributes are orthogonal to partition, they’re describing features of node’s > hardware/software just for affinity. Some example of attributes: > - glibc version > - JDK version > - Type of CPU (x86_64/i686) > - Type of OS (windows, linux, etc.) > With this, application can be able to ask for resource has (glibc.version >= > 2.20 && JDK.version >= 8u20 && x86_64). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource
[ https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-7903: Assignee: Szilard Nemeth > Method getStarvedResourceRequests() only consider the first encountered > resource > > > Key: YARN-7903 > URL: https://issues.apache.org/jira/browse/YARN-7903 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Yufei Gu >Assignee: Szilard Nemeth >Priority: Major > > We need to specify rack and ANY while submitting a node local resource > request, as YARN-7561 discussed. For example: > {code} > ResourceRequest nodeRequest = > createResourceRequest(GB, node1.getHostName(), 1, 1, false); > ResourceRequest rackRequest = > createResourceRequest(GB, node1.getRackName(), 1, 1, false); > ResourceRequest anyRequest = > createResourceRequest(GB, ResourceRequest.ANY, 1, 1, false); > List resourceRequests = > Arrays.asList(nodeRequest, rackRequest, anyRequest); > {code} > However, method getStarvedResourceRequests() only consider the first > encountered resource, which most likely is ResourceRequest.ANY. That's a > mismatch for locality request. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
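A minimal sketch of one possible direction implied by the report, namely not letting whichever request is iterated first (usually ANY) decide the starved resource; the types below are simplified stand-ins for the FairScheduler internals and this is not the actual fix.
{code:java}
import java.util.List;

class StarvedRequestSketch {

  // Simplified stand-in for a pending ResourceRequest: resourceName is a
  // host, a rack, or "*" (ANY).
  static class Request {
    final String resourceName;
    Request(String resourceName) { this.resourceName = resourceName; }
    boolean isAny() { return "*".equals(resourceName); }
  }

  // Prefer a locality-specific request over ANY when choosing which pending
  // request represents the starved resource, instead of taking the first
  // request encountered.
  static Request pickMostSpecific(List<Request> pending) {
    Request best = null;
    for (Request request : pending) {
      if (best == null || (best.isAny() && !request.isAny())) {
        best = request;
      }
    }
    return best;
  }
}
{code}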
[jira] [Updated] (YARN-8286) Add NMClient callback on container relaunch
[ https://issues.apache.org/jira/browse/YARN-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Billie Rinaldi updated YARN-8286: - Target Version/s: 3.3.0 (was: 3.2.0) > Add NMClient callback on container relaunch > --- > > Key: YARN-8286 > URL: https://issues.apache.org/jira/browse/YARN-8286 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Priority: Critical > > The AM may need to perform actions when a container has been relaunched. For > example, the service AM would want to change the state it has recorded for > the container and retrieve new container status for the container, in case > the container IP has changed. (The NM would also need to remove the IP it has > stored for the container, so container status calls don't return an IP for a > container that is not currently running.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker
[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596345#comment-16596345 ] Jim Brennan commented on YARN-8648: --- Looks like this is ready for review. > Container cgroups are leaked when using docker > -- > > Key: YARN-8648 > URL: https://issues.apache.org/jira/browse/YARN-8648 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > Attachments: YARN-8648.001.patch > > > When you run with docker and enable cgroups for cpu, docker creates cgroups > for all resources on the system, not just for cpu. For instance, if the > {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, > the nodemanager will create a cgroup for each container under > {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path > via the {{--cgroup-parent}} command line argument. Docker then creates a > cgroup for the docker container under that, for instance: > {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. > When the container exits, docker cleans up the {{docker_container_id}} > cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is > good under {{/sys/fs/cgroup/hadoop-yarn}}. > The problem is that docker also creates that same hierarchy under every > resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these > are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, > perf_event, and systemd.So for instance, docker creates > {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but > it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up > the {{container_id}} cgroups for these other resources. On one of our busy > clusters, we found > 100,000 of these leaked cgroups. > I found this in our 2.8-based version of hadoop, but I have been able to > repro with current hadoop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8642) Add support for tmpfs mounts with the Docker runtime
[ https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596307#comment-16596307 ] Shane Kumpf commented on YARN-8642: --- Thanks to [~ccondit-target] for the contribution and [~ebadger] and [~eyang] for the reviews! I committed this to trunk and branch-3.1. > Add support for tmpfs mounts with the Docker runtime > > > Key: YARN-8642 > URL: https://issues.apache.org/jira/browse/YARN-8642 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Shane Kumpf >Assignee: Craig Condit >Priority: Major > Labels: Docker > Fix For: 3.2.0, 3.1.2 > > Attachments: YARN-8642.001.patch, YARN-8642.002.patch > > > Add support to the existing Docker runtime to allow the user to request tmpfs > mounts for their containers. For example: > {code}/usr/bin/docker run --name=container_name --tmpfs /run image > /bootstrap/start-systemd > {code} > One use case is to allow systemd to run as PID 1 in a non-privileged > container, /run is expected to be a tmpfs mount in the container for that to > work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it restarted
[ https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8729: --- Attachment: YARN-8729.001.patch > Node status updater thread could be lost after it restarted > --- > > Key: YARN-8729 > URL: https://issues.apache.org/jira/browse/YARN-8729 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-8729.001.patch > > > Today I found a lost NM whose node status updater thread was not exist after > this thread restarted. In > {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, isStopped > flag is not updated to be false before executing {{statusUpdater.start()}}, > so that if the thread is immediately started and found isStopped==true, it > will exit without any log. > Key codes in > {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}: > {code:java} > statusUpdater.join(); > registerWithRM(); > statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater"); > statusUpdater.start(); > this.isStopped = false; //this line should be moved before > statusUpdater.start(); > LOG.info("NodeStatusUpdater thread is reRegistered and restarted"); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
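For readability, the reordering suggested in the description in one place: reset the flag before starting the new thread so it cannot observe the stale isStopped==true value and silently exit. This is a sketch of the suggested fix applied to the snippet quoted above, not necessarily the exact contents of the attached patch.
{code:java}
// Sketch of NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM
// with the ordering suggested in the description.
statusUpdater.join();
registerWithRM();
statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
this.isStopped = false;   // reset before start() so the new thread keeps running
statusUpdater.start();
LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
{code}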
[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it restarted
[ https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596235#comment-16596235 ] Tao Yang commented on YARN-8729: Attached v1 patch for review. > Node status updater thread could be lost after it restarted > --- > > Key: YARN-8729 > URL: https://issues.apache.org/jira/browse/YARN-8729 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Critical > Attachments: YARN-8729.001.patch > > > Today I found a lost NM whose node status updater thread was not exist after > this thread restarted. In > {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, isStopped > flag is not updated to be false before executing {{statusUpdater.start()}}, > so that if the thread is immediately started and found isStopped==true, it > will exit without any log. > Key codes in > {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}: > {code:java} > statusUpdater.join(); > registerWithRM(); > statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater"); > statusUpdater.start(); > this.isStopped = false; //this line should be moved before > statusUpdater.start(); > LOG.info("NodeStatusUpdater thread is reRegistered and restarted"); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8729) Node status updater thread could be lost after it restarted
Tao Yang created YARN-8729: -- Summary: Node status updater thread could be lost after it restarted Key: YARN-8729 URL: https://issues.apache.org/jira/browse/YARN-8729 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.2.0 Reporter: Tao Yang Assignee: Tao Yang Today I found a lost NM whose node status updater thread no longer existed after it was restarted. In {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped flag is not reset to false before {{statusUpdater.start()}} is executed, so if the new thread starts immediately and finds isStopped==true, it will exit without any log. Key code in {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}: {code:java} statusUpdater.join(); registerWithRM(); statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater"); statusUpdater.start(); this.isStopped = false; //this line should be moved before statusUpdater.start(); LOG.info("NodeStatusUpdater thread is reRegistered and restarted"); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
[ https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596230#comment-16596230 ] Pradeep Ambati commented on YARN-8680: -- Hi [~sunilg], This jira is marked as critical for 3.2. I can definitely take it forward if it is not feasible to complete it in coming weeks. > YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate > - > > Key: YARN-8680 > URL: https://issues.apache.org/jira/browse/YARN-8680 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Pradeep Ambati >Assignee: Pradeep Ambati >Priority: Critical > Attachments: YARN-8680.00.patch, YARN-8680.01.patch > > > Similar to YARN-8242, implement iterable abstraction for > LocalResourceTrackerState to load completed and in progress resources when > needed rather than loading them all at a time for a respective state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8709) intra-queue preemption checker always fail since one under-served queue was deleted
[ https://issues.apache.org/jira/browse/YARN-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596219#comment-16596219 ] Tao Yang commented on YARN-8709: Attached v1 patch for review. [~eepayne], [~sunilg], please help to review in your free time. Thanks! > intra-queue preemption checker always fail since one under-served queue was > deleted > --- > > Key: YARN-8709 > URL: https://issues.apache.org/jira/browse/YARN-8709 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler preemption >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8709.001.patch > > > After some queues deleted, the preemption checker in SchedulingMonitor was > always skipped because of YarnRuntimeException for every run. > Error logs: > {noformat} > ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: > Exception raised while executing preemption checker, skip this run..., > exception= > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't > happen, cannot find TempQueuePerPartition for queueName=1535075839208 > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {noformat} > I think there is something wrong with partitionToUnderServedQueues field in > ProportionalCapacityPreemptionPolicy. Items of partitionToUnderServedQueues > can be add but never be removed, except rebuilding this policy. For example, > once under-served queue "a" is added into this structure, it will always be > there and never be removed, intra-queue preemption checker will try to get > all queues info for partitionToUnderServedQueues in > IntraQueueCandidatesSelector#selectCandidates and will throw > YarnRuntimeException if not found. 
So that after queue "a" is deleted from > queue structure, the preemption checker will always fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
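A minimal sketch of one way to address the leak described above, assuming the under-served queue map can simply be pruned against the queues that still exist before preemption candidates are computed; the attached patch may take a different approach, for example rebuilding the map on every editSchedule run.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class UnderServedQueuesSketch {
  // partition -> names of queues currently considered under-served.
  private final Map<String, Set<String>> partitionToUnderServedQueues =
      new HashMap<>();

  // Drop entries for queues that no longer exist in the scheduler, so a later
  // lookup cannot fail with "cannot find TempQueuePerPartition".
  void pruneDeletedQueues(Set<String> liveQueueNames) {
    for (Set<String> queues : partitionToUnderServedQueues.values()) {
      queues.retainAll(liveQueueNames);
    }
  }
}
{code}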
[jira] [Updated] (YARN-8728) Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue
[ https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8728: --- Attachment: (was: YARN-8728.001.patch) > Wrong available resource in AllocateResponse when allocating containers to > different partitions in the same queue > - > > Key: YARN-8728 > URL: https://issues.apache.org/jira/browse/YARN-8728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > > Recently I found some apps' available resource in AllocateResponse was > changing between two different resources. After check the code, I think > {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in > {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, and this > data should be updated only for default partition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8709) intra-queue preemption checker always fail since one under-served queue was deleted
[ https://issues.apache.org/jira/browse/YARN-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8709: --- Attachment: YARN-8709.001.patch > intra-queue preemption checker always fail since one under-served queue was > deleted > --- > > Key: YARN-8709 > URL: https://issues.apache.org/jira/browse/YARN-8709 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler preemption >Affects Versions: 3.2.0 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8709.001.patch > > > After some queues deleted, the preemption checker in SchedulingMonitor was > always skipped because of YarnRuntimeException for every run. > Error logs: > {noformat} > ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: > Exception raised while executing preemption checker, skip this run..., > exception= > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't > happen, cannot find TempQueuePerPartition for queueName=1535075839208 > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622) > at java.lang.Thread.run(Thread.java:834) > {noformat} > I think there is something wrong with partitionToUnderServedQueues field in > ProportionalCapacityPreemptionPolicy. Items of partitionToUnderServedQueues > can be add but never be removed, except rebuilding this policy. For example, > once under-served queue "a" is added into this structure, it will always be > there and never be removed, intra-queue preemption checker will try to get > all queues info for partitionToUnderServedQueues in > IntraQueueCandidatesSelector#selectCandidates and will throw > YarnRuntimeException if not found. So that after queue "a" is deleted from > queue structure, the preemption checker will always fail. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8728) Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue
[ https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596198#comment-16596198 ] Tao Yang commented on YARN-8728: Attached v1 patch for review. > Wrong available resource in AllocateResponse when allocating containers to > different partitions in the same queue > - > > Key: YARN-8728 > URL: https://issues.apache.org/jira/browse/YARN-8728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8728.001.patch > > > Recently I found some apps' available resource in AllocateResponse was > changing between two different resources. After check the code, I think > {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in > {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, and this > data should be updated only for default partition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8728) Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue
[ https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8728: --- Summary: Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue (was: Wrong available resource in AllocateResponse in queue with multiple partitions) > Wrong available resource in AllocateResponse when allocating containers to > different partitions in the same queue > - > > Key: YARN-8728 > URL: https://issues.apache.org/jira/browse/YARN-8728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8728.001.patch > > > Recently I found some apps' available resource in AllocateResponse was > changing between two different resources. After check the code, I think > {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in > {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, and this > data should be updated only for default partition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8728) Wrong available resource in AllocateResponse in queue with multiple partitions
[ https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-8728: --- Attachment: YARN-8728.001.patch > Wrong available resource in AllocateResponse in queue with multiple partitions > -- > > Key: YARN-8728 > URL: https://issues.apache.org/jira/browse/YARN-8728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8728.001.patch > > > Recently I found that some apps' available resource in AllocateResponse kept > changing between two different values. After checking the code, I think > {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in > {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions; this > data should be updated only for the default partition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8728) Wrong available resource in AllocateResponse in queue with multiple partitions
Tao Yang created YARN-8728: -- Summary: Wrong available resource in AllocateResponse in queue with multiple partitions Key: YARN-8728 URL: https://issues.apache.org/jira/browse/YARN-8728 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: Tao Yang Assignee: Tao Yang Recently I found that some apps' available resource in AllocateResponse kept changing between two different values. After checking the code, I think {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions; this data should be updated only for the default partition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596174#comment-16596174 ] Antal Bálint Steinbach commented on YARN-8468: -- Thank you [~leftnoteasy] for checking the patch. I uploaded a new one. 1) The ApplicationMaster gets the maximum allocation value here just like in the other case, so the AM can take it into account before submitting a request, but I think we still need the normalization/validation of the request in case the application did something wrong. Furthermore, there was already a validation/normalization step before; it just used only the scheduler-level maximums. 2) I reverted the formatting changes. I unintentionally ran autoformatting on SchedulerUtils. Please let me know if I misunderstood something. > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch, YARN-8468.004.patch, > YARN-8468.005.patch, YARN-8468.006.patch, YARN-8468.007.patch, > YARN-8468.008.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: a user has two pools, one for ad hoc jobs and one for enterprise > apps. The user wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. > yarn.scheduler.maximum-allocation-mb then sets the default maximum > container size for all queues, while the per-queue maximum is set with the > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler (see the sketch below). > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
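As a companion to the proposal above, here is a simplified, hypothetical sketch of how a per-queue maximum allocation could be resolved against the scheduler-wide maximum, in the spirit of the proposed getMaximumResourceCapability(String queueName). The class and method names, and the memory-only simplification, are assumptions; this is not the FairScheduler code or the attached patches.
{code:java}
// Hypothetical sketch, memory-only for brevity -- not the actual FairScheduler change.
// A queue-level cap ("maxContainerResources"), when configured, is honored but never
// allowed to exceed the scheduler-wide yarn.scheduler.maximum-allocation-mb value.
import java.util.Map;

public class PerQueueMaxAllocationExample {
  private final long schedulerMaxMb;          // cluster-wide maximum allocation (MB)
  private final Map<String, Long> queueMaxMb; // per-queue overrides (MB), keyed by queue name

  public PerQueueMaxAllocationExample(long schedulerMaxMb, Map<String, Long> queueMaxMb) {
    this.schedulerMaxMb = schedulerMaxMb;
    this.queueMaxMb = queueMaxMb;
  }

  /** Mirrors the proposed getMaximumResourceCapability(String queueName). */
  public long getMaximumAllocationMb(String queueName) {
    Long queueCap = queueMaxMb.get(queueName);
    if (queueCap == null) {
      return schedulerMaxMb;                   // no per-queue setting: fall back to scheduler max
    }
    return Math.min(queueCap, schedulerMaxMb); // queue cap must not exceed the scheduler max
  }

  /** Requests are still normalized/validated against the effective per-queue maximum. */
  public long normalizeRequestMb(String queueName, long requestedMb) {
    return Math.min(requestedMb, getMaximumAllocationMb(queueName));
  }
}
{code}
A real implementation would of course operate on full Resource objects (memory, vcores, and other resource types) rather than a single MB value.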
[jira] [Updated] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antal Bálint Steinbach updated YARN-8468: - Attachment: YARN-8468.008.patch > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch, YARN-8468.004.patch, > YARN-8468.005.patch, YARN-8468.006.patch, YARN-8468.007.patch, > YARN-8468.008.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: a user has two pools, one for ad hoc jobs and one for enterprise > apps. The user wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. > yarn.scheduler.maximum-allocation-mb then sets the default maximum > container size for all queues, while the per-queue maximum is set with the > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8726) [UI2] YARN UI2 is not accessible when config.env file failed to load
[ https://issues.apache.org/jira/browse/YARN-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596104#comment-16596104 ] Sunil Govindan commented on YARN-8726: -- Seems jenkins is down. > [UI2] YARN UI2 is not accessible when config.env file failed to load > > > Key: YARN-8726 > URL: https://issues.apache.org/jira/browse/YARN-8726 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Akhil PB >Assignee: Akhil PB >Priority: Critical > Attachments: YARN-8726.001.patch > > > It is observed that yarn UI2 is not accessible. When UI2 is inspected, it > gives below error > {code:java} > index.html:1 Refused to execute script from > 'http://ctr-e138-1518143905142-456429-01-05.hwx.site:8088/ui2/config/configs.env' > because its MIME type ('text/plain') is not executable, and strict MIME type > checking is enabled. > yarn-ui.js:219 base url: > vendor.js:1978 ReferenceError: ENV is not defined > at updateConfigs (yarn-ui.js:212) > at Object.initialize (yarn-ui.js:218) > at vendor.js:824 > at vendor.js:825 > at visit (vendor.js:3025) > at Object.visit [as default] (vendor.js:3024) > at DAG.topsort (vendor.js:750) > at Class._runInitializer (vendor.js:825) > at Class.runInitializers (vendor.js:824) > at Class._bootSync (vendor.js:823) > onerrorDefault @ vendor.js:1978 > trigger @ vendor.js:2967 > (anonymous) @ vendor.js:3006 > invoke @ vendor.js:626 > flush @ vendor.js:629 > flush @ vendor.js:619 > end @ vendor.js:642 > run @ vendor.js:648 > join @ vendor.js:648 > run.join @ vendor.js:1510 > (anonymous) @ vendor.js:1512 > fire @ vendor.js:230 > fireWith @ vendor.js:235 > ready @ vendor.js:242 > completed @ vendor.js:242 > vendor.js:823 Uncaught ReferenceError: ENV is not defined > at updateConfigs (yarn-ui.js:212) > at Object.initialize (yarn-ui.js:218) > at vendor.js:824 > at vendor.js:825 > at visit (vendor.js:3025) > at Object.visit [as default] (vendor.js:3024) > at DAG.topsort (vendor.js:750) > at Class._runInitializer (vendor.js:825) > at Class.runInitializers (vendor.js:824) > at Class._bootSync (vendor.js:823) > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan resolved YARN-8220. -- Resolution: Done With Submarine, we have a better implementation for this. Hence let us close this and migrate the enhancements to Submarine (YARN-8135). > Running Tensorflow on YARN with GPU and Docker - Examples > - > > Key: YARN-8220 > URL: https://issues.apache.org/jira/browse/YARN-8220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Critical > Attachments: YARN-8220.001.patch, YARN-8220.002.patch, > YARN-8220.003.patch, YARN-8220.004.patch > > > Tensorflow could be run on YARN and could leverage YARN's distributed > features. > This spec file will help to run Tensorflow on YARN with GPU/Docker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-8220: - Target Version/s: (was: 3.2.0) > Running Tensorflow on YARN with GPU and Docker - Examples > - > > Key: YARN-8220 > URL: https://issues.apache.org/jira/browse/YARN-8220 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Critical > Attachments: YARN-8220.001.patch, YARN-8220.002.patch, > YARN-8220.003.patch, YARN-8220.004.patch > > > Tensorflow could be run on YARN and could leverage YARN's distributed > features. > This spec file will help to run Tensorflow on YARN with GPU/Docker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8727) NPE in RouterRMAdminService while stopping service.
[ https://issues.apache.org/jira/browse/YARN-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596039#comment-16596039 ] Y. SREENIVASULU REDDY commented on YARN-8727: - I have attached the patch; please review. > NPE in RouterRMAdminService while stopping service. > --- > > Key: YARN-8727 > URL: https://issues.apache.org/jira/browse/YARN-8727 > Project: Hadoop YARN > Issue Type: Bug > Components: federation, router >Affects Versions: 3.1.1 >Reporter: Y. SREENIVASULU REDDY >Assignee: Y. SREENIVASULU REDDY >Priority: Major > Attachments: YARN-8727.001.patch > > > While stopping the service, the router throws an NPE: > {noformat} > 2018-08-23 22:52:00,596 INFO org.apache.hadoop.service.AbstractService: > Service org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService > failed in state STOPPED > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService.serviceStop(RouterRMAdminService.java:143) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at > org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158) > at > org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132) > at > org.apache.hadoop.yarn.server.router.Router.serviceStop(Router.java:128) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:202) > at org.apache.hadoop.yarn.server.router.Router.main(Router.java:182) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
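The stack trace above shows serviceStop() dereferencing something that was never initialized because startup failed partway. Below is a minimal sketch of the defensive pattern that addresses this kind of NPE; the field and class names are placeholders, and the actual fix in YARN-8727.001.patch may differ.
{code:java}
// Hypothetical, simplified sketch -- not the actual RouterRMAdminService code.
// stop() must tolerate a partially started service: any field created during
// serviceStart() may still be null if startup failed before reaching it.
public class AdminServiceStopExample {
  private AutoCloseable rpcServer; // stands in for the RPC server field; may be null

  protected void serviceStop() throws Exception {
    if (rpcServer != null) {   // guard against startup having failed before creating it
      rpcServer.close();
      rpcServer = null;
    }
    // ... any remaining cleanup should be equally tolerant of a half-initialized state ...
  }
}
{code}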
[jira] [Updated] (YARN-8727) NPE in RouterRMAdminService while stopping service.
[ https://issues.apache.org/jira/browse/YARN-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Y. SREENIVASULU REDDY updated YARN-8727: Attachment: YARN-8727.001.patch > NPE in RouterRMAdminService while stopping service. > --- > > Key: YARN-8727 > URL: https://issues.apache.org/jira/browse/YARN-8727 > Project: Hadoop YARN > Issue Type: Bug > Components: federation, router >Affects Versions: 3.1.1 >Reporter: Y. SREENIVASULU REDDY >Assignee: Y. SREENIVASULU REDDY >Priority: Major > Attachments: YARN-8727.001.patch > > > While stopping the service, the router throws an NPE: > {noformat} > 2018-08-23 22:52:00,596 INFO org.apache.hadoop.service.AbstractService: > Service org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService > failed in state STOPPED > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService.serviceStop(RouterRMAdminService.java:143) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at > org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158) > at > org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132) > at > org.apache.hadoop.yarn.server.router.Router.serviceStop(Router.java:128) > at > org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:202) > at org.apache.hadoop.yarn.server.router.Router.main(Router.java:182) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application
[ https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596014#comment-16596014 ] Wangda Tan commented on YARN-8569: -- [~eyang], {quote}Unless malicious user already hacked into yarn user account and populate data as yarn user, there is no easy parameter hacking to container-executor to trigger exploits. {quote} There have been tons of debates before about whether the yarn user should be treated as root or not. We saw issues where c-e allowed the yarn user to manipulate other users' directories, or to escalate directly to the root user. All of these issues became CVEs. {quote}This is the reason that this solution is invented to lower the bar of writing clustering software for Hadoop. {quote} It would help if you could share some real-world examples. From YARN's design point of view, ideally all NM/RM logic should be as general as possible, and all service-related concerns should be handled by the service framework (the API server or ServiceMaster). I really don't like the idea of adding a service-specific API to the NM API. If you do think updating the service spec json file is important, another approach could be: 1) ServiceMaster mounts a local directory (under the container's local dir) when launching the docker container (for example: ./service-info -> /service/sys/fs/); 2) ServiceMaster requests to re-localize the new service spec json file into the ./service-info folder. > Create an interface to provide cluster information to application > - > > Key: YARN-8569 > URL: https://issues.apache.org/jira/browse/YARN-8569 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8569.001.patch, YARN-8569.002.patch > > > Some programs require container hostnames to be known for the application to run. > For example, distributed tensorflow requires a launch_command that looks like: > {code} > # On ps0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=0 > # On ps1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=ps --task_index=1 > # On worker0.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=0 > # On worker1.example.com: > $ python trainer.py \ > --ps_hosts=ps0.example.com:,ps1.example.com: \ > --worker_hosts=worker0.example.com:,worker1.example.com: \ > --job_name=worker --task_index=1 > {code} > This is a bit cumbersome to orchestrate via Distributed Shell or the YARN > services launch_command. In addition, the dynamic parameters do not work > with the YARN flex command. This is the classic pain point for application > developers attempting to automate system environment settings as parameters to the > end-user application. > It would be great if the YARN Docker integration could provide a simple option to > expose the hostnames of the yarn service via a mounted file. The file content > gets updated when a flex command is performed. This allows application > developers to consume system environment settings via a standard interface. > It is like /proc/devices for Linux, but for Hadoop. This may involve > updating a file in the distributed cache and allowing the file to be mounted via > container-executor. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
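To ground the YARN-8569 discussion above from the application's side, here is a hypothetical sketch of a container process reading a mounted cluster-information file and deriving worker hostnames from it. The mount path (/service/sys/fs/workers.list) and the newline-separated file format are assumptions made for illustration, not an existing YARN interface.
{code:java}
// Hypothetical illustration only -- the mount point and file format are assumptions,
// not an existing YARN API. The idea: the ServiceMaster (re-)localizes current cluster
// information into a directory that is bind-mounted into the container, and the
// application simply re-reads the file whenever it needs up-to-date membership.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class MountedClusterInfoExample {
  public static void main(String[] args) throws IOException {
    // Assumed mount location inside the container; one "host:port" entry per line.
    Path infoFile = Paths.get("/service/sys/fs/workers.list");
    List<String> workers = Files.readAllLines(infoFile);

    // A real job would feed these into its own launch arguments, e.g. the
    // --worker_hosts flag shown in the distributed tensorflow example above.
    String workerHosts = String.join(",", workers);
    System.out.println("--worker_hosts=" + workerHosts);
  }
}
{code}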