[jira] [Commented] (YARN-8709) intra-queue preemption checker always fail since one under-served queue was deleted

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597101#comment-16597101
 ] 

Weiwei Yang commented on YARN-8709:
---

LGTM, changing to PA

> intra-queue preemption checker always fail since one under-served queue was 
> deleted
> ---
>
> Key: YARN-8709
> URL: https://issues.apache.org/jira/browse/YARN-8709
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler preemption
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8709.001.patch
>
>
> After some queues were deleted, the preemption checker in SchedulingMonitor was 
> always skipped because a YarnRuntimeException was thrown on every run.
> Error logs:
> {noformat}
> ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: 
> Exception raised while executing preemption checker, skip this run..., 
> exception=
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't 
> happen, cannot find TempQueuePerPartition for queueName=1535075839208
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {noformat}
> I think there is something wrong with the partitionToUnderServedQueues field in 
> ProportionalCapacityPreemptionPolicy. Items of partitionToUnderServedQueues 
> can be added but are never removed unless the policy is rebuilt. For example, 
> once under-served queue "a" is added into this structure, it will always be 
> there and never be removed. The intra-queue preemption checker will try to get 
> the queue info for every entry of partitionToUnderServedQueues in 
> IntraQueueCandidatesSelector#selectCandidates and will throw a 
> YarnRuntimeException if one is not found. So after queue "a" is deleted from 
> the queue structure, the preemption checker will always fail.
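To make the stale-entry problem concrete, here is a minimal, self-contained sketch 
(plain Java collections, not the actual ProportionalCapacityPreemptionPolicy code 
or the YARN-8709 patch) of entries that are added but never removed, plus one 
possible pruning guard:

{noformat}
import java.util.*;

// Minimal sketch (not the YARN-8709 patch): entries added to an "under-served
// queues" map but never removed eventually reference queues that no longer
// exist, and a lookup by queue name then fails on every run.
public class UnderServedQueuesSketch {
  // partition -> names of under-served queues (grows, never shrinks today)
  private final Map<String, Set<String>> partitionToUnderServedQueues = new HashMap<>();

  void markUnderServed(String partition, String queueName) {
    partitionToUnderServedQueues
        .computeIfAbsent(partition, p -> new HashSet<>())
        .add(queueName);
  }

  // One possible guard: drop queue names that are no longer part of the
  // current queue hierarchy before the intra-queue checker iterates them.
  void pruneDeletedQueues(Set<String> currentQueueNames) {
    for (Set<String> queues : partitionToUnderServedQueues.values()) {
      queues.retainAll(currentQueueNames);
    }
  }

  public static void main(String[] args) {
    UnderServedQueuesSketch s = new UnderServedQueuesSketch();
    s.markUnderServed("", "a");
    // ... queue "a" is deleted from the scheduler configuration ...
    s.pruneDeletedQueues(new HashSet<>(Arrays.asList("root", "default")));
    System.out.println(s.partitionToUnderServedQueues); // no stale names left to look up
  }
}
{noformat}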



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597083#comment-16597083
 ] 

Weiwei Yang edited comment on YARN-8102 at 8/30/18 5:54 AM:


I agree with [~Naganarasimha] about simplifying the configs. Can we use
{noformat}
file://${hadoop.tmp.dir}/yarn/system/node-attributes
file://${hadoop.tmp.dir}/yarn/system/node-labels
{noformat}
as the default dirs? But if we change the default path for node-labels, that 
would be an incompatible change, right? Maybe we can at least get #2 done?


was (Author: cheersyang):
I agree with [~Naganarasimha] about simplifying the configs, can we use

{noformat}
file://${hadoop.tmp.dir}/yarn/system/node-attributes
file://${hadoop.tmp.dir}/yarn/system/node-labels
{noformat}

as the default dir? But if we change the default path for node-labels, that 
will be an incompatible change right? Maybe at least we can have #2 done?

> Retrospect on having enable and disable flag for Node Attribute
> ---
>
> Key: YARN-8102
> URL: https://issues.apache.org/jira/browse/YARN-8102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
>
> Currently the node attribute feature is enabled by default. We have to 
> revisit this.
> Enabling it by default means a store will be created for every cluster 
> installation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597083#comment-16597083
 ] 

Weiwei Yang edited comment on YARN-8102 at 8/30/18 5:54 AM:


I agree with [~Naganarasimha] about simplifying the configs, can we use
 * file://${hadoop.tmp.dir}/yarn/system/node-attributes
 * file://${hadoop.tmp.dir}/yarn/system/node-labels

as the default dir? But if we change the default path for node-labels, that 
will be an incompatible change right? Maybe at least we can have #2 done?


was (Author: cheersyang):
I agree with [~Naganarasimha] about simplifying the configs, can we use
 * file://${hadoop.tmp.dir}/yarn/system/node-attributes
 * file://${hadoop.tmp.dir}/yarn/system/node-labels

as the default dir? But if we change the default path for node-labels, that 
will be an incompatible change right? Maybe at least we can have #2 done?

> Retrospect on having enable and disable flag for Node Attribute
> ---
>
> Key: YARN-8102
> URL: https://issues.apache.org/jira/browse/YARN-8102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
>
> Currently the node attribute feature is enabled by default. We have to 
> revisit this.
> Enabling it by default means a store will be created for every cluster 
> installation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597083#comment-16597083
 ] 

Weiwei Yang edited comment on YARN-8102 at 8/30/18 5:54 AM:


I agree with [~Naganarasimha] about simplifying the configs, can we use

{noformat}
file://${hadoop.tmp.dir}/yarn/system/node-attributes
file://${hadoop.tmp.dir}/yarn/system/node-labels
{noformat}

as the default dir? But if we change the default path for node-labels, that 
will be an incompatible change right? Maybe at least we can have #2 done?


was (Author: cheersyang):
I agree with [~Naganarasimha] about simplifying the configs, can we use
 * file://${hadoop.tmp.dir}/yarn/system/node-attributes
 * file://${hadoop.tmp.dir}/yarn/system/node-labels

as the default dir? But if we change the default path for node-labels, that 
will be an incompatible change right? Maybe at least we can have #2 done?

> Retrospect on having enable and disable flag for Node Attribute
> ---
>
> Key: YARN-8102
> URL: https://issues.apache.org/jira/browse/YARN-8102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
>
> Currently the node attribute feature is enabled by default. We have to 
> revisit this.
> Enabling it by default means a store will be created for every cluster 
> installation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597083#comment-16597083
 ] 

Weiwei Yang commented on YARN-8102:
---

I agree with [~Naganarasimha] about simplifying the configs, can we use
 * file://${hadoop.tmp.dir}/yarn/system/node-attributes
 * file://${hadoop.tmp.dir}/yarn/system/node-labels

as the default dir? But if we change the default path for node-labels, that 
will be an incompatible change right? Maybe at least we can have #2 done?
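For illustration only, a minimal sketch of what such defaults could look like when 
set programmatically. The node-labels key below is the existing 
yarn.node-labels.fs-store.root-dir; the node-attributes key is an assumption for 
this sketch and may not match the final property name:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Rough sketch of the proposed defaults. "yarn.node-labels.fs-store.root-dir"
// is the existing node-labels store property; the node-attributes key below is
// assumed for illustration only and may not be the final property name.
public class DefaultStoreDirsSketch {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    String tmp = conf.get("hadoop.tmp.dir", "/tmp/hadoop");
    conf.setIfUnset("yarn.node-labels.fs-store.root-dir",
        "file://" + tmp + "/yarn/system/node-labels");
    // hypothetical key, for illustration only:
    conf.setIfUnset("yarn.node-attribute.fs-store.root-dir",
        "file://" + tmp + "/yarn/system/node-attributes");
    System.out.println(conf.get("yarn.node-labels.fs-store.root-dir"));
  }
}
{noformat}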

> Retrospect on having enable and disable flag for Node Attribute
> ---
>
> Key: YARN-8102
> URL: https://issues.apache.org/jira/browse/YARN-8102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
>
> Currently the node attribute feature is enabled by default. We have to 
> revisit this.
> Enabling it by default means a store will be created for every cluster 
> installation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8666) [UI2] Remove application tab from Yarn Queue Page

2018-08-29 Thread Akhil PB (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akhil PB updated YARN-8666:
---
Summary: [UI2] Remove application tab from Yarn Queue Page  (was: Remove 
application tab from Yarn Queue Page)

> [UI2] Remove application tab from Yarn Queue Page
> -
>
> Key: YARN-8666
> URL: https://issues.apache.org/jira/browse/YARN-8666
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Yesha Vora
>Priority: Major
> Attachments: Screen Shot 2018-08-14 at 3.43.18 PM.png
>
>
> The YARN UI2 Queue page shows an Application button. This button does not 
> redirect to any other page. In addition, the running-applications table is 
> already available on the same page. 
> Thus, there is no need to have an Application button on the Queue page. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8112) Fix min cardinality check for same source and target tags in intra-app constraints

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8112:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Fix min cardinality check for same source and target tags in intra-app 
> constraints
> --
>
> Key: YARN-8112
> URL: https://issues.apache.org/jira/browse/YARN-8112
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Weiwei Yang
>Assignee: Konstantinos Karanasos
>Priority: Major
>
> The min cardinality constraint (min cardinality = _k_) ensures that a 
> container is placed at a node that has already k occurrences of the target 
> tag. For example, a constraint _zk=3,CARDINALITY,NODE,hb,2,10_ will place 
> each of the three zk containers on a node with at least 2 hb instances (and 
> no more than 10 for the max cardinality).
> Affinity constraints are a special case of this, where min cardinality is 1.
> Currently we do not support min cardinality when the source and the target of 
> the constraint are the same in an intra-app constraint.
> Therefore, zk=3,CARDINALITY,NODE,zk,2,10 is not supported, and neither is 
> zk=3,IN,NODE,zk.
> This Jira will address this problem by placing the first k containers on the 
> same node (or any other specified scope, e.g., rack), so that min cardinality 
> can be met when placing the subsequent containers with the same tag.
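For reference, a rough sketch of the constraints above expressed with the 
PlacementConstraints helpers, assuming the cardinality/build methods available in 
the 3.2 API (this is a sketch, not the patch for this JIRA):

{noformat}
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.build;
import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.cardinality;

import org.apache.hadoop.yarn.api.resource.PlacementConstraint;

// Sketch only: the constraints from the description, written with the
// PlacementConstraints helpers. "hb"/"zk" are allocation tags; 2 and 10 are
// the min/max cardinalities per node.
public class CardinalitySketch {
  public static void main(String[] args) {
    // zk=3,CARDINALITY,NODE,hb,2,10 : place each zk container on a node
    // that has at least 2 and at most 10 "hb" containers.
    PlacementConstraint zkOnHbNodes = build(cardinality(NODE, 2, 10, "hb"));

    // The currently unsupported case: source and target tags are the same.
    // zk=3,CARDINALITY,NODE,zk,2,10
    PlacementConstraint zkSelfCardinality = build(cardinality(NODE, 2, 10, "zk"));

    System.out.println(zkOnHbNodes);
    System.out.println(zkSelfCardinality);
  }
}
{noformat}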



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7858) Support special Node Attribute scopes in addition to NODE and RACK

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YARN-7858.
---
Resolution: Duplicate

> Support special Node Attribute scopes in addition to NODE and RACK
> --
>
> Key: YARN-7858
> URL: https://issues.apache.org/jira/browse/YARN-7858
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun Suresh
>Assignee: Weiwei Yang
>Priority: Major
>
> Currently, we have only two scopes defined, NODE and RACK, against which we 
> check the cardinality of the placement.
> This idea should be extended to support node-attribute scopes, for example, 
> placement of containers across *upgrade domains* and *failure domains*. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8555) Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC handler options

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8555:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC 
> handler options 
> --
>
> Key: YARN-8555
> URL: https://issues.apache.org/jira/browse/YARN-8555
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Priority: Minor
>
> The current test cases in these 2 classes only target one handler type, 
> {{scheduler}} or {{processor}}. Once YARN-8015 is done, we should make them 
> parameterized in order to cover both cases. 
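A minimal, generic JUnit 4 sketch of the kind of parameterization being suggested; 
the class name and handler wiring are illustrative, not the actual 
TestSchedulingRequestContainerAllocation changes:

{noformat}
import java.util.Arrays;
import java.util.Collection;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

// Generic JUnit 4 pattern only; the real test classes may wire the handler
// differently. The two values mirror the placement-constraint handler options
// ("scheduler" vs the placement processor) referred to in the description.
@RunWith(Parameterized.class)
public class TestWithBothPcHandlers {
  private final String handler;

  public TestWithBothPcHandlers(String handler) {
    this.handler = handler;
  }

  @Parameters(name = "handler={0}")
  public static Collection<Object[]> handlers() {
    return Arrays.asList(new Object[][]{{"scheduler"}, {"placement-processor"}});
  }

  @Test
  public void testAllocationWithSchedulingRequest() {
    // configure the mock RM with the chosen handler, then run the shared assertions
    System.out.println("running with handler=" + handler);
  }
}
{noformat}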



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8555) Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC handler options

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8555:
--
Labels:   (was: newbie)

> Parameterize TestSchedulingRequestContainerAllocation(Async) to cover both PC 
> handler options 
> --
>
> Key: YARN-8555
> URL: https://issues.apache.org/jira/browse/YARN-8555
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Priority: Minor
>
> The current test cases in these 2 classes only target one handler type, 
> {{scheduler}} or {{processor}}. Once YARN-8015 is done, we should make them 
> parameterized in order to cover both cases. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6621) Validate Placement Constraints

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-6621:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Validate Placement Constraints
> --
>
> Key: YARN-6621
> URL: https://issues.apache.org/jira/browse/YARN-6621
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Konstantinos Karanasos
>Assignee: Konstantinos Karanasos
>Priority: Major
>
> This library will be used to validate placement constraints.
> It can serve multiple validation purposes:
> 1) Check if the placement constraint has a valid form (e.g., a cardinality 
> constraint should not have an associated target expression, a DELAYED_OR 
> compound expression should only appear in specific places in a constraint 
> tree, etc.)
> 2) Check if the constraints given by a user are conflicting (e.g., 
> cardinality more than 5 in a host and less than 3 in a rack).
> 3) Check that the constraints are properly added in the Placement Constraint 
> Manager.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7800) Bind node constraint once a container is proposed to be placed on this node

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7800:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Bind node constraint once a container is proposed to be placed on this node
> ---
>
> Key: YARN-7800
> URL: https://issues.apache.org/jira/browse/YARN-7800
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: RM
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: bind_node_constraint.pdf
>
>
> We found that when there are circular dependencies between multiple scheduling 
> requests, allocation decisions made by the placement constraint algorithm might 
> conflict with the related tags. For more details on the issue, please refer 
> to YARN-7783.
> To solve this issue, a possible solution is to bind a *node constraint*. When the 
> algorithm wants to place any new container on a node, in addition to checking 
> whether it satisfies the placement constraint, it also checks whether it 
> satisfies the node constraint. For example:
> 1) "foo", anti-affinity with "foo"
> +Implies node constraint:+ each node cannot have more than 1 "foo" tag
> 2) "bar", anti-affinity with "foo"
> +Implies node constraint:+ each node cannot have both "bar" and "foo" 
> tags
> With such constraints, it works like this:
>  * req2 is placed on any node, e.g. n2, +and a node constraint [1] is added to 
> n2 stating that this node cannot have both "bar" and "foo" tags+
>  * when the algorithm wants to place req1 on n2, it checks whether its placement 
> constraint is satisfied. It is, as there is no foo container on this 
> node yet.
>  * Then the algorithm checks whether the node constraint is satisfied. It is not, 
> because it violates node constraint [1].
> This avoids the additional re-attempts done in YARN-7783.
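A self-contained toy sketch of the proposed idea (plain Java, not YARN scheduler 
code): bind a node constraint when a container is proposed on a node and re-check 
it for every later proposal on that node:

{noformat}
import java.util.*;
import java.util.function.Predicate;

// Toy sketch of the proposal, not the actual algorithm implementation.
public class NodeConstraintSketch {
  // node -> constraints bound by earlier placements, over the node's tags
  private final Map<String, List<Predicate<List<String>>>> nodeConstraints = new HashMap<>();
  // node -> tags of containers already proposed on it
  private final Map<String, List<String>> nodeTags = new HashMap<>();

  boolean tryPlace(String node, String tag, Predicate<List<String>> impliedNodeConstraint) {
    List<String> tags = nodeTags.computeIfAbsent(node, n -> new ArrayList<>());
    List<String> after = new ArrayList<>(tags);
    after.add(tag);
    for (Predicate<List<String>> c : nodeConstraints.getOrDefault(node, Collections.emptyList())) {
      if (!c.test(after)) {
        return false;   // violates a node constraint bound by an earlier placement
      }
    }
    tags.add(tag);
    nodeConstraints.computeIfAbsent(node, n -> new ArrayList<>()).add(impliedNodeConstraint);
    return true;
  }

  public static void main(String[] args) {
    NodeConstraintSketch s = new NodeConstraintSketch();
    // req2 ("bar" anti-affine with "foo") lands on n2 and binds: n2 may never hold both tags
    System.out.println(s.tryPlace("n2", "bar",
        t -> !(t.contains("bar") && t.contains("foo"))));        // true
    // req1 ("foo" anti-affine with "foo") is then rejected on n2 by req2's bound constraint
    System.out.println(s.tryPlace("n2", "foo",
        t -> Collections.frequency(t, "foo") <= 1));             // false
  }
}
{noformat}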



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7752) Handle AllocationTags for Opportunistic containers.

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7752:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Handle AllocationTags for Opportunistic containers.
> ---
>
> Key: YARN-7752
> URL: https://issues.apache.org/jira/browse/YARN-7752
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun Suresh
>Priority: Major
>
> JIRA to track how opportunistic containers are handled w.r.t 
> AllocationTagsManager creation and removal of tags.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7819) Allow PlacementProcessor to be used with the FairScheduler

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7819:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Allow PlacementProcessor to be used with the FairScheduler
> --
>
> Key: YARN-7819
> URL: https://issues.apache.org/jira/browse/YARN-7819
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>Priority: Major
> Attachments: YARN-7819-YARN-6592.001.patch, 
> YARN-7819-YARN-7812.001.patch, YARN-7819.002.patch, YARN-7819.003.patch, 
> YARN-7819.004.patch
>
>
> The FairScheduler needs to implement the 
> {{ResourceScheduler#attemptAllocationOnNode}} function so that the 
> PlacementProcessor can support it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7746) Fix PlacementProcessor to support app priority

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7746:
--
Parent Issue: YARN-8731  (was: YARN-7812)

> Fix PlacementProcessor to support app priority
> --
>
> Key: YARN-7746
> URL: https://issues.apache.org/jira/browse/YARN-7746
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun Suresh
>Assignee: Arun Suresh
>Priority: Major
> Attachments: YARN-7746.001.patch
>
>
> The Threadpools used in the Processor should be modified to take a priority 
> blocking queue that respects application priority.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7812) Improvements to Rich Placement Constraints in YARN

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7812:
--
Issue Type: New Feature  (was: Improvement)

> Improvements to Rich Placement Constraints in YARN
> --
>
> Key: YARN-7812
> URL: https://issues.apache.org/jira/browse/YARN-7812
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Arun Suresh
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella tracks the efforts to support the following features
> # Inter-app placement constraints
> # Composite placement constraints, such as AND/OR expressions
> # Support placement constraints in Capacity Scheduler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7812) Improvements to Rich Placement Constraints in YARN

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597066#comment-16597066
 ] 

Weiwei Yang edited comment on YARN-7812 at 8/30/18 5:10 AM:


Thanks [~asuresh], [~kkaranasos], [~leftnoteasy], [~sunilg] for completing this 
feature. Since the main functionality is done, I think we can close this umbrella 
and set the fix version to 3.2.0. There are some remaining enhancements; let's 
use YARN-8731 to track them.

Thanks!


was (Author: cheersyang):
Thanks [~asuresh], [~kkaranasos], [~leftnoteasy] for completing this feature. 
Since the main functionality is done, I think we can close this umbrella and set 
the fix version to 3.2.0. There are some remaining enhancements; let's use 
YARN-8731 to track them.

Thanks!

> Improvements to Rich Placement Constraints in YARN
> --
>
> Key: YARN-7812
> URL: https://issues.apache.org/jira/browse/YARN-7812
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella tracks the efforts to support the following features
> # Inter-app placement constraints
> # Composite placement constraints, such as AND/OR expressions
> # Support placement constraints in Capacity Scheduler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7812) Improvements to Rich Placement Constraints in YARN

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved YARN-7812.
---
   Resolution: Fixed
Fix Version/s: 3.2.0

Thanks [~asuresh], [~kkaranasos], [~leftnoteasy] for completing this feature. 
Since the main functionality is done, I think we can close this umbrella and set 
the fix version to 3.2.0. There are some remaining enhancements; let's use 
YARN-8731 to track them.

Thanks!

> Improvements to Rich Placement Constraints in YARN
> --
>
> Key: YARN-7812
> URL: https://issues.apache.org/jira/browse/YARN-7812
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Priority: Major
> Fix For: 3.2.0
>
>
> This umbrella tracks the efforts to support the following features
> # Inter-app placement constraints
> # Composite placement constraints, such as AND/OR expressions
> # Support placement constraints in Capacity Scheduler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7812) Improvements to Rich Placement Constraints in YARN

2018-08-29 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-7812:
--
Description: 
This umbrella tracks the efforts to support the following features
# Inter-app placement constraints
# Composite placement constraints, such as AND/OR expressions
# Support placement constraints in Capacity Scheduler

> Improvements to Rich Placement Constraints in YARN
> --
>
> Key: YARN-7812
> URL: https://issues.apache.org/jira/browse/YARN-7812
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Arun Suresh
>Priority: Major
>
> This umbrella tracks the efforts to support the following features
> # Inter-app placement constraints
> # Composite placement constraints, such as AND/OR expressions
> # Support placement constraints in Capacity Scheduler



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8731) Rich Placement constraints optimization and enhancements

2018-08-29 Thread Weiwei Yang (JIRA)
Weiwei Yang created YARN-8731:
-

 Summary: Rich Placement constraints optimization and enhancements
 Key: YARN-8731
 URL: https://issues.apache.org/jira/browse/YARN-8731
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Weiwei Yang


We have supported the main functionality of rich placement constraints in v3.2.0. 
This umbrella is opened to track the remaining placement constraint 
optimizations and enhancements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7865) Node attributes documentation

2018-08-29 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597060#comment-16597060
 ] 

Weiwei Yang commented on YARN-7865:
---

Hi [~Naganarasimha]

Thanks for working on the documentation, it looks really good. I just went over 
the doc and have the following comments/suggestions:

(1) *Distributed node-to-Attributes mapping*

There are two rows describing 
{{yarn.nodemanager.node-attributes.provider.fetch-timeout-ms}}; the first one 
should be replaced by 
{{yarn.nodemanager.node-attributes.provider.fetch-interval-ms}}, correct? Its 
description seems to describe the interval, not the timeout.

(2) *Specifying node attributes for application*

It would be nice to add some explanation right after the Java code sample, 
something like: 

{noformat}
The above SchedulingRequest requests 1 container on nodes that must satisfy the 
following constraints:
1) Node attribute rm.yarn.io/python doesn't exist on the node, or it exists but 
its value is not equal to 3
2) Node attribute rm.yarn.io/java must exist on the node and its value is equal 
to 1.8
{noformat}

BTW, can we rename the variable {{schedulingRequest1}} to just 
{{schedulingRequest}} in this example?
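For context, a rough sketch of the kind of SchedulingRequest the sample describes; 
the builder and PlacementConstraints/NodeAttributeOpCode method names are assumed 
from the 3.2 node-attribute APIs and may differ slightly from the sample in the 
patch:

{noformat}
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.NodeAttributeOpCode;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets;

public class NodeAttributeRequestSketch {
  // Rough sketch only; method names are assumed from the 3.2 placement-constraint
  // and node-attribute APIs and may not match the documentation sample exactly.
  public static SchedulingRequest buildRequest() {
    return SchedulingRequest.newBuilder()
        .allocationRequestId(1L)
        .priority(Priority.newInstance(1))
        .executionType(ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED))
        .placementConstraintExpression(
            PlacementConstraints.and(
                // python must not exist, or exist with a value other than 3
                PlacementConstraints.targetNodeAttribute(PlacementConstraints.NODE,
                    NodeAttributeOpCode.NE, PlacementTargets.nodeAttribute("python", "3")),
                // java must exist with value 1.8
                PlacementConstraints.targetNodeAttribute(PlacementConstraints.NODE,
                    NodeAttributeOpCode.EQ, PlacementTargets.nodeAttribute("java", "1.8")))
                .build())
        .resourceSizing(ResourceSizing.newInstance(1, Resource.newInstance(1024, 1)))
        .build();
  }
}
{noformat}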

(3) It would also be good if we mention that node attribute constraints are HARD 
limits, something like

{noformat}
Node attribute constraints are hard limits, which means an allocation can only 
be made if the node satisfies the node attribute constraint. In other words, 
the request keeps pending until it finds a valid node satisfying the 
constraint. There is no relax policy at present.
{noformat}

this can be added to the features section, I guess. Feel free to rephrase if you 
like.

Thanks

> Node attributes documentation
> -
>
> Key: YARN-7865
> URL: https://issues.apache.org/jira/browse/YARN-7865
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, 
> YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch, 
> YARN-7865-YARN-3409.004.patch
>
>
> We need proper docs to introduce how to enable node-attributes, how to 
> configure providers, how to specify script paths and arguments in the 
> configuration, what the proper permissions of the script should be, and who 
> will run the script. It would also be good to add more info to the 
> descriptions of the configuration properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4879) Enhance Allocate Protocol to Identify Requests Explicitly

2018-08-29 Thread qiuliang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597011#comment-16597011
 ] 

qiuliang commented on YARN-4879:


Thanks for putting the doc together with all the details. I have a question: 
why is the number for Rack2 in Req2 equal to 3? Node1 and Node4 are on Rack1 and 
Node5 is on Rack2, so the number for Rack1 should be 4 and for Rack2 should be 2. 
Where am I wrong?

> Enhance Allocate Protocol to Identify Requests Explicitly
> -
>
> Key: YARN-4879
> URL: https://issues.apache.org/jira/browse/YARN-4879
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: applications, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: SimpleAllocateProtocolProposal-v1.pdf, 
> SimpleAllocateProtocolProposal-v2.pdf
>
>
> For legacy reasons, the current allocate protocol expects expanded requests 
> which represent the cumulative request for any change in resource 
> constraints. This is not only very difficult to comprehend but makes it 
> impossible for the scheduler to associate container allocations to the 
> original requests. This problem is amplified by the fact that the expansion 
> is managed by the AMRMClient which makes it cumbersome for non-Java clients 
> as they all have to replicate the non-trivial logic. In this JIRA, we are 
> proposing enhancement to the Allocate Protocol to allow AMs to identify 
> requests explicitly.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate

2018-08-29 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596999#comment-16596999
 ] 

Sunil Govindan commented on YARN-8680:
--

Thanks [~pradeepambati].
I'll also try to help you review this. 

cc [~cheersyang], could you also please take a look.

> YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
> -
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch
>
>
> Similar to YARN-8242, implement an iterable abstraction for 
> LocalResourceTrackerState to load completed and in-progress resources when 
> needed, rather than loading them all at once for a given state.
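A generic sketch of the iterable abstraction being asked for (illustrative names, 
not the NM state-store classes): decode one record per next() call instead of 
materializing the whole list up front:

{noformat}
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Generic sketch of the idea (illustrative names, not the NM state-store code):
// wrap a raw key/value cursor and decode one record per next() call, so callers
// never hold the whole tracker state in memory at once.
public class LazyStateIterable<R> implements Iterable<R> {
  private final Iterator<byte[]> rawEntries;  // e.g. a leveldb cursor in the real store
  private final Function<byte[], R> decoder;  // e.g. a protobuf parser

  public LazyStateIterable(Iterator<byte[]> rawEntries, Function<byte[], R> decoder) {
    this.rawEntries = rawEntries;
    this.decoder = decoder;
  }

  @Override
  public Iterator<R> iterator() {
    return new Iterator<R>() {
      @Override
      public boolean hasNext() {
        return rawEntries.hasNext();
      }

      @Override
      public R next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return decoder.apply(rawEntries.next());  // decode lazily, one record at a time
      }
    };
  }

  public static void main(String[] args) {
    Iterator<byte[]> raw = Arrays.asList("res1".getBytes(), "res2".getBytes()).iterator();
    LazyStateIterable<String> resources = new LazyStateIterable<>(raw, String::new);
    for (String r : resources) {
      System.out.println(r);
    }
  }
}
{noformat}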



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8102) Retrospect on having enable and disable flag for Node Attribute

2018-08-29 Thread Naganarasimha G R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596967#comment-16596967
 ] 

Naganarasimha G R commented on YARN-8102:
-

Hi [~bibinchundatt] & [~sunilg],

I think we have had sufficient discussion on this. IMHO it is not required, and 
it can safely be configured under the local host's tmp directory, which most 
clusters will have access to; otherwise we can choose a path relative to 
*_"hadoop.tmp.dir"_*, which is what was also used for the ATSv1 timeline store. 
I would prefer to avoid introducing any kind of new configuration here. If we 
agree, then we can go ahead, make the required changes above, and close this jira.

cc [~cheersyang]

> Retrospect on having enable and disable flag for Node Attribute
> ---
>
> Key: YARN-8102
> URL: https://issues.apache.org/jira/browse/YARN-8102
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Major
>
> Currently the node attribute feature is enabled by default. We have to 
> revisit this.
> Enabling it by default means a store will be created for every cluster 
> installation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7865) Node attributes documentation

2018-08-29 Thread Naganarasimha G R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596963#comment-16596963
 ] 

Naganarasimha G R commented on YARN-7865:
-

Thanks [~sunilg] for the review.

Please find the attached patch addressing your comments.

> Node attributes documentation
> -
>
> Key: YARN-7865
> URL: https://issues.apache.org/jira/browse/YARN-7865
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, 
> YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch, 
> YARN-7865-YARN-3409.004.patch
>
>
> We need proper docs to introduce how to enable node-attributes, how to 
> configure providers, how to specify script paths and arguments in the 
> configuration, what the proper permissions of the script should be, and who 
> will run the script. It would also be good to add more info to the 
> descriptions of the configuration properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7865) Node attributes documentation

2018-08-29 Thread Naganarasimha G R (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-7865:

Attachment: YARN-7865-YARN-3409.004.patch

> Node attributes documentation
> -
>
> Key: YARN-7865
> URL: https://issues.apache.org/jira/browse/YARN-7865
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, 
> YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch, 
> YARN-7865-YARN-3409.004.patch
>
>
> We need proper docs to introduce how to enable node-attributes, how to 
> configure providers, how to specify script paths and arguments in the 
> configuration, what the proper permissions of the script should be, and who 
> will run the script. It would also be good to add more info to the 
> descriptions of the configuration properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7865) Node attributes documentation

2018-08-29 Thread Naganarasimha G R (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-7865:

Attachment: (was: NodeAttributes.html)

> Node attributes documentation
> -
>
> Key: YARN-7865
> URL: https://issues.apache.org/jira/browse/YARN-7865
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, 
> YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch
>
>
> We need proper docs to introduce how to enable node-attributes, how to 
> configure providers, how to specify script paths and arguments in the 
> configuration, what the proper permissions of the script should be, and who 
> will run the script. It would also be good to add more info to the 
> descriptions of the configuration properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7865) Node attributes documentation

2018-08-29 Thread Naganarasimha G R (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-7865:

Attachment: NodeAttributes.html

> Node attributes documentation
> -
>
> Key: YARN-7865
> URL: https://issues.apache.org/jira/browse/YARN-7865
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, 
> YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch
>
>
> We need proper docs to introduce how to enable node-attributes, how to 
> configure providers, how to specify script paths and arguments in the 
> configuration, what the proper permissions of the script should be, and who 
> will run the script. It would also be good to add more info to the 
> descriptions of the configuration properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler

2018-08-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596926#comment-16596926
 ] 

Wangda Tan commented on YARN-8468:
--

[~bsteinbach],

Thanks, I think it makes sense to normalize/validate the max allocation.

A few things:

1) Could you add basic tests to CS to make sure it works? 

2) Inside validateIncreaseDecreaseRequest, the maximumAllocation of the queue is 
fetched twice; let's try to avoid this if possible. 

3) Similarly, {{normalizeAndvalidateRequest}} calls getMaximumAllocation of the 
queue again and again. Let's try to call getMaximumAllocation once per 
{{allocate}} call. 

I'd also like to request another set of eyes to check the details of the patch. 
cc: [~sunilg], [~cheersyang]

> Limit container sizes per queue in FairScheduler
> 
>
> Key: YARN-8468
> URL: https://issues.apache.org/jira/browse/YARN-8468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Critical
> Attachments: YARN-8468.000.patch, YARN-8468.001.patch, 
> YARN-8468.002.patch, YARN-8468.003.patch, YARN-8468.004.patch, 
> YARN-8468.005.patch, YARN-8468.006.patch, YARN-8468.007.patch, 
> YARN-8468.008.patch
>
>
> When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" 
> to limit the overall size of a container. This applies globally to all 
> containers, cannot be limited per queue, and is not scheduler dependent.
> 
> The goal of this ticket is to allow this value to be set on a per-queue basis.
> 
> The use case: a user has two pools, one for ad hoc jobs and one for enterprise 
> apps. The user wants to limit ad hoc jobs to small containers but allow 
> enterprise apps to request as many resources as needed. 
> yarn.scheduler.maximum-allocation-mb sets a default maximum container size for 
> all queues, and the maximum resources per queue would be set with the 
> “maxContainerResources” queue config value.
>  
> Suggested solution:
>  
> All the infrastructure is already in the code. We need to do the following:
>  * add the setting to the queue properties for all queue types (parent and 
> leaf), this will cover dynamically created queues.
>  * if we set it on the root we override the scheduler setting and we should 
> not allow that.
>  * make sure that queue resource cap can not be larger than scheduler max 
> resource cap in the config.
>  * implement getMaximumResourceCapability(String queueName) in the 
> FairScheduler
>  * implement getMaximumResourceCapability() in both FSParentQueue and 
> FSLeafQueue as follows
>  * expose the setting in the queue information in the RM web UI.
>  * expose the setting in the metrics etc for the queue.
>  * write JUnit tests.
>  * update the scheduler documentation.
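An illustrative sketch of the per-queue lookup described in the suggested solution 
above; the class and method shapes here are assumptions for illustration, not the 
actual FairScheduler patch:

{noformat}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Illustrative sketch (not the actual patch): fall back to the scheduler-wide
// maximum when the queue does not override it, and clamp the override so it
// can never exceed the scheduler-wide maximum.
public class QueueMaxAllocationSketch {
  private final Resource schedulerMaxAllocation;

  public QueueMaxAllocationSketch(Resource schedulerMaxAllocation) {
    this.schedulerMaxAllocation = schedulerMaxAllocation;
  }

  public Resource getMaximumResourceCapability(Resource queueMaxOverride) {
    if (queueMaxOverride == null) {
      return schedulerMaxAllocation;   // queue did not set a per-queue maximum
    }
    return Resources.componentwiseMin(queueMaxOverride, schedulerMaxAllocation);
  }

  public static void main(String[] args) {
    QueueMaxAllocationSketch s =
        new QueueMaxAllocationSketch(Resource.newInstance(8192, 8));
    System.out.println(s.getMaximumResourceCapability(Resource.newInstance(2048, 2)));
    System.out.println(s.getMaximumResourceCapability(null));
  }
}
{noformat}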



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails

2018-08-29 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596921#comment-16596921
 ] 

Jason Lowe commented on YARN-8730:
--

trunk and other releases ahead of 2.8 do not exhibit this because they changed the 
annotation of the ResourceInfo class from XmlAccessType.FIELD to 
XmlAccessType.NONE in YARN-6232, when the same Resource field was added to 
ResourceInfo. branch-2.8 needs a similar setup so that the "res" field is not 
advertised in the REST API.
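For illustration, the JAXB pattern being referred to, shown on a made-up DAO class 
(not the real ResourceInfo): with XmlAccessType.NONE only explicitly annotated 
members appear in the generated output, so an internal field like "res" stays 
hidden:

{noformat}
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Illustrative DAO (not the real ResourceInfo class): with NONE access, only
// members explicitly annotated with @XmlElement are marshalled, so an
// internal field like "res" is never advertised through the REST API.
@XmlRootElement
@XmlAccessorType(XmlAccessType.NONE)
public class ResourceInfoExample {
  private long memory;
  private int vCores;
  private Object res;            // internal only; intentionally not annotated

  @XmlElement
  public long getMemory() {
    return memory;
  }

  @XmlElement
  public int getvCores() {
    return vCores;
  }
}
{noformat}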


> TestRMWebServiceAppsNodelabel#testAppsRunning fails
> ---
>
> Key: YARN-8730
> URL: https://issues.apache.org/jira/browse/YARN-8730
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.4
>Reporter: Jason Lowe
>Priority: Major
>
> TestRMWebServiceAppsNodelabel is failing in branch-2.8:
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< 
> FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
> testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel)
>   Time elapsed: 6.708 sec  <<< FAILURE!
> org.junit.ComparisonFailure: partition amused 
> expected:<{"[]memory":1024,"vCores...> but 
> was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7619) Max AM Resource value in Capacity Scheduler UI has to be refreshed for every user

2018-08-29 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596922#comment-16596922
 ] 

Jason Lowe commented on YARN-7619:
--

The branch-2.8 version of this patch unfortunately affected the 2.8 REST API 
when it modified the ResourceInfo DAO class.  See YARN-8730 for details.

> Max AM Resource value in Capacity Scheduler UI has to be refreshed for every 
> user
> -
>
> Key: YARN-7619
> URL: https://issues.apache.org/jira/browse/YARN-7619
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.1, 2.8.4
>
> Attachments: Max AM Resources is Different for Each User.png, 
> YARN-7619.001.patch, YARN-7619.002.patch, YARN-7619.003.patch, 
> YARN-7619.004.branch-2.8.patch, YARN-7619.004.branch-3.0.patch, 
> YARN-7619.004.patch, YARN-7619.005.branch-2.8.patch, 
> YARN-7619.005.branch-3.0.patch, YARN-7619.005.patch
>
>
> YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity 
> scheduler UI used to contain the queue-level AM limit instead of the 
> user-level AM limit. It fixed this by using the user-specific AM limit that 
> is calculated in {{LeafQueue#activateApplications}}, stored in each user's 
> {{LeafQueue#User}} object, and retrieved via 
> {{UserInfo#getResourceUsageInfo}}.
> The problem is that this user-specific AM limit depends on the activity of 
> other users and other applications in a queue, and it is only calculated and 
> updated when a user's application is activated. So, when 
> {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale 
> value unless an application was recently activated for a particular user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails

2018-08-29 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-8730:
-
Affects Version/s: 2.8.4

> TestRMWebServiceAppsNodelabel#testAppsRunning fails
> ---
>
> Key: YARN-8730
> URL: https://issues.apache.org/jira/browse/YARN-8730
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.4
>Reporter: Jason Lowe
>Priority: Major
>
> TestRMWebServiceAppsNodelabel is failing in branch-2.8:
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< 
> FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
> testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel)
>   Time elapsed: 6.708 sec  <<< FAILURE!
> org.junit.ComparisonFailure: partition amused 
> expected:<{"[]memory":1024,"vCores...> but 
> was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails

2018-08-29 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596911#comment-16596911
 ] 

Jason Lowe commented on YARN-8730:
--

git bisect narrows this down to YARN-7619.  A "res" field was added to the 
ResourceInfo DAO object, and since all fields are advertised by default it 
incorrectly appears in query output.  The trunk version of ResourceInfo does 
not automatically advertise fields by default.


> TestRMWebServiceAppsNodelabel#testAppsRunning fails
> ---
>
> Key: YARN-8730
> URL: https://issues.apache.org/jira/browse/YARN-8730
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Jason Lowe
>Priority: Major
>
> TestRMWebServiceAppsNodelabel is failing in branch-2.8:
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< 
> FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
> testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel)
>   Time elapsed: 6.708 sec  <<< FAILURE!
> org.junit.ComparisonFailure: partition amused 
> expected:<{"[]memory":1024,"vCores...> but 
> was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...>
>   at org.junit.Assert.assertEquals(Assert.java:115)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails

2018-08-29 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8730:


 Summary: TestRMWebServiceAppsNodelabel#testAppsRunning fails
 Key: YARN-8730
 URL: https://issues.apache.org/jira/browse/YARN-8730
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jason Lowe


TestRMWebServiceAppsNodelabel is failing in branch-2.8:
{noformat}
Running 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel)
  Time elapsed: 6.708 sec  <<< FAILURE!
org.junit.ComparisonFailure: partition amused 
expected:<{"[]memory":1024,"vCores...> but 
was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...>
at org.junit.Assert.assertEquals(Assert.java:115)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8051) TestRMEmbeddedElector#testCallbackSynchronization is flakey

2018-08-29 Thread Eric Payne (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596905#comment-16596905
 ] 

Eric Payne commented on YARN-8051:
--

{quote}
-1  unit336m 4s hadoop-yarn-server-resourcemanager in the patch 
failed.
-1  asflicense  0m 32s  The patch generated 1 ASF License warnings. 
{quote}

I think these are unrelated. The unit test that was modified does have a 
previously included ASF section, and I manually verified that the unit test 
succeeds for branch-2, branch-2.9, and branch-2.8.
 

> TestRMEmbeddedElector#testCallbackSynchronization is flakey
> ---
>
> Key: YARN-8051
> URL: https://issues.apache.org/jira/browse/YARN-8051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.10.0, 2.9.1, 2.8.4, 3.0.2, 3.2.0, 3.1.1
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 3.2.0, 3.0.4, 3.1.2
>
> Attachments: YARN-8051-branch-2.002.patch, YARN-8051.001.patch, 
> YARN-8051.002.patch
>
>
> We've seen some rare flakey failures in 
> {{TestRMEmbeddedElector#testCallbackSynchronization}}:
> {noformat}
> org.mockito.exceptions.verification.WantedButNotInvoked: 
> Wanted but not invoked:
> adminService.transitionToStandby();
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215)
> Actually, there were zero interactions with this mock.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:146)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:109)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8051) TestRMEmbeddedElector#testCallbackSynchronization is flakey

2018-08-29 Thread Eric Payne (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596901#comment-16596901
 ] 

Eric Payne commented on YARN-8051:
--

+1. Committing to branch-2, branch-2.9, and branch-2.8

> TestRMEmbeddedElector#testCallbackSynchronization is flakey
> ---
>
> Key: YARN-8051
> URL: https://issues.apache.org/jira/browse/YARN-8051
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.10.0, 2.9.1, 2.8.4, 3.0.2, 3.2.0, 3.1.1
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Fix For: 3.2.0, 3.0.4, 3.1.2
>
> Attachments: YARN-8051-branch-2.002.patch, YARN-8051.001.patch, 
> YARN-8051.002.patch
>
>
> We've seen some rare flakey failures in 
> {{TestRMEmbeddedElector#testCallbackSynchronization}}:
> {noformat}
> org.mockito.exceptions.verification.WantedButNotInvoked: 
> Wanted but not invoked:
> adminService.transitionToStandby();
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215)
> Actually, there were zero interactions with this mock.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronizationNeutral(TestRMEmbeddedElector.java:215)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:146)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector.testCallbackSynchronization(TestRMEmbeddedElector.java:109)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-08-29 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596887#comment-16596887
 ] 

Eric Badger commented on YARN-8706:
---

I think the default sleep delay between STOPSIGNAL and SIGKILL is irrelevant to 
this specific JIRA. It's certainly something that we can discuss and maybe it 
is reasonable to increase the value, but I don't think that it has anything to 
do with how we send the signals. That discussion is a separate issue. I am 
still in favor of my original proposal.

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time use by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of the docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExcecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}} and so irrespective of the 
> container type, it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period
> - when sleepDelayBeforeSigKill > 10 seconds, then there is no point of 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, then the grace period should be 
> the smallest value, which is 1 second, because anyways we are forcing kill 
> after 250 ms
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-08-29 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596883#comment-16596883
 ] 

Eric Yang commented on YARN-8706:
-

{quote}Why is this specific to docker containers? Other types of containers 
may be dealing with data, and if the default grace period of 250 millis is too 
small, then it can be changed with the config NM_SLEEP_DELAY_BEFORE_SIGKILL_MS. 
Maybe this should be something that the application could specify as well, but 
that is a different discussion.{quote}

YARN containers were mostly stateless and not reused.  The short termination 
wait time works without causing problems for Hadoop-specific applications.  
With the introduction of Docker containers, it might take several seconds to 
gracefully shut down a database daemon.  A 10-second default seems like a safer 
wait time if the docker container is persisted and reused.  There isn't much data 
to show that waiting longer is better at this time.  The default setting may 
be revisited later.

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time use by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of the docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExcecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}} and so irrespective of the 
> container type, it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period
> - when sleepDelayBeforeSigKill > 10 seconds, then there is no point of 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, then the grace period should be 
> the smallest value, which is 1 second, because anyways we are forcing kill 
> after 250 ms
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8638) Allow linux container runtimes to be pluggable

2018-08-29 Thread Kenny Chang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596868#comment-16596868
 ] 

Kenny Chang commented on YARN-8638:
---

+1 Noob, non-binding.

> Allow linux container runtimes to be pluggable
> --
>
> Key: YARN-8638
> URL: https://issues.apache.org/jira/browse/YARN-8638
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Minor
> Attachments: YARN-8638.001.patch, YARN-8638.002.patch, 
> YARN-8638.003.patch, YARN-8638.004.patch
>
>
> YARN currently supports three different Linux container runtimes (default, 
> docker, and javasandbox). However, it would be relatively straightforward to 
> support arbitrary runtime implementations. This would enable easier 
> experimentation with new and emerging runtime technologies (runc, containerd, 
> etc.) without requiring a rebuild and redeployment of Hadoop. 
> This could be accomplished via a simple configuration change:
> {code:xml}
> 
>  yarn.nodemanager.runtime.linux.allowed-runtimes
>  default,docker,experimental
> 
>  
> 
>  yarn.nodemanager.runtime.linux.experimental.class
>  com.somecompany.yarn.runtime.ExperimentalLinuxContainerRuntime
> {code}
>  
> In this example, {{yarn.nodemanager.runtime.linux.allowed-runtimes}} would 
> now allow arbitrary values. Additionally, 
> {{yarn.nodemanager.runtime.linux.\{RUNTIME_KEY}.class}} would indicate the 
> {{LinuxContainerRuntime}} implementation to instantiate. A no-argument 
> constructor should be sufficient, as {{LinuxContainerRuntime}} already 
> provides an {{initialize()}} method.
> {{DockerLinuxContainerRuntime.isDockerContainerRequested(Map<String, String> 
> env)}} and {{JavaSandboxLinuxContainerRuntime.isSandboxContainerRequested()}} 
> could be generalized to {{isRuntimeRequested(Map<String, String> env)}} and 
> added to the {{LinuxContainerRuntime}} interface. This would allow 
> {{DelegatingLinuxContainerRuntime}} to select an appropriate runtime based on 
> whether that runtime claimed ownership of the current container execution.
> For backwards compatibility, the existing values (default,docker,javasandbox) 
> would continue to be supported as-is. Under the current logic, the evaluation 
> order is javasandbox, docker, default (with default being chosen if no other 
> candidates are available). Under the new evaluation logic, pluggable runtimes 
> would be evaluated after docker and before default, in the order in which 
> they are defined in the allowed-runtimes list. This will change no behavior 
> on current clusters (as there would be no pluggable runtimes defined), and 
> preserves behavior with respect to ordering of existing runtimes.
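
As a rough, hypothetical sketch of what such a pluggable runtime could look like (the 
package, class, and environment variable below are illustrative assumptions, and the 
exact LinuxContainerRuntime method signatures are not taken from the patch):

{code:java}
// Hypothetical example only. A pluggable runtime that claims a container
// when the submitted environment asks for it by name.
package com.somecompany.yarn.runtime;

import java.util.Map;

public class ExperimentalLinuxContainerRuntime {

  /** Name that would appear in yarn.nodemanager.runtime.linux.allowed-runtimes. */
  public static final String RUNTIME_NAME = "experimental";

  /**
   * Generalized form of isDockerContainerRequested(...): return true only
   * if the container environment explicitly requests this runtime.
   * The environment variable name is an assumption for illustration.
   */
  public boolean isRuntimeRequested(Map<String, String> env) {
    return RUNTIME_NAME.equalsIgnoreCase(env.get("YARN_CONTAINER_RUNTIME_TYPE"));
  }
}
{code}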



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6972) Adding RM ClusterId in AppInfo

2018-08-29 Thread Tanuj Nayak (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanuj Nayak updated YARN-6972:
--
Attachment: YARN-6972.016.patch

> Adding RM ClusterId in AppInfo
> --
>
> Key: YARN-6972
> URL: https://issues.apache.org/jira/browse/YARN-6972
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Giovanni Matteo Fumarola
>Assignee: Tanuj Nayak
>Priority: Major
> Attachments: YARN-6972.001.patch, YARN-6972.002.patch, 
> YARN-6972.003.patch, YARN-6972.004.patch, YARN-6972.005.patch, 
> YARN-6972.006.patch, YARN-6972.007.patch, YARN-6972.008.patch, 
> YARN-6972.009.patch, YARN-6972.010.patch, YARN-6972.011.patch, 
> YARN-6972.012.patch, YARN-6972.013.patch, YARN-6972.014.patch, 
> YARN-6972.015.patch, YARN-6972.016.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-6338) Typos in Docker docs: contains => containers

2018-08-29 Thread Zoltan Siegl (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Siegl reassigned YARN-6338:
--

Assignee: Zoltan Siegl  (was: Szilard Nemeth)

> Typos in Docker docs: contains => containers
> 
>
> Key: YARN-6338
> URL: https://issues.apache.org/jira/browse/YARN-6338
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0, 3.0.0-alpha4
>Reporter: Daniel Templeton
>Assignee: Zoltan Siegl
>Priority: Minor
>  Labels: docs
>
> "allowed to request privileged contains" should be "allowed to request 
> privileged containers"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate

2018-08-29 Thread Pradeep Ambati (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596230#comment-16596230
 ] 

Pradeep Ambati edited comment on YARN-8680 at 8/29/18 6:38 PM:
---

Hi [~sunilg],

This jira is marked as critical for 3.2. I can definitely take it forward if it 
is not feasible to complete it in coming weeks.


was (Author: pradeepambati):
Hi [~sunilg],

 

This jira is marked as critical for 3.2. I can definitely take it forward if it 
is not feasible to complete it in coming weeks.

> YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
> -
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch
>
>
> Similar to YARN-8242, implement iterable abstraction for 
> LocalResourceTrackerState to load completed and in progress resources when 
> needed rather than loading them all at a time for a respective state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-29 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596699#comment-16596699
 ] 

Jason Lowe commented on YARN-8648:
--

Thanks for the patch!

Why was the postComplete call moved in reapContainer to before the container is 
removed via docker?  Shouldn't docker first remove its cgroups for the 
container before we remove ours?

Is there a reason to separate removing docker cgroups from removing the docker 
container?  This seems like a natural extension to cleaning up after a 
container run by docker, and that's already covered by the reap command.  The 
patch would remain a docker-only change but without needing to modify the 
container-executor interface.

Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased.  Similar 
for CGROUP_MOUNT.

The snprintf result should be checked for truncation in addition to output 
errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we formulate 
an incomplete path targeted for deletion if that somehow occurs.  Alternatively 
the code could use make_string or asprintf to allocate an appropriately sized 
buffer for each entry rather than trying to reuse a manually sized buffer.

Is there any point in logging to the error file that a path we want to delete 
has already been deleted?  This seems like it will just be noise, especially if 
systemd or something else is periodically cleaning some of these empty cgroups.

Related to the previous comment, the rmdir result should be checked for ENOENT 
and treat that as success.

Nit: I think lineptr should be freed in the cleanup label in case someone later 
adds a fatal error that jumps to cleanup.

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup,   All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.
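
To make the cleanup concrete, here is a minimal hedged sketch of the idea (in Java 
rather than the container-executor's C, and with made-up names): walk every cgroup 
controller under /sys/fs/cgroup and remove the now-empty per-container directory, 
treating an already-missing directory as success.

{code:java}
// Hedged illustration only, not the attached patch: remove the leaked
// per-controller container cgroup directories left behind after docker
// cleans up its own leaf cgroups.
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LeakedCgroupCleanerSketch {

  /** hierarchy is e.g. "hadoop-yarn"; containerId is the YARN container id. */
  public static void removeContainerCgroups(String hierarchy, String containerId)
      throws IOException {
    try (DirectoryStream<Path> controllers =
        Files.newDirectoryStream(Paths.get("/sys/fs/cgroup"))) {
      for (Path controller : controllers) {
        Path dir = controller.resolve(hierarchy).resolve(containerId);
        try {
          // rmdir semantics: only an empty cgroup directory can be removed;
          // a missing directory simply returns false (treated as success).
          Files.deleteIfExists(dir);
        } catch (DirectoryNotEmptyException ignored) {
          // a live child cgroup still exists; leave it for a later pass
        }
      }
    }
  }
}
{code}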



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8706) DelayedProcessKiller is executed for Docker containers even though docker stop sends a KILL signal after the specified grace period

2018-08-29 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596637#comment-16596637
 ] 

Chandni Singh commented on YARN-8706:
-

{quote}I am not entirely sure about globally identical killing mechanism for 
all container type, is a sane approach to brute force container shutdown.
{quote}
I am not sure what you mean. NM does a graceful shutdown for all types of 
containers. It first sends a {{SIGTERM}} and then after a grace period, sends 
{{SIGKILL}}. 
The {{SIGTERM}} for docker is handled by docker stop, which has the following 
problems:
1. The grace period can be specified only in seconds.
2. It couples {{SIGKILL}} with stop: Docker first sends a {{STOPSIGNAL}} to the root 
process and then, after the grace period, sends {{SIGKILL}} to the root process. 
This is not what the NM wants from the stop, and docker stop doesn't give any option 
to NOT send {{SIGKILL}}.
The proposed change by [~ebadger] will just send the {{STOPSIGNAL}}, which 
solves our problem.
{quote}10 seconds default is probably more sensible to give the container a 
chance to shutdown gracefully without causing corruption to data.
{quote}
Why is this specific to docker containers? Other types of containers may be 
dealing with data, and if the default grace period of 250 millis is too small, 
then it can be changed with the config {{NM_SLEEP_DELAY_BEFORE_SIGKILL_MS}}. 
Maybe this should be something that the application could specify as well, but 
that is a different discussion.
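
Purely as a hedged illustration of the unit mismatch being discussed (the helper below 
is made up and is not the attached proposal), the NM's millisecond-granularity 
sleepDelayBeforeSigKill would have to be mapped onto docker stop's whole-second grace 
period, with 1 second as the floor, as the issue description quoted below suggests:

{code:java}
// Hypothetical sketch only: derive a docker stop grace period (whole
// seconds, minimum 1) from the NM's sleepDelayBeforeSigKill milliseconds.
import java.util.concurrent.TimeUnit;

public class DockerStopGracePeriodSketch {

  static long graceSeconds(long sleepDelayBeforeSigKillMs) {
    // docker stop only accepts whole seconds; never go below 1 second
    return Math.max(1, TimeUnit.MILLISECONDS.toSeconds(sleepDelayBeforeSigKillMs));
  }

  public static void main(String[] args) {
    System.out.println(graceSeconds(250));    // 1  (the NM default of 250 ms)
    System.out.println(graceSeconds(15000));  // 15 (DelayedProcessKiller adds nothing)
  }
}
{code}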

> DelayedProcessKiller is executed for Docker containers even though docker 
> stop sends a KILL signal after the specified grace period
> ---
>
> Key: YARN-8706
> URL: https://issues.apache.org/jira/browse/YARN-8706
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>  Labels: docker
>
> {{DockerStopCommand}} adds a grace period of 10 seconds.
> 10 seconds is also the default grace time use by docker stop
>  [https://docs.docker.com/engine/reference/commandline/stop/]
> Documentation of the docker stop:
> {quote}the main process inside the container will receive {{SIGTERM}}, and 
> after a grace period, {{SIGKILL}}.
> {quote}
> There is a {{DelayedProcessKiller}} in {{ContainerExcecutor}} which executes 
> for all containers after a delay when {{sleepDelayBeforeSigKill>0}}. By 
> default this is set to {{250 milliseconds}} and so irrespective of the 
> container type, it will always get executed.
>  
> For a docker container, {{docker stop}} takes care of sending a {{SIGKILL}} 
> after the grace period
> - when sleepDelayBeforeSigKill > 10 seconds, then there is no point of 
> executing DelayedProcessKiller
> - when sleepDelayBeforeSigKill < 1 second, then the grace period should be 
> the smallest value, which is 1 second, because anyways we are forcing kill 
> after 250 ms
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8642) Add support for tmpfs mounts with the Docker runtime

2018-08-29 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596591#comment-16596591
 ] 

Eric Yang edited comment on YARN-8642 at 8/29/18 5:04 PM:
--

{quote}
This need will be dependent on what is running in the container. It would be 
nice to be able to reference UID and GID by variable, as you've outlined. Maybe 
resolving those variables within the mount related environment variables is a 
task the YARN Services AM could handle? Could we discuss in a follow on since 
this seems like a useful feature beyond just the tmpfs mounts?{quote}

I did some experiments and found that docker creates a unique sandbox_file in 
tmpfs per container.  There is no need to pre-partition in the command that we 
supply to docker.  Therefore, there is no security concern if multiple 
containers use /run as tmpfs; data is not going to be shared among containers. 
The patch is good as is, and we don't need a follow-up JIRA for this. 
Thank you for the commit [~shaneku...@gmail.com].


was (Author: eyang):
{quote}
This need will be dependent on what is running in the container. It would be 
nice to be able to reference UID and GID by variable, as you've outlined. Maybe 
resolving those variables within the mount related environment variables is a 
task the YARN Services AM could handle? Could we discuss in a follow on since 
this seems like a useful feature beyond just the tmpfs mounts?{quote}

I did some experiments and found that docker creates a unique sandbox_file in 
tmpfs per container.  There is no need to pre-partition in the command that we 
supply to docker.  Therefore, there is no security concern if multiple 
containers use /run as tmpfs; data is not going to be shared among containers. 
The patch is good as is.  Thank you for the commit 
[~shaneku...@gmail.com].

> Add support for tmpfs mounts with the Docker runtime
> 
>
> Key: YARN-8642
> URL: https://issues.apache.org/jira/browse/YARN-8642
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
>  Labels: Docker
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8642.001.patch, YARN-8642.002.patch
>
>
> Add support to the existing Docker runtime to allow the user to request tmpfs 
> mounts for their containers. For example:
> {code}/usr/bin/docker run --name=container_name --tmpfs /run image 
> /bootstrap/start-systemd
> {code}
> One use case is to allow systemd to run as PID 1 in a non-privileged 
> container, /run is expected to be a tmpfs mount in the container for that to 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8642) Add support for tmpfs mounts with the Docker runtime

2018-08-29 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596591#comment-16596591
 ] 

Eric Yang edited comment on YARN-8642 at 8/29/18 5:02 PM:
--

{quote}
This need will be dependent on what is running in the container. It would be 
nice to be able to reference UID and GID by variable, as you've outlined. Maybe 
resolving those variables within the mount related environment variables is a 
task the YARN Services AM could handle? Could we discuss in a follow on since 
this seems like a useful feature beyond just the tmpfs mounts?{quote}

I did some experiments and found that docker creates a unique sandbox_file in 
tmpfs per container.  There is no need to pre-partition in the command that we 
supply to docker.  Therefore, there is no security concern if multiple 
containers use /run as tmpfs; data is not going to be shared among containers. 
The patch is good as is.  Thank you for the commit 
[~shaneku...@gmail.com].


was (Author: eyang):
{quote}
This need will be dependent on what is running in the container. It would be 
nice to be able to reference UID and GID by variable, as you've outlined. Maybe 
resolving those variables within the mount related environment variables is a 
task the YARN Services AM could handle? Could we discuss in a follow on since 
this seems like a useful feature beyond just the tmpfs mounts?{quote}

I was thinking of partitioning it via container id or some mechanism that can be 
cleaned up automatically.  The YARN service or upper layers don't have visibility 
into the absolute path or convention used in container-executor.  It might be better 
to keep this path logic contained in DockerLinuxContainerRuntime.  Sorry, I 
replied too late; I will open another JIRA to refine this.

> Add support for tmpfs mounts with the Docker runtime
> 
>
> Key: YARN-8642
> URL: https://issues.apache.org/jira/browse/YARN-8642
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
>  Labels: Docker
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8642.001.patch, YARN-8642.002.patch
>
>
> Add support to the existing Docker runtime to allow the user to request tmpfs 
> mounts for their containers. For example:
> {code}/usr/bin/docker run --name=container_name --tmpfs /run image 
> /bootstrap/start-systemd
> {code}
> One use case is to allow systemd to run as PID 1 in a non-privileged 
> container, /run is expected to be a tmpfs mount in the container for that to 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8642) Add support for tmpfs mounts with the Docker runtime

2018-08-29 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596591#comment-16596591
 ] 

Eric Yang commented on YARN-8642:
-

{quote}
This need will be dependent on what is running in the container. It would be 
nice to be able to reference UID and GID by variable, as you've outlined. Maybe 
resolving those variables within the mount related environment variables is a 
task the YARN Services AM could handle? Could we discuss in a follow on since 
this seems like a useful feature beyond just the tmpfs mounts?{quote}

I was thinking of partitioning it via container id or some mechanism that can be 
cleaned up automatically.  The YARN service or upper layers don't have visibility 
into the absolute path or convention used in container-executor.  It might be better 
to keep this path logic contained in DockerLinuxContainerRuntime.  Sorry, I 
replied too late; I will open another JIRA to refine this.

> Add support for tmpfs mounts with the Docker runtime
> 
>
> Key: YARN-8642
> URL: https://issues.apache.org/jira/browse/YARN-8642
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
>  Labels: Docker
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8642.001.patch, YARN-8642.002.patch
>
>
> Add support to the existing Docker runtime to allow the user to request tmpfs 
> mounts for their containers. For example:
> {code}/usr/bin/docker run --name=container_name --tmpfs /run image 
> /bootstrap/start-systemd
> {code}
> One use case is to allow systemd to run as PID 1 in a non-privileged 
> container, /run is expected to be a tmpfs mount in the container for that to 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application

2018-08-29 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596578#comment-16596578
 ] 

Eric Yang commented on YARN-8569:
-

{quote}There were tons of debates regarding to yarn user should be treated as 
root or not before. We saw some issues of c-e causes yarn user can manipulate 
other user's directories, or directly escalate to root user. All of these 
issues become CVE.{quote}

If I recall correctly, I reported and fixed container-executor security issues 
like YARN-7590 and YARN-8207.  I think I have written proper security checks to 
make sure the caller coming over the network has the right Kerberos tgt matching the 
end user's container directory, and also validated that the source of the data comes 
from the node manager private directory.  There is a permission validation when copying 
spec file information from the node manager private directory to the end user's container 
directory.  This design is similar to transporting delegation tokens to the 
container working directory.  I think there are enough security 
validations to ensure that no security hole has been added by this work.  Let me 
know if you find any security holes.

{quote}
From YARN's design purpose, ideally all NM/RM logics should be as general as 
possible, all service-related stuffs should be handled by service framework 
like API server or ServiceMaster. I really don't like the idea of adding 
service-specific API to NM API.{quote}

The new API is not specific to the YARN service framework.  ContainerExecutor provides 
basic APIs for starting, stopping, and cleaning up containers, but it is missing 
more sophisticated APIs such as synchronizing configuration among containers.  The 
new syncYarnSysFS API is proposed to allow ContainerExecutor developers to write 
their own implementation for populating text information into /hadoop/yarn/sysfs.  
A custom AM can be written to use the new API and populate other text 
information.  The newly added node manager API is generic and avoids the 
double serialization cost that exists in the container manager protobuf and RPC code.  
There is no extra serialization cost for the content during transport, so the 
new API is more efficient and lightweight.  There is nothing specific to YARN 
service here, although YARN service is the first consumer of this API.

{quote}
1) ServiceMaster ro mount a local directory (under the container's local dir) 
when launch docker container (example like: ./service-info -> /service/sys/fs/) 

2) ServiceMaster request to re-localize new service spec json file to the 
./service-info folder.
{quote}

What is "ro mount" in the first sentence?  Is it remote mount or read-only 
mount?  ServiceMaster is not guarantee to run on the same node as other 
container.  Hence, there is no practical way to mount service master's 
directory to container's local directory cross nodes.
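
As a hedged, container-side illustration of what the syncYarnSysFS proposal above 
enables (the file name app.json and the reading code are assumptions, not part of the 
patch): an application inside the container could simply read whatever the AM published 
under the mounted /hadoop/yarn/sysfs path.

{code:java}
// Hypothetical container-side usage: read the cluster information that an
// AM pushed through the proposed syncYarnSysFS API. The file name is an
// assumption for illustration.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadYarnSysFsSketch {
  public static void main(String[] args) throws Exception {
    String spec = new String(
        Files.readAllBytes(Paths.get("/hadoop/yarn/sysfs/app.json")),
        StandardCharsets.UTF_8);
    // e.g. feed the host list into the application's bootstrap logic
    System.out.println(spec);
  }
}
{code}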

> Create an interface to provide cluster information to application
> -
>
> Key: YARN-8569
> URL: https://issues.apache.org/jira/browse/YARN-8569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8569.001.patch, YARN-8569.002.patch
>
>
> Some programs require container hostnames to be known for the application to run. 
>  For example, distributed TensorFlow requires a launch_command that looks like:
> {code}
> # On ps0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=0
> # On ps1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=1
> # On worker0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=0
> # On worker1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=1
> {code}
> This is a bit cumbersome to orchestrate via Distributed Shell or the YARN 
> services launch_command.  In addition, the dynamic parameters do not work 
> with the YARN flex command.  This is the classic pain point for application 
> developers attempting to automate system environment settings as parameters 
> to the end-user application.
> It would be great if the YARN Docker integration could provide a simple option to 
> expose hostnames of the yarn service via a mounted file.  The file content 
> gets updated whe

[jira] [Commented] (YARN-8644) Make RMAppImpl$FinalTransition more readable + add more test coverage

2018-08-29 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596476#comment-16596476
 ] 

Szilard Nemeth commented on YARN-8644:
--

Hi [~zsiegl]!
Thanks for your review comments, both findings are good catches.
Please check my updated patch!
Thanks!

> Make RMAppImpl$FinalTransition more readable + add more test coverage
> -
>
> Key: YARN-8644
> URL: https://issues.apache.org/jira/browse/YARN-8644
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-8644.001.patch, YARN-8644.002.patch, 
> YARN-8644.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8644) Make RMAppImpl$FinalTransition more readable + add more test coverage

2018-08-29 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-8644:
-
Attachment: YARN-8644.003.patch

> Make RMAppImpl$FinalTransition more readable + add more test coverage
> -
>
> Key: YARN-8644
> URL: https://issues.apache.org/jira/browse/YARN-8644
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-8644.001.patch, YARN-8644.002.patch, 
> YARN-8644.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7865) Node attributes documentation

2018-08-29 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596412#comment-16596412
 ] 

Sunil Govindan commented on YARN-7865:
--

Thanks [~Naganarasimha]
Some quick comments.
 # The Node Attributes page needs to be linked to the left panel. Please add it in the 
hadoop-yarn site project.
 # Could we also mention what's not supported in REST?
 # Distributed node-to-Attributes mapping ==> Distributed Node Attributes 
mapping.
 # Unlike Node Labels, Node Attributes need not be explicitly enabled as it 
will be always existing and would have no impact in terms of performance or 
compatability even if not used. ==> Unlike Node Labels, Node Attributes need 
not be explicitly enabled as they always exist and have no impact 
in terms of performance or compatibility even if the feature is not used.

> Node attributes documentation
> -
>
> Key: YARN-7865
> URL: https://issues.apache.org/jira/browse/YARN-7865
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Weiwei Yang
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: NodeAttributes.html, YARN-7865-YARN-3409.001.patch, 
> YARN-7865-YARN-3409.002.patch, YARN-7865-YARN-3409.003.patch
>
>
> We need proper docs to introduce how to enable node-attributes how to 
> configure providers, how to specify script paths, arguments in configuration, 
> what should be the proper permission of the script and who will run the 
> script. Also it would be good to add more info to the description of the 
> configuration properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-3409) Support Node Attribute functionality

2018-08-29 Thread Naganarasimha G R (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R updated YARN-3409:

Attachment: Node-Attributes-Requirements-Design-doc_v2.pdf

> Support Node Attribute functionality
> 
>
> Key: YARN-3409
> URL: https://issues.apache.org/jira/browse/YARN-3409
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: api, client, RM
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: 3409-apiChanges_v2.pdf (4).pdf, 
> Constraint-Node-Labels-Requirements-Design-doc_v1.pdf, 
> Node-Attributes-Requirements-Design-doc_v2.pdf, YARN-3409.WIP.001.patch
>
>
> Specifying only one label for each node (in other words, partitioning a cluster) is a way to 
> determine how resources of a special set of nodes can be shared by a 
> group of entities (like teams, departments, etc.). Partitions of a cluster 
> have the following characteristics:
> - The cluster is divided into several disjoint sub-clusters.
> - ACLs/priority can apply on a partition (e.g., only the market team has 
> priority to use the partition).
> - Percentages of capacity can apply on a partition (the market team has 40% 
> minimum capacity and the dev team has 60% minimum capacity of the partition).
> Attributes are orthogonal to partitions; they describe features of a node's 
> hardware/software just for affinity. Some examples of attributes:
> - glibc version
> - JDK version
> - Type of CPU (x86_64/i686)
> - Type of OS (windows, linux, etc.)
> With this, an application can ask for resources that have (glibc.version >= 
> 2.20 && JDK.version >= 8u20 && x86_64).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7903) Method getStarvedResourceRequests() only consider the first encountered resource

2018-08-29 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth reassigned YARN-7903:


Assignee: Szilard Nemeth

> Method getStarvedResourceRequests() only consider the first encountered 
> resource
> 
>
> Key: YARN-7903
> URL: https://issues.apache.org/jira/browse/YARN-7903
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Yufei Gu
>Assignee: Szilard Nemeth
>Priority: Major
>
> We need to specify rack and ANY while submitting a node local resource 
> request, as YARN-7561 discussed. For example:
> {code}
> ResourceRequest nodeRequest =
> createResourceRequest(GB, node1.getHostName(), 1, 1, false);
> ResourceRequest rackRequest =
> createResourceRequest(GB, node1.getRackName(), 1, 1, false);
> ResourceRequest anyRequest =
> createResourceRequest(GB, ResourceRequest.ANY, 1, 1, false);
> List<ResourceRequest> resourceRequests =
> Arrays.asList(nodeRequest, rackRequest, anyRequest);
> {code}
> However, the method getStarvedResourceRequests() only considers the first 
> encountered resource, which most likely is ResourceRequest.ANY. That's a 
> mismatch for locality requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8286) Add NMClient callback on container relaunch

2018-08-29 Thread Billie Rinaldi (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Billie Rinaldi updated YARN-8286:
-
Target Version/s: 3.3.0  (was: 3.2.0)

> Add NMClient callback on container relaunch
> ---
>
> Key: YARN-8286
> URL: https://issues.apache.org/jira/browse/YARN-8286
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Priority: Critical
>
> The AM may need to perform actions when a container has been relaunched. For 
> example, the service AM would want to change the state it has recorded for 
> the container and retrieve new container status for the container, in case 
> the container IP has changed. (The NM would also need to remove the IP it has 
> stored for the container, so container status calls don't return an IP for a 
> container that is not currently running.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8648) Container cgroups are leaked when using docker

2018-08-29 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596345#comment-16596345
 ] 

Jim Brennan commented on YARN-8648:
---

Looks like this is ready for review.

 

> Container cgroups are leaked when using docker
> --
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups 
> for all resources on the system, not just for cpu.  For instance, if the 
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
> the nodemanager will create a cgroup for each container under 
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path 
> via the {{--cgroup-parent}} command line argument.   Docker then creates a 
> cgroup for the docker container under that, for instance: 
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}} 
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup,   All is 
> good under {{/sys/fs/cgroup/hadoop-yarn}}.
> The problem is that docker also creates that same hierarchy under every 
> resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these 
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
> perf_event, and systemd.So for instance, docker creates 
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but 
> it only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up 
> the {{container_id}} cgroups for these other resources.  On one of our busy 
> clusters, we found > 100,000 of these leaked cgroups.
> I found this in our 2.8-based version of hadoop, but I have been able to 
> repro with current hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8642) Add support for tmpfs mounts with the Docker runtime

2018-08-29 Thread Shane Kumpf (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596307#comment-16596307
 ] 

Shane Kumpf commented on YARN-8642:
---

Thanks to [~ccondit-target] for the contribution and [~ebadger] and [~eyang] 
for the reviews! I committed this to trunk and branch-3.1.

> Add support for tmpfs mounts with the Docker runtime
> 
>
> Key: YARN-8642
> URL: https://issues.apache.org/jira/browse/YARN-8642
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Major
>  Labels: Docker
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8642.001.patch, YARN-8642.002.patch
>
>
> Add support to the existing Docker runtime to allow the user to request tmpfs 
> mounts for their containers. For example:
> {code}/usr/bin/docker run --name=container_name --tmpfs /run image 
> /bootstrap/start-systemd
> {code}
> One use case is to allow systemd to run as PID 1 in a non-privileged 
> container, /run is expected to be a tmpfs mount in the container for that to 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8729) Node status updater thread could be lost after it restarted

2018-08-29 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8729:
---
Attachment: YARN-8729.001.patch

> Node status updater thread could be lost after it restarted
> ---
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> this thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not set back to false before {{statusUpdater.start()}} is executed, 
> so if the new thread starts immediately and finds isStopped==true, it 
> will exit without any log.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   // this line should be moved before statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8729) Node status updater thread could be lost after it restarted

2018-08-29 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596235#comment-16596235
 ] 

Tao Yang commented on YARN-8729:


Attached v1 patch for review.

> Node status updater thread could be lost after it restarted
> ---
>
> Key: YARN-8729
> URL: https://issues.apache.org/jira/browse/YARN-8729
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8729.001.patch
>
>
> Today I found a lost NM whose node status updater thread no longer existed after 
> this thread was restarted. In 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
> flag is not set back to false before {{statusUpdater.start()}} is executed, 
> so if the new thread starts immediately and finds isStopped==true, it 
> will exit without any log.
> Key codes in 
> {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
> {code:java}
>  statusUpdater.join();
>  registerWithRM();
>  statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
>  statusUpdater.start();
>  this.isStopped = false;   // this line should be moved before statusUpdater.start();
>  LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8729) Node status updater thread could be lost after it restarted

2018-08-29 Thread Tao Yang (JIRA)
Tao Yang created YARN-8729:
--

 Summary: Node status updater thread could be lost after it 
restarted
 Key: YARN-8729
 URL: https://issues.apache.org/jira/browse/YARN-8729
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.2.0
Reporter: Tao Yang
Assignee: Tao Yang


Today I found a lost NM whose node status updater thread no longer existed after 
this thread was restarted. In 
{{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}, the isStopped 
flag is not set back to false before {{statusUpdater.start()}} is executed, so 
if the new thread starts immediately and finds isStopped==true, it will 
exit without any log.

Key codes in {{NodeStatusUpdaterImpl#rebootNodeStatusUpdaterAndRegisterWithRM}}:
{code:java}
 statusUpdater.join();
 registerWithRM();
 statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
 statusUpdater.start();
 this.isStopped = false;   // this line should be moved before statusUpdater.start();
 LOG.info("NodeStatusUpdater thread is reRegistered and restarted");

{code}
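
For clarity, a minimal sketch of the corrected ordering described by the inline comment 
above (surrounding fields and methods elided):

{code:java}
// Sketch of the suggested fix: reset isStopped before the new thread is
// started, so the restarted updater can never observe a stale true value.
 statusUpdater.join();
 registerWithRM();
 this.isStopped = false;   // moved before statusUpdater.start()
 statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
 statusUpdater.start();
 LOG.info("NodeStatusUpdater thread is reRegistered and restarted");
{code}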



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8680) YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate

2018-08-29 Thread Pradeep Ambati (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596230#comment-16596230
 ] 

Pradeep Ambati commented on YARN-8680:
--

Hi [~sunilg],

 

This jira is marked as critical for 3.2. I can definitely take it forward if it 
is not feasible to complete it in coming weeks.

> YARN NM: Implement Iterable Abstraction for LocalResourceTrackerstate
> -
>
> Key: YARN-8680
> URL: https://issues.apache.org/jira/browse/YARN-8680
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Reporter: Pradeep Ambati
>Assignee: Pradeep Ambati
>Priority: Critical
> Attachments: YARN-8680.00.patch, YARN-8680.01.patch
>
>
> Similar to YARN-8242, implement iterable abstraction for 
> LocalResourceTrackerState to load completed and in progress resources when 
> needed rather than loading them all at a time for a respective state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8709) intra-queue preemption checker always fail since one under-served queue was deleted

2018-08-29 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596219#comment-16596219
 ] 

Tao Yang commented on YARN-8709:


Attached v1 patch for review.

[~eepayne], [~sunilg], please help to review in your free time.

Thanks!

> intra-queue preemption checker always fail since one under-served queue was 
> deleted
> ---
>
> Key: YARN-8709
> URL: https://issues.apache.org/jira/browse/YARN-8709
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler preemption
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8709.001.patch
>
>
> After some queues deleted, the preemption checker in SchedulingMonitor was 
> always skipped  because of YarnRuntimeException for every run.
> Error logs:
> {noformat}
> ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: 
> Exception raised while executing preemption checker, skip this run..., 
> exception=
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't 
> happen, cannot find TempQueuePerPartition for queueName=1535075839208
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {noformat}
> I think there is something wrong with partitionToUnderServedQueues field in 
> ProportionalCapacityPreemptionPolicy. Items of partitionToUnderServedQueues 
> can be add but never be removed, except rebuilding this policy. For example, 
> once under-served queue "a" is added into this structure, it will always be 
> there and never be removed, intra-queue preemption checker will try to get 
> all queues info for partitionToUnderServedQueues in 
> IntraQueueCandidatesSelector#selectCandidates and will throw 
> YarnRuntimeException if not found. So that after queue "a" is deleted from 
> queue structure, the preemption checker will always fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8728) Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue

2018-08-29 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8728:
---
Attachment: (was: YARN-8728.001.patch)

> Wrong available resource in AllocateResponse when allocating containers to 
> different partitions in the same queue
> -
>
> Key: YARN-8728
> URL: https://issues.apache.org/jira/browse/YARN-8728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> Recently I found that some apps' available resource in AllocateResponse kept 
> changing between two different values. After checking the code, I think 
> {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in 
> {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, while this 
> data should be updated only for the default partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8709) intra-queue preemption checker always fail since one under-served queue was deleted

2018-08-29 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8709:
---
Attachment: YARN-8709.001.patch

> intra-queue preemption checker always fail since one under-served queue was 
> deleted
> ---
>
> Key: YARN-8709
> URL: https://issues.apache.org/jira/browse/YARN-8709
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler preemption
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8709.001.patch
>
>
> After some queues were deleted, the preemption checker in SchedulingMonitor 
> was skipped on every run because of a YarnRuntimeException.
> Error logs:
> {noformat}
> ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: 
> Exception raised while executing preemption checker, skip this run..., 
> exception=
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't 
> happen, cannot find TempQueuePerPartition for queueName=1535075839208
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {noformat}
> I think there is something wrong with the partitionToUnderServedQueues field 
> in ProportionalCapacityPreemptionPolicy. Items can be added to 
> partitionToUnderServedQueues but are never removed unless the policy itself 
> is rebuilt. For example, once under-served queue "a" is added to this 
> structure, it stays there forever. The intra-queue preemption checker looks 
> up every queue referenced by partitionToUnderServedQueues in 
> IntraQueueCandidatesSelector#selectCandidates and throws a 
> YarnRuntimeException if one cannot be found. As a result, after queue "a" is 
> deleted from the queue structure, the preemption checker will always fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8728) Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue

2018-08-29 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596198#comment-16596198
 ] 

Tao Yang commented on YARN-8728:


Attached v1 patch for review.

> Wrong available resource in AllocateResponse when allocating containers to 
> different partitions in the same queue
> -
>
> Key: YARN-8728
> URL: https://issues.apache.org/jira/browse/YARN-8728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8728.001.patch
>
>
> Recently I found that some apps' available resource in AllocateResponse kept 
> changing between two different values. After checking the code, I think 
> {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in 
> {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, while this 
> data should be updated only for the default partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8728) Wrong available resource in AllocateResponse when allocating containers to different partitions in the same queue

2018-08-29 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8728:
---
Summary: Wrong available resource in AllocateResponse when allocating 
containers to different partitions in the same queue  (was: Wrong available 
resource in AllocateResponse in queue with multiple partitions)

> Wrong available resource in AllocateResponse when allocating containers to 
> different partitions in the same queue
> -
>
> Key: YARN-8728
> URL: https://issues.apache.org/jira/browse/YARN-8728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8728.001.patch
>
>
> Recently I found that some apps' available resource in AllocateResponse kept 
> changing between two different values. After checking the code, I think 
> {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in 
> {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, while this 
> data should be updated only for the default partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8728) Wrong available resource in AllocateResponse in queue with multiple partitions

2018-08-29 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8728:
---
Attachment: YARN-8728.001.patch

> Wrong available resource in AllocateResponse in queue with multiple partitions
> --
>
> Key: YARN-8728
> URL: https://issues.apache.org/jira/browse/YARN-8728
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8728.001.patch
>
>
> Recently I found that some apps' available resource in AllocateResponse kept 
> changing between two different values. After checking the code, I think 
> {{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in 
> {{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, while this 
> data should be updated only for the default partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8728) Wrong available resource in AllocateResponse in queue with multiple partitions

2018-08-29 Thread Tao Yang (JIRA)
Tao Yang created YARN-8728:
--

 Summary: Wrong available resource in AllocateResponse in queue 
with multiple partitions
 Key: YARN-8728
 URL: https://issues.apache.org/jira/browse/YARN-8728
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Tao Yang
Assignee: Tao Yang


Recently I found that some apps' available resource in AllocateResponse kept 
changing between two different values. After checking the code, I think 
{{LeafQueue#queueResourceLimitsInfo}} is wrongly updated in 
{{LeafQueue#computeUserLimitAndSetHeadroom}} for all partitions, while this 
data should be updated only for the default partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler

2018-08-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596174#comment-16596174
 ] 

Antal Bálint Steinbach commented on YARN-8468:
--

Thank you [~leftnoteasy] for checking the patch. I uploaded a new one.

1) The ApplicationMaster gets the maximum allocation value here just like in 
the other case, so the AM can handle it as before when submitting a request, 
but I think we still need the normalization/validation of the request in case 
the application did something wrong. Furthermore, there was already a 
validation/normalization step before; it just used only the scheduler-level 
maximums.

2) I reverted the formatting changes. I had unintentionally run 
auto-formatting on SchedulerUtils.

Please let me know if I misunderstood something.

> Limit container sizes per queue in FairScheduler
> 
>
> Key: YARN-8468
> URL: https://issues.apache.org/jira/browse/YARN-8468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Critical
> Attachments: YARN-8468.000.patch, YARN-8468.001.patch, 
> YARN-8468.002.patch, YARN-8468.003.patch, YARN-8468.004.patch, 
> YARN-8468.005.patch, YARN-8468.006.patch, YARN-8468.007.patch, 
> YARN-8468.008.patch
>
>
> When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" 
> to limit the overall size of a container. This applies globally to all 
> containers, cannot be limited per queue, and is not scheduler dependent.
>  
> The goal of this ticket is to allow this value to be set on a per-queue basis.
>  
> The use case: a user has two pools, one for ad hoc jobs and one for 
> enterprise apps, and wants to limit ad hoc jobs to small containers but allow 
> enterprise apps to request as many resources as needed. 
> yarn.scheduler.maximum-allocation-mb would set the default maximum container 
> size for all queues, while the per-queue maximum would be set with the 
> “maxContainerResources” queue config value.
>  
> Suggested solution:
>  
> All the infrastructure is already in the code. We need to do the following:
>  * add the setting to the queue properties for all queue types (parent and 
> leaf); this will cover dynamically created queues.
>  * if the setting is applied on the root, it overrides the scheduler setting, 
> and we should not allow that.
>  * make sure that the queue resource cap cannot be larger than the scheduler 
> max resource cap in the config.
>  * implement getMaximumResourceCapability(String queueName) in the 
> FairScheduler.
>  * implement getMaximumResourceCapability() in both FSParentQueue and 
> FSLeafQueue as follows (see the sketch after this list).
>  * expose the setting in the queue information in the RM web UI.
>  * expose the setting in the metrics etc. for the queue.
>  * write JUnit tests.
>  * update the scheduler documentation.
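> A minimal sketch of the per-queue lookup the list above asks for, using plain 
> Java types instead of the real Resource/FSQueue classes; the class and field 
> names are illustrative, not the actual FairScheduler code.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> final class MaxAllocationSketch {
>   private final long schedulerMaxMb;
>   private final Map<String, Long> queueMaxMb = new HashMap<>();
>
>   MaxAllocationSketch(long schedulerMaxMb) {
>     this.schedulerMaxMb = schedulerMaxMb;
>   }
>
>   void setQueueMax(String queueName, long maxMb) {
>     // A queue cap is never allowed to exceed the scheduler-wide maximum.
>     queueMaxMb.put(queueName, Math.min(maxMb, schedulerMaxMb));
>   }
>
>   long getMaximumResourceCapability(String queueName) {
>     // Queues without their own cap fall back to the scheduler-wide maximum.
>     return queueMaxMb.getOrDefault(queueName, schedulerMaxMb);
>   }
> }
> {code}
> For example, with a scheduler maximum of 16384 MB and "root.adhoc" capped at 
> 4096 MB, getMaximumResourceCapability("root.adhoc") returns 4096 while 
> getMaximumResourceCapability("root.enterprise") returns 16384.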



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8468) Limit container sizes per queue in FairScheduler

2018-08-29 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Bálint Steinbach updated YARN-8468:
-
Attachment: YARN-8468.008.patch

> Limit container sizes per queue in FairScheduler
> 
>
> Key: YARN-8468
> URL: https://issues.apache.org/jira/browse/YARN-8468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Critical
> Attachments: YARN-8468.000.patch, YARN-8468.001.patch, 
> YARN-8468.002.patch, YARN-8468.003.patch, YARN-8468.004.patch, 
> YARN-8468.005.patch, YARN-8468.006.patch, YARN-8468.007.patch, 
> YARN-8468.008.patch
>
>
> When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" 
> to limit the overall size of a container. This applies globally to all 
> containers, cannot be limited per queue, and is not scheduler dependent.
>  
> The goal of this ticket is to allow this value to be set on a per-queue basis.
>  
> The use case: a user has two pools, one for ad hoc jobs and one for 
> enterprise apps, and wants to limit ad hoc jobs to small containers but allow 
> enterprise apps to request as many resources as needed. 
> yarn.scheduler.maximum-allocation-mb would set the default maximum container 
> size for all queues, while the per-queue maximum would be set with the 
> “maxContainerResources” queue config value.
>  
> Suggested solution:
>  
> All the infrastructure is already in the code. We need to do the following:
>  * add the setting to the queue properties for all queue types (parent and 
> leaf); this will cover dynamically created queues.
>  * if the setting is applied on the root, it overrides the scheduler setting, 
> and we should not allow that.
>  * make sure that the queue resource cap cannot be larger than the scheduler 
> max resource cap in the config.
>  * implement getMaximumResourceCapability(String queueName) in the 
> FairScheduler.
>  * implement getMaximumResourceCapability() in both FSParentQueue and 
> FSLeafQueue as follows.
>  * expose the setting in the queue information in the RM web UI.
>  * expose the setting in the metrics etc. for the queue.
>  * write JUnit tests.
>  * update the scheduler documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8726) [UI2] YARN UI2 is not accessible when config.env file failed to load

2018-08-29 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596104#comment-16596104
 ] 

Sunil Govindan commented on YARN-8726:
--

It seems Jenkins is down.

> [UI2] YARN UI2 is not accessible when config.env file failed to load
> 
>
> Key: YARN-8726
> URL: https://issues.apache.org/jira/browse/YARN-8726
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Akhil PB
>Assignee: Akhil PB
>Priority: Critical
> Attachments: YARN-8726.001.patch
>
>
> It is observed that YARN UI2 is not accessible. When UI2 is inspected, it 
> gives the error below:
> {code:java}
> index.html:1 Refused to execute script from 
> 'http://ctr-e138-1518143905142-456429-01-05.hwx.site:8088/ui2/config/configs.env'
>  because its MIME type ('text/plain') is not executable, and strict MIME type 
> checking is enabled.
> yarn-ui.js:219 base url:
> vendor.js:1978 ReferenceError: ENV is not defined
>  at updateConfigs (yarn-ui.js:212)
>  at Object.initialize (yarn-ui.js:218)
>  at vendor.js:824
>  at vendor.js:825
>  at visit (vendor.js:3025)
>  at Object.visit [as default] (vendor.js:3024)
>  at DAG.topsort (vendor.js:750)
>  at Class._runInitializer (vendor.js:825)
>  at Class.runInitializers (vendor.js:824)
>  at Class._bootSync (vendor.js:823)
> onerrorDefault @ vendor.js:1978
> trigger @ vendor.js:2967
> (anonymous) @ vendor.js:3006
> invoke @ vendor.js:626
> flush @ vendor.js:629
> flush @ vendor.js:619
> end @ vendor.js:642
> run @ vendor.js:648
> join @ vendor.js:648
> run.join @ vendor.js:1510
> (anonymous) @ vendor.js:1512
> fire @ vendor.js:230
> fireWith @ vendor.js:235
> ready @ vendor.js:242
> completed @ vendor.js:242
> vendor.js:823 Uncaught ReferenceError: ENV is not defined
>  at updateConfigs (yarn-ui.js:212)
>  at Object.initialize (yarn-ui.js:218)
>  at vendor.js:824
>  at vendor.js:825
>  at visit (vendor.js:3025)
>  at Object.visit [as default] (vendor.js:3024)
>  at DAG.topsort (vendor.js:750)
>  at Class._runInitializer (vendor.js:825)
>  at Class.runInitializers (vendor.js:824)
>  at Class._bootSync (vendor.js:823)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-08-29 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan resolved YARN-8220.
--
Resolution: Done

With Submarine (YARN-8135), we have a better implementation for this. Hence, 
let us close this issue and migrate the enhancements to Submarine.

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch, YARN-8220.004.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-08-29 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-8220:
-
Target Version/s:   (was: 3.2.0)

> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch, YARN-8220.004.patch
>
>
> Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.
> This spec file will help to run Tensorflow on YARN with GPU/Docker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8727) NPE in RouterRMAdminService while stopping service.

2018-08-29 Thread Y. SREENIVASULU REDDY (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596039#comment-16596039
 ] 

Y. SREENIVASULU REDDY commented on YARN-8727:
-

I have attached the patch; please review.

> NPE in RouterRMAdminService while stopping service.
> ---
>
> Key: YARN-8727
> URL: https://issues.apache.org/jira/browse/YARN-8727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, router
>Affects Versions: 3.1.1
>Reporter: Y. SREENIVASULU REDDY
>Assignee: Y. SREENIVASULU REDDY
>Priority: Major
> Attachments: YARN-8727.001.patch
>
>
> While stopping the service, the Router throws an NPE:
> {noformat}
> 2018-08-23 22:52:00,596 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService 
> failed in state STOPPED
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService.serviceStop(RouterRMAdminService.java:143)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at 
> org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158)
> at 
> org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
> at 
> org.apache.hadoop.yarn.server.router.Router.serviceStop(Router.java:128)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:202)
> at org.apache.hadoop.yarn.server.router.Router.main(Router.java:182)
> {noformat}
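> A hedged, self-contained sketch of the usual fix for this kind of failure 
> (not taken from the attached patch): guard serviceStop() against fields that 
> may still be null when the service is torn down after a failed start. The 
> field name below is hypothetical and only illustrates the pattern.
> {code:java}
> import java.util.Map;
>
> class ServiceStopGuardSketch {
>   private Map<String, Object> userPipelineMap; // may never be initialized
>
>   void serviceStop() {
>     // The null check prevents the NullPointerException seen above when
>     // stop() runs even though start() failed before initialization.
>     if (userPipelineMap != null) {
>       userPipelineMap.clear();
>       userPipelineMap = null;
>     }
>   }
> }
> {code}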



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8727) NPE in RouterRMAdminService while stopping service.

2018-08-29 Thread Y. SREENIVASULU REDDY (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y. SREENIVASULU REDDY updated YARN-8727:

Attachment: YARN-8727.001.patch

> NPE in RouterRMAdminService while stopping service.
> ---
>
> Key: YARN-8727
> URL: https://issues.apache.org/jira/browse/YARN-8727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation, router
>Affects Versions: 3.1.1
>Reporter: Y. SREENIVASULU REDDY
>Assignee: Y. SREENIVASULU REDDY
>Priority: Major
> Attachments: YARN-8727.001.patch
>
>
> While stopping the service, the Router throws an NPE:
> {noformat}
> 2018-08-23 22:52:00,596 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService 
> failed in state STOPPED
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.router.rmadmin.RouterRMAdminService.serviceStop(RouterRMAdminService.java:143)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at 
> org.apache.hadoop.service.CompositeService.stop(CompositeService.java:158)
> at 
> org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
> at 
> org.apache.hadoop.yarn.server.router.Router.serviceStop(Router.java:128)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
> at 
> org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
> at 
> org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:202)
> at org.apache.hadoop.yarn.server.router.Router.main(Router.java:182)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8569) Create an interface to provide cluster information to application

2018-08-29 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596014#comment-16596014
 ] 

Wangda Tan commented on YARN-8569:
--

[~eyang], 
{quote}Unless malicious user already hacked into yarn user account and populate 
data as yarn user, there is no easy parameter hacking to container-executor to 
trigger exploits.
{quote}
 

There have been plenty of debates before about whether the yarn user should be 
treated as root or not. We have seen issues where container-executor (c-e) let 
the yarn user manipulate other users' directories, or even escalate directly 
to the root user. All of these issues became CVEs.

 
{quote}This is the reason that this solution is invented to lower the bar of 
writing clustering software for Hadoop.
{quote}
 

It would be helpful if you could share some real-world examples.

From YARN's design perspective, ideally all NM/RM logic should be as general 
as possible, and all service-related concerns should be handled by the service 
framework, such as the API server or the ServiceMaster. I really don't like 
the idea of adding a service-specific API to the NM API.

If you do think updating the service spec json file is important, another 
approach could be:

1) The ServiceMaster mounts a local directory (under the container's local 
dir) when launching the docker container (for example: ./service-info -> 
/service/sys/fs/)

2) The ServiceMaster requests to re-localize the new service spec json file to 
the ./service-info folder.

 

> Create an interface to provide cluster information to application
> -
>
> Key: YARN-8569
> URL: https://issues.apache.org/jira/browse/YARN-8569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8569.001.patch, YARN-8569.002.patch
>
>
> Some programs require container hostnames to be known for the application to 
> run. For example, distributed TensorFlow requires a launch_command that looks 
> like:
> {code}
> # On ps0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=0
> # On ps1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=ps --task_index=1
> # On worker0.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=0
> # On worker1.example.com:
> $ python trainer.py \
>  --ps_hosts=ps0.example.com:,ps1.example.com: \
>  --worker_hosts=worker0.example.com:,worker1.example.com: \
>  --job_name=worker --task_index=1
> {code}
> This is a bit cumbersome to orchestrate via Distributed Shell or the YARN 
> services launch_command. In addition, the dynamic parameters do not work with 
> the YARN flex command. This is the classic pain point for application 
> developers who try to automate system environment settings as parameters to 
> the end-user application.
> It would be great if the YARN Docker integration could provide a simple 
> option to expose the hostnames of the YARN service via a mounted file. The 
> file content would be updated when a flex command is performed. This would 
> allow application developers to consume system environment settings via a 
> standard interface. It is like /proc/devices for Linux, but for Hadoop. This 
> may involve updating a file in the distributed cache and allowing the file to 
> be mounted via container-executor.
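> A purely illustrative sketch of the consumption side of this proposal: an 
> application re-reads a mounted file listing the current component hosts and 
> rebuilds its --ps_hosts/--worker_hosts style arguments from it. The path 
> "./service-info/hosts" and the one-host-per-line format are assumptions for 
> illustration; this ticket does not define them.
> {code:java}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.List;
>
> class ClusterInfoFileSketch {
>   static String hostList(String path) throws IOException {
>     // Re-read on every call so updates made by a flex operation are picked up.
>     List<String> hosts = Files.readAllLines(Paths.get(path));
>     return String.join(",", hosts);
>   }
>
>   public static void main(String[] args) throws IOException {
>     System.out.println("--worker_hosts=" + hostList("./service-info/hosts"));
>   }
> }
> {code}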



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org