[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-22 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137446#comment-16137446
 ] 

Steven Rand commented on YARN-6960:
---

Thanks, Daniel. Having thought about this some more, I don't think that either 
of the two patches I've posted is a good solution. In the first patch, inactive 
queues have fair shares of zero, and AM containers are subject to preemption 
even when running in high-priority queues. And in the second patch, 
applications running in idle queues define what their fair shares are 
irrespective of cluster-side settings, which doesn't make sense.

I'll think about this some more and try to come up with a better idea, but I'd 
also be quite interested in hearing your opinion and those of others. 

> definition of active queue allows idle long-running apps to distort fair 
> shares
> ---
>
> Key: YARN-6960
> URL: https://issues.apache.org/jira/browse/YARN-6960
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler
> Affects Versions: 2.8.1, 3.0.0-alpha4
> Reporter: Steven Rand
> Assignee: Steven Rand
> Attachments: YARN-6960.001.patch, YARN-6960.002.patch
>
>
> YARN-2026 introduced the notion of only considering active queues when 
> computing the fair share of each queue. The definition of an active queue is 
> a queue with at least one runnable app:
> {code}
>   public boolean isActive() {
>     return getNumRunnableApps() > 0;
>   }
> {code}
> One case that this definition of activity doesn't account for is that of 
> long-running applications that scale dynamically. Such an application might 
> request many containers when jobs are running, but scale down to very few 
> containers, or only the AM container, when no jobs are running.
> Even when such an application has scaled down to a negligible amount of 
> demand and utilization, the queue that it's in is still considered to be 
> active, which defeats the purpose of YARN-2026. For example, consider this 
> scenario:
> 1. We have queues {{root.a}}, {{root.b}}, {{root.c}}, and {{root.d}}, all of 
> which have the same weight.
> 2. Queues {{root.a}} and {{root.b}} contain long-running applications that 
> currently have only one container each (the AM).
> 3. An application in queue {{root.c}} starts, and uses the whole cluster 
> except for the small amount in use by {{root.a}} and {{root.b}}. An 
> application in {{root.d}} starts, and has a high enough demand to be able to 
> use half of the cluster. Because all four queues are active, the app in 
> {{root.d}} can only preempt the app in {{root.c}} up to roughly 25% of the 
> cluster's resources, while the app in {{root.c}} keeps about 75%.
> Ideally in this example, the app in {{root.d}} would be able to preempt the 
> app in {{root.c}} up to 50% of the cluster, which would be possible if the 
> idle apps in {{root.a}} and {{root.b}} didn't cause those queues to be 
> considered active.
> One way to address this is to update the definition of an active queue to be 
> a queue containing 1 or more non-AM containers. This way if all apps in a 
> queue scale down to only the AM, other queues' fair shares aren't affected.
> The benefit of this approach is that it's quite simple. The downside is that 
> it doesn't account for apps that are idle and using almost no resources, but 
> still have at least one non-AM container.
> There are a couple of other options that seem plausible to me, but they're 
> much more complicated, and it seems to me that this proposal makes good 
> progress while adding minimal extra complexity.
> Does this seem like a reasonable change? I'm certainly open to better ideas 
> as well.
> Thanks,
> Steve
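
The arithmetic in the scenario above can be reproduced with a minimal sketch. It assumes a deliberately simplified model in which the cluster is split among active queues purely in proportion to weight; the class and method names (FairShareSketch, fairShares, scenario) are hypothetical and are not part of the FairScheduler code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of instantaneous fair shares. It ignores
// min/max resources and demand caps, which the real scheduler also considers.
final class FairShareSketch {

  // A queue reduced to the two properties this example needs.
  static final class Queue {
    final double weight;
    final boolean active;

    Queue(double weight, boolean active) {
      this.weight = weight;
      this.active = active;
    }
  }

  // Splits the cluster (1.0) among active queues in proportion to their
  // weights; inactive queues get a fair share of 0.
  static Map<String, Double> fairShares(Map<String, Queue> queues) {
    double activeWeight = 0.0;
    for (Queue q : queues.values()) {
      if (q.active) {
        activeWeight += q.weight;
      }
    }
    Map<String, Double> shares = new LinkedHashMap<>();
    for (Map.Entry<String, Queue> e : queues.entrySet()) {
      Queue q = e.getValue();
      shares.put(e.getKey(), q.active ? q.weight / activeWeight : 0.0);
    }
    return shares;
  }

  // The four equally weighted queues from the description. root.a and root.b
  // hold only idle AMs; the flag says whether that still counts as active.
  static Map<String, Queue> scenario(boolean idleQueuesActive) {
    Map<String, Queue> queues = new LinkedHashMap<>();
    queues.put("root.a", new Queue(1.0, idleQueuesActive));
    queues.put("root.b", new Queue(1.0, idleQueuesActive));
    queues.put("root.c", new Queue(1.0, true));
    queues.put("root.d", new Queue(1.0, true));
    return queues;
  }

  public static void main(String[] args) {
    System.out.println(fairShares(scenario(true)));  // current definition
    System.out.println(fairShares(scenario(false))); // idle queues excluded
  }
}
```

With the idle queues counted as active, every queue's share is 0.25, so root.d can only claim a quarter of the cluster. With them excluded, root.c and root.d each get 0.5.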



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-22 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137152#comment-16137152
 ] 

Daniel Templeton commented on YARN-6960:


I'll take a look when I get a chance.


[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-20 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134416#comment-16134416
 ] 

Steven Rand commented on YARN-6960:
---

[~dan...@cloudera.com], I've uploaded a patch proposing a new definition of 
queue activity. It also needs tests, but I wanted to first see how the 
community feels about this change, and revise it as necessary based on feedback 
before writing tests for it.

My understanding of a queue's demand is that it's the cumulative current usage 
of all apps in the queue plus the cumulative requested additional resources for 
all apps in the queue. Therefore if no apps are requesting additional 
resources, the demand will be equal to the usage of the AMs. Then, as soon as 
any app attempts to do anything, its demand will be greater than the AM usage, 
and the queue will become active.
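
The demand-based check described above might look roughly like the following. This is only a sketch under stated assumptions: memory in MB stands in for YARN's multi-dimensional resources, and the names (DemandActivitySketch, App, demandMb, amUsageMb) are hypothetical rather than taken from the attached patch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a queue is active only when its demand exceeds the
// usage of its AM containers. Memory in MB is a stand-in for Resource.
final class DemandActivitySketch {

  static final class App {
    final long amUsageMb;    // memory held by the AM container
    final long otherUsageMb; // memory held by all non-AM containers
    final long pendingMb;    // memory requested but not yet allocated

    App(long amUsageMb, long otherUsageMb, long pendingMb) {
      this.amUsageMb = amUsageMb;
      this.otherUsageMb = otherUsageMb;
      this.pendingMb = pendingMb;
    }
  }

  // Demand = current usage plus outstanding requests, summed over all apps.
  static long demandMb(List<App> apps) {
    long total = 0;
    for (App app : apps) {
      total += app.amUsageMb + app.otherUsageMb + app.pendingMb;
    }
    return total;
  }

  // Usage attributable to AM containers only.
  static long amUsageMb(List<App> apps) {
    long total = 0;
    for (App app : apps) {
      total += app.amUsageMb;
    }
    return total;
  }

  // Active as soon as any app uses or requests anything beyond its AM.
  static boolean isActive(List<App> apps) {
    return demandMb(apps) > amUsageMb(apps);
  }

  public static void main(String[] args) {
    List<App> idle = Arrays.asList(new App(1024, 0, 0));
    List<App> requesting = Arrays.asList(new App(1024, 0, 4096));
    System.out.println(isActive(idle));       // false
    System.out.println(isActive(requesting)); // true
  }
}
```

Note that under this check an app that is idle but still holds one non-AM container keeps its queue active, which is the downside already mentioned in the description.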

I've tested this patch and it seems to have the desired effect. Going back to 
the example in the description, {{root.c}} and {{root.d}} have equal fair 
shares despite the idle applications in {{root.a}} and {{root.b}}.


[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118146#comment-16118146
 ] 

Steven Rand commented on YARN-6960:
---

Yep, that concern is definitely valid. I wrote a patch that implements this 
definition of activity, and ran into exactly the problem you're describing 
while testing it. A new proposal then would be that a leaf queue is active if 
either of these conditions is met:

* There is at least one non-AM container running in the queue
* The cumulative demand of applications in the queue is greater than zero

That way, in the example you give above, the fair share of {{root.a}} becomes 
1/3 as soon as it attempts to run another job.
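
The two-condition proposal can be sketched as below. The assumptions are mine: "demand" is modeled as outstanding (not yet allocated) requests in MB, all queues have equal weight, and the names (LeafQueueActivitySketch, LeafQueue, fairShare) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the proposed rule: a leaf queue is active if it runs
// at least one non-AM container OR has outstanding demand.
final class LeafQueueActivitySketch {

  static final class LeafQueue {
    final String name;
    final int nonAmContainers;  // running containers other than AMs
    final long pendingDemandMb; // assumed meaning of "demand": unmet requests

    LeafQueue(String name, int nonAmContainers, long pendingDemandMb) {
      this.name = name;
      this.nonAmContainers = nonAmContainers;
      this.pendingDemandMb = pendingDemandMb;
    }

    boolean isActive() {
      return nonAmContainers > 0 || pendingDemandMb > 0;
    }
  }

  // Equal weights assumed: each active queue gets 1/(number of active
  // queues) of the cluster, and inactive queues get 0.
  static double fairShare(LeafQueue target, List<LeafQueue> all) {
    if (!target.isActive()) {
      return 0.0;
    }
    int active = 0;
    for (LeafQueue q : all) {
      if (q.isActive()) {
        active++;
      }
    }
    return 1.0 / active;
  }

  public static void main(String[] args) {
    LeafQueue c = new LeafQueue("root.c", 50, 0);
    LeafQueue d = new LeafQueue("root.d", 50, 0);
    // A newly submitted app in root.a: no containers yet, only a request.
    LeafQueue a = new LeafQueue("root.a", 0, 1024);
    System.out.println(fairShare(a, Arrays.asList(a, c, d)));
  }
}
```

Because the pending request alone makes root.a active, its share is one third even before its AM has started, which is what lets it preempt its way into the cluster.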

Backing up a step to the use case: we have interactive Spark applications that 
are expected to return results to the user within seconds, or at worst a few 
minutes (assuming that the query is reasonable). We 
don't want to have to create a new {{SparkContext}} and upload + localize JARs 
for each query, since that would inflate query execution time, so one of these 
applications will keep the same {{SparkContext}} around indefinitely, and will 
thus be a long-running YARN application. When one of these apps isn't running 
any queries/jobs, it'll scale down its executor count to make room for other 
YARN applications. So sometimes we wind up with multiple YARN applications with 
minimal resource usage and no demand, and we've observed that this causes 
unequal distribution of resources between other running applications, even 
though they're in equally weighted queues. The example in the description is 
kind of silly/simplistic, but it's essentially what we see happen.


[jira] [Commented] (YARN-6960) definition of active queue allows idle long-running apps to distort fair shares

2017-08-07 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16116837#comment-16116837
 ] 

Daniel Templeton commented on YARN-6960:


I'd be curious to understand the use case where you're running into this issue. 
My main concern with that fix is that an app that's entering an inactive queue 
will not be able to preempt its way into running. In your example, assume we 
kill the jobs in root.a and root.b, so that the apps in root.c and root.d share 
the cluster 50/50.  Now we submit a new app to root.a. Since all we have is an 
AM until the AM can run and request other containers, root.a's fair share will 
remain 0, and the app in root.a will never be able to preempt the apps in 
root.c or root.d.
