[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others

2013-11-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831180#comment-13831180
 ] 

Sandy Ryza commented on YARN-1434:
--

Srikanth, this would happen if the AM can return containers faster than the RM 
can assign them.  The AM would then, as Carlo said, continually be the one that 
"deserves" a container.

It should definitely be possible to make this problem show up under the right 
circumstances.  When I have time I will try to verify whether YARN-1010 fully 
eliminates the problem.

> Single Job can affect fairshare of others
> -
>
> Key: YARN-1434
> URL: https://issues.apache.org/jira/browse/YARN-1434
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Carlo Curino
>Priority: Minor
>
> A job receiving containers and deciding not to use them and yielding them 
> back in the next heartbeat could significantly affect the amount of resources 
> given to other jobs. 
> This is because by yielding containers back the job appears always to be 
> under-capacity (more than others) so it is picked to be the next to receive 
> containers.
> Observed by Robert Grandl, to be independently confirmed.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others

2013-11-22 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830368#comment-13830368
 ] 

Carlo Curino commented on YARN-1434:


Srikanth, what we observed (again in a noisy environment, so to be validated) 
is that the AM returning containers maintains its position as "under 
capacity" w.r.t. other machines, since it returned a bunch of containers, so it 
will be picked again as highest in priority. As a consequence, it wastes 
containers in a way that, in our small setup, was harming other jobs' 
opportunity to get access to containers. 

If Robert has a few spare cycles, he will try to make a minimal patch to the MR 
AM that makes it behave maliciously and try again on the CapacityScheduler, and 
maybe Sandy could try it with the Fair Scheduler? 

If we confirm this is indeed a problem, and that it is substantial for 
non-trivial scenarios (we noticed it for 2 jobs in 2 queues on 10 machines, not 
sure whether it has an impact at scale), we might need to tweak the schedulers' 
logic to penalize users that yield back lots of containers (e.g., accounting 
for those containers against the user quota for n seconds or something).
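
To make the penalty idea concrete, here is a toy sketch, not scheduler code. 
Everything in it is an assumption for illustration: the function name 
simulate_with_penalty, the two app names, one container offered per heartbeat, 
and a picky app that hands every container straight back. A returned container 
stays charged to the picky app for penalty_ticks heartbeats.

```python
from collections import deque


def simulate_with_penalty(heartbeats, penalty_ticks, capacity=10):
    """Toy model (hypothetical): returned containers stay charged for a while."""
    held = {"normal": 0, "picky": 0}
    penalties = deque()  # expiry tick of each returned container still charged
    free = capacity
    for tick in range(heartbeats):
        # Forget penalties whose charge period has ended.
        while penalties and penalties[0] <= tick:
            penalties.popleft()
        if free == 0:
            continue
        # Usage as the scheduler sees it: held containers plus active penalties.
        charged = {
            "normal": held["normal"],
            "picky": held["picky"] + len(penalties),
        }
        # Pick the app with the lowest charged usage; ties go to "normal".
        winner = min(charged, key=charged.get)
        if winner == "picky":
            # The picky app returns the container at once, so the container is
            # free again, but it remains charged until the penalty expires.
            penalties.append(tick + penalty_ticks)
        else:
            held["normal"] += 1
            free -= 1
    return held


print(simulate_with_penalty(100, penalty_ticks=0))
print(simulate_with_penalty(100, penalty_ticks=5))
```

In this model the normal app stays stuck at 1 container with no penalty and 
reaches 5 of 10 with a 5-heartbeat penalty. The numbers only illustrate the 
direction of the effect and say nothing about a real cluster.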




[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others

2013-11-22 Thread Srikanth Kandula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830129#comment-13830129
 ] 

Srikanth Kandula commented on YARN-1434:


Sandy Ryza,

I get it up to the "receive the next container that the RM allocates".  But, 
why would this starve other AMs? Shouldn't the RM offer some other containers 
to these other jobs if the cluster is idle? 

I can see how some containers may just be tossed back and forth between the RM 
and the picky job. But I do not see why other jobs would receive less than 
their share because of the picky job.



[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others

2013-11-21 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829761#comment-13829761
 ] 

Sandy Ryza commented on YARN-1434:
--

This seems possible.  To further spell this out:
Imagine an AM that, by fairness, receives a container on an NM heartbeat.  If 
it retrieves the container from the RM and gives it back before any other NM 
can heartbeat, it will also, by fairness, receive the next container that the 
RM allocates.  In this way, it could starve all the other applications on the 
cluster.  An AM that deserves more than a single container could do this with a 
slower heartbeat interval.

For the Fair Scheduler, YARN-1010, which decouples container allocations from 
node heartbeats, should solve this in most cases.  With it, it is nearly 
impossible for an AM to return containers before the RM allocates other free 
space to other applications.
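
The scenario above can be sketched as a toy simulation. This is not YARN code: 
the function name simulate, the app names, and the rule of one container 
offered per node heartbeat to the app holding the fewest containers are all 
assumptions made for illustration.

```python
def simulate(heartbeats, picky_returns, capacity=10):
    """Toy model (hypothetical): one container is offered per node heartbeat."""
    held = {"normal": 0, "picky": 0}
    free = capacity
    for _ in range(heartbeats):
        if free == 0:
            break
        # "By fairness", the app holding the fewest containers is served next.
        # Ties go to "normal", which is the first key in the dict.
        winner = min(held, key=held.get)
        held[winner] += 1
        free -= 1
        if winner == "picky" and picky_returns:
            # The picky AM gives the container back before the next heartbeat,
            # so it again looks like the most under-served application.
            held["picky"] -= 1
            free += 1
    return held


print(simulate(100, picky_returns=True))
print(simulate(100, picky_returns=False))
```

With picky_returns=True the normal app gets a single container and then never 
wins another heartbeat, even though ties favor it. With picky_returns=False the 
two apps split the 10 containers evenly.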



[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others

2013-11-21 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829377#comment-13829377
 ] 

Carlo Curino commented on YARN-1434:


This has been observed while modifying the MapReduce AM behavior for other 
reasons. If the AM aggressively returns containers, it seems to be able to 
create the illusion of being under-capacity while wasting resources for everyone. 
A second job running in a separate queue (which was supposed to receive 50% of 
the cluster resources) was starved (only getting about 30% of the resources). 
This should be confirmed independently as the environment we observed this in 
had too much going on (i.e., this might be a false positive). 

If confirmed, this might be quite bad, as a single malevolent AM could affect 
the cluster utilization possibly by a lot.
  
[~sandyr], [~acmurthy]  thoughts?
