[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others
[ https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831180#comment-13831180 ] Sandy Ryza commented on YARN-1434: -- Srikanth, this would happen if the AM can return containers faster than the RM can assign them. The AM would then, as Carlo said, continually be the one that "deserves" a container. It should definitely be possible to make this problem show up under the right circumstances. When I have time I will try to verify whether YARN-1010 fully eliminates the problem. > Single Job can affect fairshare of others > - > > Key: YARN-1434 > URL: https://issues.apache.org/jira/browse/YARN-1434 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Carlo Curino >Priority: Minor > > A job receiving containers and deciding not to use them and yielding them > back in the next heartbeat could significantly affect the amount of resources > given to other jobs. > This is because by yielding containers back the job appears always to be > under-capacity (more than others) so it is picked to be the next to receive > containers. > Observed by Robert Grandl, to be independently confirmed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others
[ https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830368#comment-13830368 ] Carlo Curino commented on YARN-1434: Srikanth, what we observed (again in a noise environment, so to be validated) is that the AM returning containers is maintaining is position as "under capacity" w.r.t. other machines, since it returned a bunch of containers, so it will be picked again as highest in priority. As a consequence it is wasting containers in a way that in our small setup was harming other jobs opportunity to get access to containers. If Robert has few spare cycles, he will try to make a minimal patch to the MR AM that make it behave maliciously and try again on the CapacityScheduler, and maybe Sandy could try it with the fair scheduler? If we confirm this is indeed a problem, and that is substantial for non-trivial scenarios (we noticed it for 2 jobs in 2 queues on 10 machines, not sure whether has impact at scale), we might need to tweak the schedulers logics to penalize users that yield back lots of containers (e.g., accounting for those containers against the user quota for n seconds or something). > Single Job can affect fairshare of others > - > > Key: YARN-1434 > URL: https://issues.apache.org/jira/browse/YARN-1434 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Carlo Curino >Priority: Minor > > A job receiving containers and deciding not to use them and yielding them > back in the next heartbeat could significantly affect the amount of resources > given to other jobs. > This is because by yielding containers back the job appears always to be > under-capacity (more than others) so it is picked to be the next to receive > containers. > Observed by Robert Grandl, to be independently confirmed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others
[ https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830129#comment-13830129 ] Srikanth Kandula commented on YARN-1434: Sandy Ryza, I get it up to the "receive the next container that the RM allocates". But, why would this starve other AMs? Shouldn't the RM offer some other containers to these other jobs if the cluster is idle? I can see how some containers may be just tossing back and forth between the RM and the picky job. But do not see why other jobs receive less share than they would because of the picky job. > Single Job can affect fairshare of others > - > > Key: YARN-1434 > URL: https://issues.apache.org/jira/browse/YARN-1434 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Carlo Curino >Priority: Minor > > A job receiving containers and deciding not to use them and yielding them > back in the next heartbeat could significantly affect the amount of resources > given to other jobs. > This is because by yielding containers back the job appears always to be > under-capacity (more than others) so it is picked to be the next to receive > containers. > Observed by Robert Grandl, to be independently confirmed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others
[ https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829761#comment-13829761 ] Sandy Ryza commented on YARN-1434: -- This seems possible. To further spell this out: Imagine an AM that, by fairness, receives a container on an NM heartbeat. If it retrieves the container from the RM and gives it back before any other NM can heartbeat, it will also, by fairness, receive the next container that the RM allocates. In this way, it could starve all the other applications on the cluster. An AM that deserves more than a single container could do this with a slower heartbeat interval. For the Fair Scheduler, YARN-1010, which decouples container allocations from node heartbeats, should solve this in most cases. With it, it is nearly impossible for an AM to return containers before the RM allocates other free space to other applications. > Single Job can affect fairshare of others > - > > Key: YARN-1434 > URL: https://issues.apache.org/jira/browse/YARN-1434 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Carlo Curino >Priority: Minor > > A job receiving containers and deciding not to use them and yielding them > back in the next heartbeat could significantly affect the amount of resources > given to other jobs. > This is because by yielding containers back the job appears always to be > under-capacity (more than others) so it is picked to be the next to receive > containers. > Observed by Robert Grandl, to be independently confirmed. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1434) Single Job can affect fairshare of others
[ https://issues.apache.org/jira/browse/YARN-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829377#comment-13829377 ] Carlo Curino commented on YARN-1434: This has been observed while modifying the mapreduce AM behavior for other reasons. If the AM aggressively returns containers, it seems to be able to create the illusion to be under-capacity while wasting resources for everyone. A second job running in a separate queue (which was supposed to receive 50% of the cluster resources) was starved (only getting about 30% of the resources). This should be confirmed independently as the environment we observed this in had too much going on (i.e., this might be a false positive). If confirmed, this might be quite bad, as a single malevolent AM could affect the cluster utilization possibly by a lot. [~sandyr], [~acmurthy] thoughts? > Single Job can affect fairshare of others > - > > Key: YARN-1434 > URL: https://issues.apache.org/jira/browse/YARN-1434 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Carlo Curino >Priority: Minor > > A job receiving containers and deciding not to use them and yielding them > back in the next heartbeat could significantly affect the amount of resources > given to other jobs. > This is because by yielding containers back the job appears always to be > under-capacity (more than others) so it is picked to be the next to receive > containers. > Observed by Robert Grandl, to be independently confirmed. -- This message was sent by Atlassian JIRA (v6.1#6144)