[ 
https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387425#comment-14387425
 ] 

Sandy Ryza commented on YARN-3415:
----------------------------------

This looks mostly reasonable.  A few comments:
* In FSAppAttempt, can we change the "If this container is used to run AM" 
comment to "If not running unmanaged, the first container we allocate is always 
the AM. Update the leaf queue's AM usage"?
* The four lines of comment in FSLeafQueue could be reduced to "If isAMRunning 
is true, we're not running an unmanaged AM."
* Would it make sense to move the call to setAMResource that's currently in 
FairScheduler next to the call to getQueue().addAMResourceUsage() so that the 
queue and attempt resource usage get updated at the same time?
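
To make the last suggestion concrete, here is a minimal sketch of what updating both sides in one place could look like. It is a toy model, not the patch: ToyLeafQueue, ToyAppAttempt and onContainerAllocated are hypothetical stand-ins, and only the names addAMResourceUsage and setAMResource are borrowed from the discussion above.

```java
// Toy model for illustration only; these are NOT the Hadoop classes.
final class AmAccountingSketch {

    /** Hypothetical stand-in for a leaf queue that tracks AM memory in MB. */
    static final class ToyLeafQueue {
        private int amResourceUsageMb;

        void addAMResourceUsage(int mb) {
            amResourceUsageMb += mb;
        }

        int getAmResourceUsageMb() {
            return amResourceUsageMb;
        }
    }

    /** Hypothetical stand-in for an application attempt. */
    static final class ToyAppAttempt {
        private final ToyLeafQueue queue;
        private final boolean unmanagedAM;
        private boolean amRunning;
        private int amResourceMb;

        ToyAppAttempt(ToyLeafQueue queue, boolean unmanagedAM) {
            this.queue = queue;
            this.unmanagedAM = unmanagedAM;
        }

        void setAMResource(int mb) {
            amResourceMb = mb;
        }

        int getAmResourceMb() {
            return amResourceMb;
        }

        /** Assumed hook that runs once for every container handed to this attempt. */
        void onContainerAllocated(int containerMb) {
            // If not running unmanaged, the first container we allocate is
            // always the AM. Update the attempt and the leaf queue together.
            if (!unmanagedAM && !amRunning) {
                amRunning = true;
                setAMResource(containerMb);
                queue.addAMResourceUsage(containerMb);
            }
        }
    }

    public static void main(String[] args) {
        ToyLeafQueue queue = new ToyLeafQueue();
        ToyAppAttempt attempt = new ToyAppAttempt(queue, false);
        attempt.onContainerAllocated(1024); // treated as the AM container
        attempt.onContainerAllocated(4096); // ordinary container, not charged
        System.out.println(queue.getAmResourceUsageMb()); // prints 1024
    }
}
```

Keeping the two updates in the same branch means the attempt's AM resource and the queue's AM usage cannot drift apart if one call site is later changed.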


> Non-AM containers can be counted towards amResourceUsage of a fairscheduler 
> queue
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3415.000.patch
>
>
> We encountered this problem while running a Spark cluster. The 
> amResourceUsage for a queue became artificially high and then the cluster got 
> deadlocked because the maxAMShare constraint kicked in and no new AM got 
> admitted to the cluster.
> I have described the problem in detail here: 
> https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary - the condition for adding the container's memory towards 
> amResourceUsage is fragile. It depends on the number of live containers 
> belonging to the app. We saw that the Spark AM went down without explicitly 
> releasing its requested containers and then the memory of one of those 
> containers was counted towards amResource.
> cc - [~sandyr]
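
The failure mode described above can be reproduced in a small toy model. This is only a sketch under assumptions: the class and method names are hypothetical, memory is a plain int in MB, and the "count-based" rule is a simplified reading of "depends on the number of live containers", not a copy of the scheduler code. The "flag-based" rule is one possible alternative, assumed here for contrast.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only; these are NOT the Hadoop classes.
final class AmUsageSketch {

    /** Hypothetical queue: memory in MB currently charged as AM usage. */
    static final class ToyQueue {
        int amResourceUsageMb;
    }

    /** Hypothetical attempt: live container sizes plus an "AM started" flag. */
    static final class ToyAttempt {
        final List<Integer> liveContainersMb = new ArrayList<>();
        boolean amRunning;
    }

    /** Fragile rule: a container is charged as the AM if it is the only live one. */
    static void allocateCountBased(ToyQueue q, ToyAttempt a, int mb) {
        a.liveContainersMb.add(mb);
        if (a.liveContainersMb.size() == 1) {
            q.amResourceUsageMb += mb;
        }
    }

    /** Assumed alternative: only the first container of the attempt is the AM. */
    static void allocateFlagBased(ToyQueue q, ToyAttempt a, int mb) {
        a.liveContainersMb.add(mb);
        if (!a.amRunning) {
            a.amRunning = true;
            q.amResourceUsageMb += mb;
        }
    }

    /** The AM container (index 0 in this model) exits and its usage is released. */
    static void amExits(ToyQueue q, ToyAttempt a) {
        int mb = a.liveContainersMb.remove(0);
        q.amResourceUsageMb -= mb;
    }

    /** AM starts, AM dies with a request still pending, then the request is granted. */
    static int simulate(boolean countBased) {
        ToyQueue q = new ToyQueue();
        ToyAttempt a = new ToyAttempt();
        if (countBased) allocateCountBased(q, a, 1024); else allocateFlagBased(q, a, 1024);
        amExits(q, a);
        if (countBased) allocateCountBased(q, a, 4096); else allocateFlagBased(q, a, 4096);
        return q.amResourceUsageMb;
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));  // prints 4096: non-AM container charged as AM
        System.out.println(simulate(false)); // prints 0: leftover container not charged
    }
}
```

With the count-based rule, the leftover 4096 MB container is the only live container when it arrives, so it is charged to the queue's AM usage even though no AM is running.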



