[ https://issues.apache.org/jira/browse/SLIDER-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901995#comment-14901995 ]

Youjie Chen commented on SLIDER-939:
------------------------------------

Hi Steve,
How do we handle reserved containers? If there are container requests that put 
a YARN node into the reserved state, it seems that:
1) With the Capacity Scheduler's default queue (FIFO), YARN blocks all other 
container requests. This should be the case in the description above.
2) With the Fair Scheduler, subsequent MapReduce jobs can run on the 
unreserved nodes (but the reserved nodes can never be used in this case).
HBase does not exhibit this problem; it seems HBase does not reserve 
containers. I am not sure exactly what causes nodes to become reserved, but in 
the above case, do we have a way to unreserve the nodes when flexing down?
Thanks!
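
As a sketch of one possible direction (this is not Slider's actual fix, and the class/method names below are illustrative): when flexing down, the AM could withdraw its unsatisfied requests through the real YARN client API `AMRMClient.getMatchingRequests` / `AMRMClient.removeContainerRequest`, so the RM stops reserving nodes for requests that can no longer be met.

```java
import java.util.Collection;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Hypothetical helper, not part of Slider: cancels outstanding container
// requests so the scheduler does not keep a node reserved for them.
public class FlexDownHelper {

  /**
   * Remove up to {@code excess} outstanding (not yet satisfied) requests
   * matching the given priority and capability.
   */
  static int cancelOutstanding(AMRMClient<ContainerRequest> client,
                               Priority priority, Resource capability,
                               int excess) {
    int cancelled = 0;
    // getMatchingRequests returns the requests this AM has issued but the RM
    // has not yet satisfied for the (priority, location, capability) tuple;
    // "*" is the any-location resource name.
    List<? extends Collection<ContainerRequest>> matches =
        client.getMatchingRequests(priority, "*", capability);
    for (Collection<ContainerRequest> group : matches) {
      for (ContainerRequest request : group) {
        if (cancelled >= excess) {
          return cancelled;
        }
        // Tell the RM to forget this ask, instead of leaving it pending.
        client.removeContainerRequest(request);
        cancelled++;
      }
    }
    return cancelled;
  }
}
```

If the reservation was created for one of these pending asks, removing the ask should let the scheduler release the reservation on the next heartbeat; I have not verified that against 2.7.1.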

> flex down does not cancel the outstanding request
> -------------------------------------------------
>
>                 Key: SLIDER-939
>                 URL: https://issues.apache.org/jira/browse/SLIDER-939
>             Project: Slider
>          Issue Type: Bug
>          Components: core
>    Affects Versions: Slider 0.80
>         Environment: Hadoop 2.7.1 
> Slider 0.80.0
>            Reporter: Youjie Chen
>            Assignee: Steve Loughran
>              Labels: patch
>             Fix For: Slider 0.81
>
>
> I run a Slider app on a 6-node cluster. To ensure there is only one 
> component (worker) instance on each node, I set yarn.memory to 51% of the 
> total memory.
> When I flex up to 7 workers, one worker request stays outstanding and can 
> never be met; this is expected.
> When I then flex down back to 6 workers, any container request from any 
> other job is blocked, even though there is plenty of memory/cores for it. 
> The RM log shows this output continuously:
> capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainersToNode(1240)) - Skipping scheduling 
> since node test.example.com:45454 is reserved by application 
> appattempt_1442384698868_0008_000001
> It seems the outstanding request is not actually cancelled in the container 
> request queue; the RM keeps trying to satisfy it.
> After I flex down to 5 workers, the other blocked jobs can run.
> This is related to https://issues.apache.org/jira/browse/SLIDER-490



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
