[ https://issues.apache.org/jira/browse/YARN-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887471#comment-13887471 ]
Sunil G commented on YARN-1662:
-------------------------------

If we can implement a timed reservation logic here, it will be safer for the fresh allocation to be tried on some other node. I have reviewed the scheduler part and found that this can be achieved without a separate timer thread.

addReReservation() is invoked when the same node tries to re-reserve the same application's request. This is backed by a multiset, so the internal count increments every time addReReservation() is performed. It is also incremented only once per second (the node heartbeat interval).

I wish to add code like the below in LeafQueue::assignContainer(). If the limit is exceeded, I will try to unreserve the request from the node. This code will be hit when the same application tries to re-reserve again on the same node.

    } else {
      // Reserve by 'charging' in advance...
      reserve(application, priority, node, rmContainer, container);

      // Check the re-reservation limit. If exceeded, unreserve and try for a
      // fresh allocation.
      if (RESERVATION_TIME_LIMIT != 0
          && application.getReReservations(priority) > RESERVATION_TIME_LIMIT) {
        unreserve(application, priority, node, rmContainer);
        return Resources.none();
      }
    }

So on the next node update from some other node, the CapacityScheduler can try to allocate the resource to this application.

NB: Reservation is meant to ensure that the same task can stick to the node where it runs best. A larger configurable limit, chosen based on the nature of the running tasks, can still achieve that behavior. Please share your thoughts.

> Capacity Scheduler reservation issue causes job hang
> ---------------------------------------------------
>
> Key: YARN-1662
> URL: https://issues.apache.org/jira/browse/YARN-1662
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.2.0
> Environment: Suse 11 SP1 + Linux
> Reporter: Sunil G
>
> There are 2 node managers in my cluster.
> NM1 with 8GB
> NM2 with 8GB
> I am submitting a job with the below details:
> AM with 2GB
> Map needs 5GB
> Reducer needs 3GB
> slowstart is enabled with 0.5
> 10 maps and 50 reducers are assigned.
> 5 maps are completed. Now a few reducers got scheduled.
> Now NM1 has the 2GB AM and the 3GB Reducer_1 [used 5GB].
> NM2 has the 3GB Reducer_2 [used 3GB].
> A map has now reserved 5GB on NM1, which has only 3GB free.
> It hangs forever.
> The potential issue is that the reservation is now blocked on NM1 for a map which needs 5GB,
> but Reducer_1 hangs waiting for a few map outputs.
> Reducer-side preemption also did not happen, as some headroom is still available.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
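To make the proposed counting concrete, here is a self-contained sketch of the idea, not the actual CapacityScheduler code. In the real scheduler the per-priority count lives in a multiset on the application attempt; a plain map is used here so the example compiles standalone. The class name ReservationTracker and the constructor parameter are hypothetical; the shouldUnreserve() condition mirrors the RESERVATION_TIME_LIMIT check in the snippet above, where a limit of 0 disables the feature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed re-reservation limit.
public class ReservationTracker {
    // 0 disables the limit, mirroring the RESERVATION_TIME_LIMIT != 0 guard.
    private final int reReservationLimit;
    // Per-priority re-reservation counts (a multiset in the real scheduler).
    private final Map<Integer, Integer> reReservations = new HashMap<>();

    public ReservationTracker(int reReservationLimit) {
        this.reReservationLimit = reReservationLimit;
    }

    // Called on each node heartbeat (~1s apart) that re-reserves the request.
    public void addReReservation(int priority) {
        reReservations.merge(priority, 1, Integer::sum);
    }

    public int getReReservations(int priority) {
        return reReservations.getOrDefault(priority, 0);
    }

    // True once the count exceeds the limit: drop the reservation so the
    // next node update from another node can satisfy the request instead.
    public boolean shouldUnreserve(int priority) {
        return reReservationLimit != 0
            && getReReservations(priority) > reReservationLimit;
    }

    public static void main(String[] args) {
        ReservationTracker tracker = new ReservationTracker(3);
        int priority = 1;
        // Simulate four heartbeats re-reserving the same request on one node.
        for (int i = 0; i < 4; i++) {
            tracker.addReReservation(priority);
        }
        // Count (4) now exceeds the limit (3), so the scheduler would
        // unreserve and return Resources.none().
        System.out.println(tracker.shouldUnreserve(priority)); // prints "true"
    }
}
```

Because the count only grows with heartbeats, the limit acts as an implicit timer (roughly limit-in-seconds), which is why no separate timer thread is needed.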