[ 
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5180:
----------------------------------
    Sprint:   (was: Mesosphere Sprint 34)

> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with 
> the master in some network partition scenarios.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> Either of these links can break *without* the master changing.  (Currently, 
> the scheduler driver only re-registers if the master changes.)
> If both links break or if just link (1) breaks, the master views the 
> framework as {{inactive}} and {{disconnected}}.  This means the framework 
> will not receive any more events (such as offers) from the master until it 
> re-registers.  There is currently no way for the scheduler to detect a 
> one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The 
> scheduler usually uses the link to send messages to the master, but 
> libprocess will create another socket if the persistent one is not available.
> To handle link breakages in cases (1+2) and (2), the scheduler driver should 
> implement an {{::exited}} event handler for the master's {{pid}} and trigger a 
> master (re-)detection upon a disconnection. This in turn should make the 
> driver (re-)register with the master. The scheduler library already does 
> this: 
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.
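
The proposed fix can be sketched as a small standalone C++ model. This is not the Mesos or libprocess code: `Pid`, `SketchDriver`, `simulateLinkBreak`, and the master address are hypothetical stand-ins, and the detector is reduced to a plain callback. The sketch only illustrates the control flow described above: an `exited` notification for the master's pid marks the driver disconnected, triggers a re-detection, and the driver re-registers even though the master has not changed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a libprocess PID (not the real process::UPID).
using Pid = std::string;

// Hypothetical, simplified model of a scheduler driver.
class SketchDriver {
public:
  // `detect` models master detection: it returns the current master's pid.
  explicit SketchDriver(std::function<Pid()> detect)
    : detect_(std::move(detect)) {}

  // Initial detection and registration.
  void start() { detected(detect_()); }

  // Analogue of an `exited` handler: called when the link to `pid` breaks.
  void exited(const Pid& pid) {
    if (pid != master_) {
      return;  // Not the master we are linked to; nothing to do.
    }
    connected_ = false;
    events_.push_back("disconnected " + pid);
    // Trigger a (re-)detection. The result may be the same master; the
    // driver re-registers regardless.
    detected(detect_());
  }

  bool connected() const { return connected_; }
  const std::vector<std::string>& events() const { return events_; }

private:
  // Records the detected master and (re-)registers with it.
  void detected(const Pid& master) {
    const std::string verb = master_.empty() ? "register" : "reregister";
    master_ = master;
    connected_ = true;
    events_.push_back(verb + " " + master);
  }

  std::function<Pid()> detect_;
  Pid master_;
  bool connected_ = false;
  std::vector<std::string> events_;
};

// Simulates a link (2) break while the master stays the same.
std::vector<std::string> simulateLinkBreak() {
  SketchDriver driver([] { return Pid("master@10.0.0.1:5050"); });
  driver.start();
  driver.exited("master@10.0.0.1:5050");
  return driver.events();
}
```

In this model, `simulateLinkBreak()` records a registration, a disconnection, and a re-registration with the same master. A driver that only reacts to master changes would stop after the first event.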



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
