[ https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anand Mazumdar updated MESOS-5180:
----------------------------------
    Sprint:   (was: Mesosphere Sprint 34)

> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with
> the master under some network partition cases.
>
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
>
> It is possible for either of these links to break *without* the master
> changing. (Currently, the scheduler driver will only re-register if the
> master changes).
>
> If both links break or if just link (1) breaks, the master views the
> framework as {{inactive}} and {{disconnected}}. This means the framework
> will not receive any more events (such as offers) from the master until it
> re-registers. There is currently no way for the scheduler to detect a
> one-way link breakage.
>
> If link (2) breaks, it makes (almost) no difference to the scheduler. The
> scheduler usually uses the link to send messages to the master, but
> libprocess will create another socket if the persistent one is not available.
>
> To fix link breakages for (1+2) and (2), the scheduler driver should
> implement a `::exited` event handler for the master's {{pid}} and trigger a
> master (re-)detection upon a disconnection. This in turn should make the
> driver (re-)register with the master. The scheduler library already does
> this:
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
>
> See the related issue MESOS-5181 for link (1) breakage.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
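
For illustration only, the fix proposed in the description (handle the exit of the master's pid, then trigger a master re-detection so that the driver re-registers) could be modelled roughly as below. This is a self-contained toy sketch, not Mesos or libprocess code: `Pid`, `DriverSketch`, the `redetect` callback, and all member names are hypothetical stand-ins, and the real driver would link to the master and send a registration message where the sketch only bumps a counter.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a libprocess pid; not the real process::UPID.
using Pid = std::string;

// Toy model of the driver's connection state. Names are illustrative only
// and do not mirror the actual Mesos classes.
class DriverSketch {
public:
  // `redetect` stands in for asking the master detector for a fresh result.
  explicit DriverSketch(std::function<void()> redetect)
    : redetect_(std::move(redetect)) {
    assert(static_cast<bool>(redetect_));
  }

  // Called when the detector reports a master, possibly the same one as
  // before. The real driver would link to the master and (re-)register here.
  void detected(const Pid& master) {
    master_ = master;
    connected_ = false;
    ++registrationAttempts_;
  }

  // Called when the master acknowledges the (re-)registration.
  void registered() { connected_ = true; }

  // Called when the link to a peer breaks.
  void exited(const Pid& pid) {
    if (master_.empty() || pid != master_) {
      return; // Not the current master: nothing to do.
    }
    connected_ = false;
    master_.clear();
    redetect_(); // Re-detect even though the master itself may be unchanged.
  }

  bool connected() const { return connected_; }
  int registrationAttempts() const { return registrationAttempts_; }

private:
  std::function<void()> redetect_;
  Pid master_;                    // Empty while no master is known.
  bool connected_ = false;
  int registrationAttempts_ = 0;
};
```

In this model, an `exited` call for the current master marks the driver as disconnected and requests a new detection. A later `detected` call with the same master then causes a second registration attempt, which is the behaviour the description asks for. Exits of unrelated pids are ignored.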