[ https://issues.apache.org/jira/browse/MESOS-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297598#comment-14297598 ]
Alexander Rukletsov commented on MESOS-2301:
--------------------------------------------

{quote}
Do you mean the master never marks the tasks as LOST or just it takes a long time? If it's the old master it should mark them LOST after health check timeout. If it's a new master, it should mark them LOST after recovery timeout.
{quote}

The scenario is: master shuts down -> slave sends {{UnregisterSlaveMessage}} to the master, but the message never arrives -> slave marks that it shouldn't recover its state by cleaning up the {{meta/}} folder -> slave shuts down -> a new master starts -> slave starts and registers with the master.

What happens is that the new master expects the slave to re-register, since it hasn't received the {{UnregisterSlaveMessage}}. As a consequence, the tasks are marked {{LOST}} only after {{flags.slave_reregister_timeout}}, which may be undesirable.

It looks to me like we have a sort of discrepancy: the slave shuts down cleanly and removes its recovery state, but when the master comes back it thinks that the slave has failed and will try to fail over. A potential solution would be to note on the slave's side that the master didn't receive the {{UnregisterSlaveMessage}} and didn't remove that slave.

> Slave does not cleanly unregister
> ---------------------------------
>
>                 Key: MESOS-2301
>                 URL: https://issues.apache.org/jira/browse/MESOS-2301
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Dario Rexin
>
> If a machine running the mesos slave is being rebooted, the mesos slave does
> a clean shutdown. It stops all its executors, unregisters from the master
> and removes the symlink to the latest state.
> However, if the master is not reachable during the reboot, the slave will still
> remove the symlink to the latest state and will register with a new ID when
> restarted. This leads to the master waiting for the slave to come back for
> the configured amount of time and not marking the tasks as lost or killed.
> This also means that these tasks will not be restarted by the framework (in
> this case Marathon), because it assumes they are still alive.
> This problem could be solved by introducing a new message
> `SlaveUnregisteredMessage` that gets sent by the master when a slave
> has successfully unregistered. The slave only has to wait for this message, and if
> it doesn't receive it, it should not remove the symlink to `latest`.
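
The acknowledgement idea from the issue description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Mesos code: the `Master` and `Slave` classes, `handle_unregister`, and the `latest_symlink_present` flag are hypothetical names invented for illustration, and the wait for `SlaveUnregisteredMessage` (including any timeout) is reduced to a boolean return value.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class Master:
    """Toy master (hypothetical): tracks registered slave IDs."""
    reachable: bool = True
    registered: Set[str] = field(default_factory=set)

    def handle_unregister(self, slave_id: str) -> bool:
        """Models receiving UnregisterSlaveMessage.

        Returns True when the proposed SlaveUnregisteredMessage would be
        sent back, False when the request never reaches the master.
        """
        if not self.reachable:
            return False  # message lost, so no acknowledgement
        self.registered.discard(slave_id)
        return True


@dataclass
class Slave:
    """Toy slave (hypothetical): models only the shutdown decision."""
    slave_id: str
    latest_symlink_present: bool = True

    def shutdown(self, send_unregister: Callable[[str], bool]) -> bool:
        """Remove the `latest` symlink only if the master acknowledged."""
        acked = send_unregister(self.slave_id)
        if acked:
            # Master removed us, so recovery state is no longer needed.
            self.latest_symlink_present = False
        # Without an ack the symlink stays, so the slave can recover
        # its state and re-register under the same ID after restart.
        return acked


# Reachable master: the slave unregisters and drops its recovery state.
master = Master(registered={"s1"})
s1 = Slave("s1")
s1.shutdown(master.handle_unregister)

# Unreachable master: the slave keeps the symlink to `latest`.
down = Master(reachable=False, registered={"s2"})
s2 = Slave("s2")
s2.shutdown(down.handle_unregister)
```

In the second case the slave would come back with its old ID, so a new master that is still waiting for a re-registration would see the slave it expects instead of waiting out `flags.slave_reregister_timeout`.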