[ 
https://issues.apache.org/jira/browse/MESOS-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297598#comment-14297598
 ] 

Alexander Rukletsov commented on MESOS-2301:
--------------------------------------------

{quote}
Do you mean the master never marks the tasks as LOST, or just that it takes a 
long time? If it's the old master, it should mark them LOST after the health 
check timeout. If it's a new master, it should mark them LOST after the 
recovery timeout.
{quote}
The scenario is: master shuts down -> slave sends {{UnregisterSlaveMessage}} 
to the master, but the message never arrives -> slave cleans up the {{meta/}} 
folder, marking that it should not recover its state -> slave shuts down -> a 
new master starts -> slave starts and registers with the master. The new 
master, however, expects the slave to re-register, since it never received the 
{{UnregisterSlaveMessage}}. As a consequence, the tasks are marked {{LOST}} 
only after {{flags.slave_reregister_timeout}}, which may be undesirable.
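
To make the timeline concrete, here is a minimal Python sketch of the master's side after failover. It is a toy model, not Mesos code: the {{Master}} class, its fields, and the {{tick}} method are hypothetical names chosen for illustration, and the only behaviour it assumes is the one described above (tasks of a slave that never re-registers are marked {{LOST}} once the re-register timeout has elapsed).

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy model of a freshly elected master (hypothetical, not the Mesos API)."""
    reregister_timeout: float
    # slave ID -> tasks the new master still believes are running there
    recovered: dict = field(default_factory=dict)
    registered: set = field(default_factory=set)
    lost_tasks: list = field(default_factory=list)

    def register(self, slave_id: str) -> None:
        # A slave that wiped its state shows up with a brand-new ID,
        # which matches none of the recovered entries.
        self.registered.add(slave_id)

    def reregister(self, slave_id: str) -> None:
        # A slave that kept its state comes back under its old ID.
        self.recovered.pop(slave_id, None)
        self.registered.add(slave_id)

    def tick(self, elapsed: float) -> None:
        # Only once the timeout has passed are the remaining tasks marked LOST.
        if elapsed >= self.reregister_timeout:
            for tasks in self.recovered.values():
                self.lost_tasks.extend(tasks)
            self.recovered.clear()


master = Master(reregister_timeout=600.0, recovered={"S1": ["task-a"]})
master.register("S2")      # same machine, new ID after the meta/ cleanup
master.tick(10.0)          # task-a is still considered alive here
master.tick(600.0)         # only now is task-a marked LOST
```

In this model the framework learns about the lost task only at the second {{tick}}, which is the delay the reporter is complaining about.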

It looks to me like we have a discrepancy: the slave shuts down cleanly and 
removes its recovery state, but when the master comes back it thinks that the 
slave has failed and will try to fail over. A potential solution would be for 
the slave to note that the master did not receive the 
{{UnregisterSlaveMessage}} and therefore did not remove that slave.
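
A rough Python sketch of that idea, from the slave's side, could look like the following. Everything here is an assumption for illustration: the {{Slave}} class, the {{unregister_acked}} flag, and the ID generation are hypothetical and do not correspond to existing Mesos code. The point is only that recovery state is dropped when, and only when, the slave knows the master processed the unregistration.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Slave:
    """Toy model of a slave's shutdown/restart decision (hypothetical)."""
    slave_id: str
    has_recovery_state: bool = True

    def shutdown(self, unregister_acked: bool) -> None:
        # Drop the recovery state only if the master confirmed that it
        # removed this slave; otherwise keep it for the next start.
        if unregister_acked:
            self.has_recovery_state = False

    def restart(self) -> str:
        if self.has_recovery_state:
            # Old ID is preserved, so the master's expectation is met.
            return "reregister"
        # No state left: start fresh under a new ID.
        self.slave_id = uuid.uuid4().hex
        self.has_recovery_state = True
        return "register"


slave = Slave(slave_id="S1")
slave.shutdown(unregister_acked=False)   # master was unreachable
action = slave.restart()                 # "reregister", still as S1
```

With this behaviour a slave whose unregistration got lost comes back under its old ID, so the new master does not have to wait out {{flags.slave_reregister_timeout}} for it.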

> Slave does not cleanly unregister
> ---------------------------------
>
>                 Key: MESOS-2301
>                 URL: https://issues.apache.org/jira/browse/MESOS-2301
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Dario Rexin
>
> If a machine running the mesos slave is being rebooted, the mesos slave does 
> a clean shutdown. It stops all its executors, unregisters from the master 
> and removes the symlink to the latest state. 
> However, if the master is not reachable during the reboot, the slave will 
> still remove the symlink to the latest state and will register with a new ID 
> when restarted. This leads to the master waiting for the slave to come back 
> for the configured amount of time and not marking the tasks as lost or 
> killed. This also means that these tasks will not be restarted by the 
> framework (in this case Marathon), because it assumes they are still alive.
> This problem could be solved by introducing a new message 
> `SlaveUnregisteredMessage` that gets sent by the master when a slave has 
> successfully unregistered. The slave only has to wait for this message, and 
> if it doesn't receive it, it should not remove the symlink to `latest`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
