Hi All,

We have been working on the design for restartable tasks (MESOS-3545), and
allowing agents to recover and re-register after a host reboot is a
prerequisite for that.
Today, an agent does not recover its state, which includes its SlaveID, after a
host reboot: it short-circuits recovery upon discovering the reboot and
registers with the master as a new agent. With partition awareness, the Mesos
master even allows agents that have failed the master’s health check pings
(unreachable agents) to re-register with it and reconcile their tasks/executors.
The executors on a rebooted host are terminated anyway, so there is no harm in
letting such an agent recover and re-register with the master using its old
SlaveID.
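
To make the difference between the current and the proposed behavior concrete,
here is a minimal sketch in Python. It is a toy model, not Mesos code: the
Checkpoint type, the recover_slave_id function and the recover_after_reboot
switch are hypothetical names, and detecting a reboot by comparing a stored
boot identifier with the current one is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the agent's checkpointed state."""
    slave_id: str
    boot_id: str  # boot identifier recorded when the state was written (assumed)


def recover_slave_id(checkpoint: Optional[Checkpoint],
                     current_boot_id: str,
                     recover_after_reboot: bool) -> Optional[str]:
    """Return the SlaveID to re-register with, or None to register as new."""
    if checkpoint is None:
        # Nothing was checkpointed, so the agent can only register as new.
        return None

    rebooted = checkpoint.boot_id != current_boot_id
    if rebooted and not recover_after_reboot:
        # Current behavior: short-circuit recovery after a host reboot.
        return None

    # Proposed behavior (or no reboot at all): keep the old SlaveID.
    return checkpoint.slave_id
```

With recover_after_reboot set to False, a changed boot identifier yields None
(register as a new agent); with it set to True, the old SlaveID is returned and
the agent would re-register with it.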
We would like to hear from folks here whether you see any operational concerns
with letting agents recover after a host reboot.

MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223

Many Thanks
Megha Sharma

