Hi All,

We have been working on the design for restartable tasks (MESOS-3545), and allowing agents to recover and re-register after a host reboot is a prerequisite for that.

Today, an agent does not recover its state, which includes its SlaveID, after a host reboot. Instead, it short-circuits recovery upon discovering the reboot and registers with the master as a new agent.

With Partition Awareness, the Mesos master already allows agents that have failed the master's health check pings (unreachable agents) to re-register with it and reconcile their tasks/executors. The executors on a rebooted host are terminated anyway, so there should be no harm in letting such an agent recover and re-register with the master using its old SlaveID.

We would like to hear from folks here whether you see any operational concerns with letting agents recover after a host reboot.
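To make the proposal concrete, here is a rough sketch of the decision being discussed. It is illustrative only and is not the actual agent code: the names `CheckpointedState`, `choose_registration`, and `recover_after_reboot` are hypothetical, and it assumes the reboot is detected by comparing a checkpointed boot ID with the current one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CheckpointedState:
    """Hypothetical stand-in for the state an agent checkpoints to disk."""
    slave_id: str
    boot_id: str  # assumption: boot ID recorded when the state was written


def choose_registration(state: Optional[CheckpointedState],
                        current_boot_id: str,
                        recover_after_reboot: bool) -> Tuple[str, Optional[str]]:
    """Return (action, slave_id) for the agent's first message to the master.

    recover_after_reboot=False models the behavior described above as today's;
    recover_after_reboot=True models the proposed behavior.
    """
    if state is None:
        # Nothing checkpointed: the agent can only register as new.
        return ("register", None)

    rebooted = state.boot_id != current_boot_id
    if rebooted and not recover_after_reboot:
        # Today: recovery is short-circuited and the old SlaveID is dropped.
        return ("register", None)

    # Proposed (and the existing no-reboot case): keep the old SlaveID.
    return ("reregister", state.slave_id)
```

With `state = CheckpointedState("S1", "boot-1")` and a current boot ID of `"boot-2"`, the sketch registers as a new agent when `recover_after_reboot` is false and re-registers as `S1` when it is true.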
MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223

Many thanks,
Megha Sharma