[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870821#comment-15870821 ]
Yan Xu edited comment on MESOS-6223 at 2/16/17 10:54 PM:
---------------------------------------------------------

From my comment on the email thread:
{quote}
So one thing that was brought up during offline conversations was that if the host reboot is associated with a hardware change (e.g., a new memory stick):

Currently: the agent would skip the recovery (and so avoid the chance of running into incompatible agent info) and register as a new agent.
With the change: the agent could run into incompatible agent info due to the resource change and flap indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent automatically remove `<work_dir>/meta/slaves/latest` (i.e., `rm -f <work_dir>/meta/slaves/latest`) upon recovery failure, but only after the host has rebooted. This way the agent can restart as a new agent without operator intervention.
{quote}

Of course, even if we do this to maintain the current behavior, it remains true that relying on reboot as a signal for hardware change is not reliable and the fix should be MESOS-1739.

was (Author: xujyan):
From my comment on the email thread:
{quote}
So one thing that was brought up during offline conversations was that if the host reboot is associated with a hardware change (e.g., a new memory stick):

Currently: the agent would skip the recovery (and so avoid the chance of running into incompatible agent info) and register as a new agent.
With the change: the agent could run into incompatible agent info due to the resource change and flap indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent automatically remove `<work_dir>/meta/slaves/latest` (i.e., `rm -f <work_dir>/meta/slaves/latest`) upon recovery failure, but only after the host has rebooted. This way the agent can restart as a new agent without operator intervention.
{quote}

Of course, even if we do this to maintain the current behavior, it remains true that relying on reboot as a signal for hardware change is not reliable but the fix should be MESOS-1739.

> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
>                 Key: MESOS-6223
>                 URL: https://issues.apache.org/jira/browse/MESOS-6223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent
>            Reporter: Megha Sharma
>            Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; instead it registers
> with the master and gets a new SlaveID. With partition awareness, agents are
> now allowed to re-register after they have been marked Unreachable. The
> executors are terminated on the agent anyway when it reboots, so there is no
> harm in letting the agent keep its SlaveID, re-register with the master and
> reconcile the lost executors. This is a prerequisite for supporting
> persistent/restartable tasks in Mesos (MESOS-3545).

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
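
The mitigation proposed in the comment above (drop `<work_dir>/meta/slaves/latest` when recovery fails, but only after a host reboot) could look roughly like the sketch below. This is an illustration only, not the Mesos implementation: the function names, the two boolean inputs and the `agent-1` directory are hypothetical, and how the agent actually detects a reboot or a recovery failure is left out.

```python
import os
import tempfile


def latest_symlink(work_dir: str) -> str:
    """Path of the `latest` symlink named in the comment."""
    return os.path.join(work_dir, "meta", "slaves", "latest")


def reset_after_failed_recovery(
    work_dir: str, recovery_failed: bool, host_rebooted: bool
) -> bool:
    """Remove `latest` only when recovery failed AND the host rebooted.

    Returns True if the symlink was removed. Both flags are assumed to be
    computed elsewhere (hypothetical inputs for this sketch).
    """
    if not (recovery_failed and host_rebooted):
        # Without a reboot, keep the checkpointed state so the operator
        # can inspect the failure, which matches the current behavior.
        return False
    try:
        # Like `rm -f`: removes the symlink itself, never its target.
        os.unlink(latest_symlink(work_dir))
    except FileNotFoundError:
        return False
    return True


def demo():
    """Exercise both branches in a throwaway work directory."""
    with tempfile.TemporaryDirectory() as work_dir:
        target = os.path.join(work_dir, "meta", "slaves", "agent-1")
        os.makedirs(target)
        os.symlink(target, latest_symlink(work_dir))

        kept = reset_after_failed_recovery(work_dir, True, False)
        removed = reset_after_failed_recovery(work_dir, True, True)
        return (
            kept,
            removed,
            os.path.lexists(latest_symlink(work_dir)),
            os.path.isdir(target),
        )
```

Only the symlink is removed in this sketch; the old agent's checkpointed directory is left in place, so the agent would come up as a new agent while the previous state stays on disk.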