[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870821#comment-15870821
 ] 

Yan Xu edited comment on MESOS-6223 at 2/16/17 10:54 PM:
---------------------------------------------------------

From my comment on the email thread:

{quote}
One thing brought up during offline conversations was what happens if the 
host reboot is associated with a hardware change (e.g., a new memory stick):

Currently: the agent would skip the recovery (and the chance of running into 
incompatible agent info) and register as a new agent.
With the change: the agent could run into incompatible agent info due to 
resource change and flap indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent 
automatically run `rm -f <work_dir>/meta/slaves/latest` upon recovery 
failure, but only after the host has rebooted. This way the agent can restart 
as a new agent without operator intervention. 
{quote}

Of course, even if we do this to maintain the current behavior, it remains true 
that relying on a reboot as a signal for hardware change is not reliable, and 
the proper fix should be MESOS-1739.
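
To make the mitigation quoted above concrete, here is a minimal sketch of the 
decision logic. It is not the agent's actual code: `handle_recovery`, 
`RecoveryError`, the `recover` callable and the `rebooted` flag are all 
hypothetical names used for illustration; only the 
`<work_dir>/meta/slaves/latest` path comes from the comment.

```python
import os
from enum import Enum


class RecoveryError(Exception):
    """Hypothetical error for a failed recovery (e.g. incompatible agent info)."""


class Action(Enum):
    RECOVERED = "recovered"            # checkpointed state was usable
    RESTART_AS_NEW = "restart_as_new"  # symlink removed, register as a new agent
    FAIL = "fail"                      # no reboot: leave state for the operator


def handle_recovery(work_dir, recover, rebooted):
    """Run `recover`; on failure, remove `latest` only if the host rebooted.

    `recover` is an assumed callable that raises RecoveryError on failure.
    `rebooted` is an assumed flag: True if the host rebooted since the
    agent state was checkpointed.
    """
    latest = os.path.join(work_dir, "meta", "slaves", "latest")
    try:
        recover()
        return Action.RECOVERED
    except RecoveryError:
        if not rebooted:
            # Without a reboot, keep the state so the operator can inspect it.
            return Action.FAIL
        # Equivalent of `rm -f <work_dir>/meta/slaves/latest`: removes the
        # symlink itself (not its target) and ignores a missing file.
        try:
            os.remove(latest)
        except FileNotFoundError:
            pass
        return Action.RESTART_AS_NEW
```

With this shape, a recovery failure after a reboot leads to a fresh 
registration, while a failure without a reboot still needs the operator.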


> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
>                 Key: MESOS-6223
>                 URL: https://issues.apache.org/jira/browse/MESOS-6223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent
>            Reporter: Megha Sharma
>            Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent when it reboots anyway, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
