[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090881#comment-16090881 ]

Yan Xu commented on MESOS-6223:
-------------------------------

https://reviews.apache.org/r/60925/

> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
>                 Key: MESOS-6223
>                 URL: https://issues.apache.org/jira/browse/MESOS-6223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent
>            Reporter: Megha Sharma
>            Assignee: Megha Sharma
>
> Agent doesn't recover its state post a host reboot, it registers with the
> master and gets a new SlaveID. With partition awareness, the agents are now
> allowed to re-register after they have been marked Unreachable. The executors
> are anyway terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master and reconcile
> the lost executors. This is a pre-requisite for supporting
> persistent/restartable tasks in mesos (MESOS-3545).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085192#comment-16085192 ]

Yan Xu commented on MESOS-6223:
-------------------------------

{noformat:title=}
commit 188109b63ea9cc0cdfe1fd616c744cb10dbb4a57
Author: Megha Sharma
Date:   Wed Jul 12 22:03:37 2017 -0700

    Added tests to ensure slave recovery post reboot.

    Added tests to verify that the state is recovered post reboot and
    the agent ID is retained given the recovery finishes without errors
    and if the recovery fails due to agent info mismatch then agent is
    recoverd as a new agent.

    Review: https://reviews.apache.org/r/56895/

commit cd6495e677ec74fd3f40b0dbf3b9654475308575
Author: Megha Sharma
Date:   Mon Jul 10 09:38:28 2017 -0700

    Recover as a new agent in case of agent info mismatch on reboot.

    This is for backwards compatibility. Prior to Mesos 1.4 we directly
    bypass the state recovery and start as a new agent upon reboot
    (introduced in MESOS-844). This unnecessarily discards the existing
    agent ID (MESOS-6223). Starting in Mesos 1.4 we'll attempt to
    recover the slave state even after reboot but in case of slave info
    mismatch we'll fall back to recovering as a new agent (existing
    behavior). This prevents the agent from flapping if the agent info
    (resources, attributes, etc.) change is due to host maintenance
    associated with the reboot.

    Review: https://reviews.apache.org/r/60105/

commit 91f4e9acd0bad60201155b68a896d12d7200eda3
Author: Megha Sharma
Date:   Mon Jul 10 09:34:40 2017 -0700

    Stopped short-circuiting agent recovery upon reboot.

    The agent would continue the recovery and we added a `rebooted`
    flag to `slave::State` to record the reboot info.

    Review: https://reviews.apache.org/r/60104/
{noformat}

Still need a patch for CHANGELOG and upgrades.md before resolving.
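For readers skimming the thread, the behavior described by the three commits above can be sketched roughly as follows. This is a toy model in Python, not the agent's actual C++ code: `CheckpointedState`, `RecoveryOutcome` and `decide_recovery` are hypothetical names, and treating a mismatch without a reboot as a hard failure is an assumption drawn from the "flapping" discussion elsewhere in this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RecoveryOutcome(Enum):
    KEEP_AGENT_ID = "keep_agent_id"  # state recovered, agent re-registers
    NEW_AGENT = "new_agent"          # state discarded, agent registers afresh
    FAIL = "fail"                    # recovery error, operator must intervene


@dataclass(frozen=True)
class CheckpointedState:
    """Hypothetical stand-in for the checkpointed agent state."""
    agent_id: str
    info: dict       # resources, attributes, etc.
    rebooted: bool   # mirrors the `rebooted` flag mentioned in the commits


def decide_recovery(state: Optional[CheckpointedState],
                    current_info: dict) -> RecoveryOutcome:
    """Decide how the agent starts up, per the commit messages above."""
    if state is None:
        # Nothing checkpointed: there is no ID to keep.
        return RecoveryOutcome.NEW_AGENT
    if state.info == current_info:
        # Recovery is no longer short-circuited on reboot, so the ID is kept.
        return RecoveryOutcome.KEEP_AGENT_ID
    if state.rebooted:
        # Agent info mismatch after a reboot: fall back to a new agent.
        return RecoveryOutcome.NEW_AGENT
    # Assumption: a mismatch without a reboot stays a recovery error.
    return RecoveryOutcome.FAIL
```

With this model, a rebooted host whose resources are unchanged keeps its agent ID, while a rebooted host with changed resources registers as a new agent instead of failing recovery repeatedly.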
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048741#comment-16048741 ]

Deshi Xiao commented on MESOS-6223:
-----------------------------------

The patch needs to be refactored. [~megha.sharma]
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013251#comment-16013251 ]

Megha Sharma commented on MESOS-6223:
-------------------------------------

Review Request https://reviews.apache.org/r/56895/
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013053#comment-16013053 ]

Megha Sharma commented on MESOS-6223:
-------------------------------------

[~xds2000] Sorry, I was on vacation last week. I am going to address Vinod's comments on priority and try to close this ASAP.
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999209#comment-15999209 ]

Deshi Xiao commented on MESOS-6223:
-----------------------------------

[~megha.sharma] Could you please refresh the patch based on Vinod Kone's suggestions?
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988379#comment-15988379 ]

Deshi Xiao commented on MESOS-6223:
-----------------------------------

Thanks Neil.
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985294#comment-15985294 ]

Neil Conway commented on MESOS-6223:
------------------------------------

[~xds2000] Folks, my apologies for not reviewing this more promptly. [~vinodkone] has kindly agreed to take over shepherding this ticket.
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984224#comment-15984224 ]

Deshi Xiao commented on MESOS-6223:
-----------------------------------

[~neilc] Any progress on this?
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971344#comment-15971344 ]

Megha Sharma commented on MESOS-6223:
-------------------------------------

[~neilc] on it, I am looking into the test failure. Should have the patch ready really soon.
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971303#comment-15971303 ]

Neil Conway commented on MESOS-6223:
------------------------------------

[~xds2000] -- There is a known test failure that AFAIK hasn't been resolved yet (details are on ReviewBoard). I'm waiting for that to be addressed before I dig into these changes more deeply -- but I'd like to get this change wrapped up and shipped pretty soon.
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970939#comment-15970939 ]

Deshi Xiao commented on MESOS-6223:
-----------------------------------

[~neilc] Do you have any update on this patch? https://reviews.apache.org/r/56895/
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895772#comment-15895772 ]

Deshi Xiao commented on MESOS-6223:
-----------------------------------

Any update on the review patch?
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890850#comment-15890850 ]

Tim Harper commented on MESOS-6223:
-----------------------------------

This should help fix an issue we are seeing with tasks and reserved resources in Marathon: https://github.com/mesosphere/marathon/issues/5284

In Marathon's case, when a residential task (one that has reserved resources) becomes unreachable because the node is rebooting, we never receive a terminal state for the task, even though the host reboots and comes back online. We believe this is because, during reconciliation, we send the old agent ID and the task ID, and Mesos continually reports an unknown status. Were the agent in question to keep the same agent ID, an explicit reconciliation of that agent ID plus the task ID should, I think, result in a status update that signals definite terminality.
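The hypothesis in the comment above can be illustrated with a deliberately simplified model. The sketch below is not the master's reconciliation logic: `reconcile`, the `registered` map and the returned status strings are illustrative assumptions that only mirror what the comment says (an unknown agent ID yields an unknown status, a known agent ID with a missing task yields a terminal one).

```python
from typing import Dict, Set


def reconcile(registered: Dict[str, Set[str]], agent_id: str, task_id: str) -> str:
    """Toy explicit reconciliation for one (agent ID, task ID) pair.

    `registered` maps each agent ID the master currently knows about to
    the set of task IDs it knows to be running on that agent.
    """
    tasks = registered.get(agent_id)
    if tasks is None:
        # The master has no record of this agent ID, so it cannot say
        # anything definite about the task.
        return "TASK_UNKNOWN"
    if task_id in tasks:
        return "TASK_RUNNING"
    # The agent is known but the task is not on it: a terminal answer.
    return "TASK_LOST"


# Old behavior: after the reboot the host registers under a new ID ("S2"),
# so reconciling with the old ID ("S1") never produces a terminal state.
before = {"S2": set()}

# Proposed behavior: the host keeps "S1"; its executors died in the reboot.
after = {"S1": set()}
```

In this model `reconcile(before, "S1", "t1")` keeps returning the unknown status, while `reconcile(after, "S1", "t1")` returns a terminal one, which is the difference the comment is hoping for.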
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888719#comment-15888719 ]

Joseph Wu commented on MESOS-6223:
----------------------------------

See this ticket for a degenerate case where the agent will struggle to recover: https://issues.apache.org/jira/browse/MESOS-6285
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888650#comment-15888650 ]

Neil Conway commented on MESOS-6223:
------------------------------------

When we implement this, we should make sure that we benchmark the performance impact on agent recovery, in particular when there is frequent task churn on the agent. For example, when an agent has 10k-100k completed tasks and a few (< 20) running/live tasks; when the agent reboots, we should benchmark how long it takes for the agent to complete recovery. This is the situation that motivated the introduction of the "boot id" shortcut in the first place.

(cc [~megha.sharma] [~xujyan])
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870821#comment-15870821 ]

Yan Xu commented on MESOS-6223:
-------------------------------

From my comment on the email thread:

{quote}
So one thing that was brought up during offline conversations was that if the host reboot is associated with a hardware change (e.g., a new memory stick):

Currently: the agent would skip the recovery (and the chance of running into incompatible agent info) and register as a new agent.

With the change: the agent could run into incompatible agent info due to the resource change and flap indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent run `rm -f /meta/slaves/latest` automatically upon recovery failure, but only after the host has rebooted. This way the agent can restart as a new agent without operator intervention.
{quote}

Of course, even if we do this to maintain the current behavior, it remains true that relying on a reboot as a signal for a hardware change is not reliable, but the fix for that should be MESOS-1739.
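The mitigation quoted above amounts to a guarded `rm -f` of the `latest` symlink. A minimal sketch of that guard, assuming the caller already knows the symlink's full path and whether the host rebooted; `drop_latest_after_failed_recovery` and both of its parameters are hypothetical names, not part of Mesos:

```python
import os


def drop_latest_after_failed_recovery(latest_link: str, rebooted: bool) -> bool:
    """Remove the `latest` symlink after a failed recovery, reboot only.

    Returns True if the symlink was removed, in which case the next agent
    start would find no state to recover and would register as a new agent.
    """
    if not rebooted:
        # Without a reboot, keep today's behavior: the recovery error
        # stands and the operator decides what to do.
        return False
    if not os.path.islink(latest_link):
        # Nothing to remove (already gone, or not a symlink at all).
        return False
    # Unlink only the symlink itself; the checkpointed directory it
    # points to is left on disk.
    os.unlink(latest_link)
    return True
```

Removing only the symlink, rather than the directory it points to, matches the `rm -f` in the quote and leaves the old checkpoint available for inspection.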
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655590#comment-15655590 ]

Megha commented on MESOS-6223:
------------------------------

[~neilc] Here I am analyzing the impact of allowing the agent to recover post reboot in the context of partition awareness. In my understanding, there is no new transition that is not already happening with partition awareness. Do you think there could be a risk involved in allowing recovery post reboot?

1. If there are no partition-aware frameworks on the agent: while rebooting, the agent could either be disconnected or fail the master health check timeout. The executors don't re-register because they exited as a result of the reboot. The agent re-registers and starts to send status updates for unacked updates. From the framework's point of view the transition is simply TASK_STARTING/TASK_RUNNING -> TASK_LOST.
2. If there are tasks from partition-aware frameworks on the agent:
   a. The transition is the same as above if the agent is disconnected.
   b. If the agent is marked unreachable while it was rebooting, then from the framework's point of view the tasks transition from TASK_UNREACHABLE -> TASK_GONE when the agent re-registers and sends status updates. Since unreachable agents are in the registry, the master will remember them across its failovers, so if the agent doesn't come back, frameworks will receive a TASK_UNREACHABLE update upon reconciliation unless the registry is purged.
   c. If the agent is marked gone, the master sends TASK_GONE, and if such an agent doesn't come back, future framework reconciliations will result in a TASK_UNKNOWN status update: there is no gone registry, so these agents won't be remembered across master failovers. If the agent eventually comes back, the task could transition from TASK_UNKNOWN back to TASK_GONE.
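The cases enumerated in the comment above can be written down as a lookup table, which makes it easier to check that none of them introduces a new transition. This only restates the comment; the key layout, the `agent_fate` labels and the `expected_updates` helper are assumptions made for illustration, and cases the comment does not cover are deliberately left out.

```python
from typing import Dict, List, Tuple

# (partition-aware framework?, what happened to the agent during the
#  reboot, did the agent come back?) -> status updates the framework
# sees after TASK_STARTING/TASK_RUNNING, in order.
SCENARIOS: Dict[Tuple[bool, str, bool], List[str]] = {
    # Case 1: no partition-aware frameworks on the agent.
    (False, "disconnected_or_timed_out", True): ["TASK_LOST"],
    # Case 2a: same transition as case 1.
    (True, "disconnected", True): ["TASK_LOST"],
    # Case 2b: agent marked unreachable, then re-registers.
    (True, "unreachable", True): ["TASK_UNREACHABLE", "TASK_GONE"],
    # Case 2b: agent never returns (and the registry is not purged).
    (True, "unreachable", False): ["TASK_UNREACHABLE"],
    # Case 2c: agent marked gone and never returns; later reconciliations
    # yield TASK_UNKNOWN because gone agents are not remembered.
    (True, "gone", False): ["TASK_GONE", "TASK_UNKNOWN"],
    # Case 2c: the agent eventually comes back.
    (True, "gone", True): ["TASK_GONE", "TASK_UNKNOWN", "TASK_GONE"],
}


def expected_updates(partition_aware: bool, agent_fate: str,
                     agent_returns: bool) -> List[str]:
    """Look up a scenario; raise if the comment does not describe it."""
    key = (partition_aware, agent_fate, agent_returns)
    if key not in SCENARIOS:
        raise ValueError(f"scenario not covered by the analysis: {key}")
    return SCENARIOS[key]
```

Every status in the table is one that partition awareness already produces, which is the point the comment is making.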
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590108#comment-15590108 ]

Megha commented on MESOS-6223:
------------------------------

Recovery of agent post a reboot is required to be able to support restart of Restartable tasks when the executor dies as a result of agent host reboot. Here's the detailed design doc for Restartable Tasks:
https://docs.google.com/document/d/1YS_EBUNLkzpSru0dwn_hPUIeTATiWckSaosXSIaHUCo/edit?usp=sharing
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1500#comment-1500 ]

Megha commented on MESOS-6223:
------------------------------

This jira came out as a pre-requisite to support task restart post a reboot. There are definitely use cases where you would need a persistent agent ID, because resources like persistent volumes are not tied to the lifecycle of the ephemeral agent and exist even after the agent is gone. But in order to support task restart on the rebooted host, we need the previous agent ID or session ID (from MESOS-5368) to recover, figure out which tasks to restart, and eventually restart them. So I believe agent or session recovery post a reboot is needed. I believe short-circuiting recovery after a reboot is an optimization that relies on the fact that no tasks/executors are running after the agent's host reboots, which will change with MESOS-3545.
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553432#comment-15553432 ]

Yan Xu commented on MESOS-6223:
-------------------------------

[~neilc] [~vinodkone] I can think of ways we can implement restarting tasks post-reboot (MESOS-3545, will have a design doc out soon) via either the approach in this ticket or the one in MESOS-5368, but this one feels simpler.

Treating reboot as a special case sounds to me like an optimization that will no longer hold true once tasks are restarted. Then the question is:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed?

1) Sounds like no.

For 2), on the master the only error case where we disallow an agent to reregister but do allow the agent to register is [when the agent's ip or hostname has changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228]. I can imagine we'd want to force the agent to get rid of its {{work_dir//slave_id}} but keep the checkpointed resources etc.?

To summarize, it seems like we can keep both this ticket and MESOS-5368, but change MESOS-5368 to not change the session ID in the reboot case? Thoughts?
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552427#comment-15552427 ]

Neil Conway commented on MESOS-6223:
------------------------------------

cc [~vinodkone]
[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
[ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552425#comment-15552425 ]

Neil Conway commented on MESOS-6223:
------------------------------------

Another way to go here would be to introduce a new type of "persistent agent ID", as discussed in MESOS-5368 -- that would essentially be an ID for a given {{work_dir}}, whereas the existing Agent ID would remain closer to a "session ID".