[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-07-17 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090881#comment-16090881
 ] 

Yan Xu commented on MESOS-6223:
---

https://reviews.apache.org/r/60925/

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent anyway when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-07-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085192#comment-16085192
 ] 

Yan Xu commented on MESOS-6223:
---

{noformat:title=}
commit 188109b63ea9cc0cdfe1fd616c744cb10dbb4a57
Author: Megha Sharma 
Date:   Wed Jul 12 22:03:37 2017 -0700

Added tests to ensure slave recovery post reboot.

Added tests to verify that the state is recovered post reboot and the
agent ID is retained given the recovery finishes without errors and
if the recovery fails due to agent info mismatch then agent is recovered
as a new agent.

Review: https://reviews.apache.org/r/56895/

commit cd6495e677ec74fd3f40b0dbf3b9654475308575
Author: Megha Sharma 
Date:   Mon Jul 10 09:38:28 2017 -0700

Recover as a new agent in case of agent info mismatch on reboot.

This is for backwards compatibility. Prior to Mesos 1.4 we directly
bypass the state recovery and start as a new agent upon reboot
(introduced in MESOS-844). This unnecessarily discards the existing
agent ID (MESOS-6223). Starting in Mesos 1.4 we'll attempt to recover
the slave state even after reboot but in case of slave info mismatch
we'll fall back to recovering as a new agent (existing behavior). This
prevents the agent from flapping if the agent info (resources,
attributes, etc.) change is due to host maintenance associated with
the reboot.

Review: https://reviews.apache.org/r/60105/

commit 91f4e9acd0bad60201155b68a896d12d7200eda3
Author: Megha Sharma 
Date:   Mon Jul 10 09:34:40 2017 -0700

Stopped short-circuiting agent recovery upon reboot.

The agent would continue the recovery and we added a `rebooted` flag
to `slave::State` to record the reboot info.

Review: https://reviews.apache.org/r/60104/
{noformat}

Still need a patch for CHANGELOG and upgrades.md before resolving.
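
For readers following along, the recovery behavior described in the commit messages above can be sketched roughly as below. This is an illustrative Python model, not the agent's actual C++ code: `AgentInfo`, `CheckpointedState`, `recover` and `new_agent_id` are hypothetical names, the agent info comparison is reduced to plain equality, and failing recovery on a mismatch without a reboot is an assumption taken from the discussion on this ticket.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentInfo:
    # Hypothetical stand-in for the agent info (resources, attributes, etc.).
    hostname: str
    resources: str
    attributes: str = ""


@dataclass
class CheckpointedState:
    # Hypothetical stand-in for the state checkpointed before the restart.
    agent_id: str
    info: AgentInfo
    rebooted: bool  # mirrors the `rebooted` flag added to `slave::State`


class RecoveryError(Exception):
    """Raised when recovery cannot proceed (assumed behavior without a reboot)."""


def new_agent_id() -> str:
    # Placeholder: in reality the master assigns the ID at registration.
    return "agent-" + uuid.uuid4().hex


def recover(current: AgentInfo, state: Optional[CheckpointedState]) -> str:
    """Return the agent ID to use after a restart."""
    if state is None:
        # Nothing checkpointed: start as a new agent.
        return new_agent_id()
    if state.info == current:
        # Recovery succeeds, with or without a reboot: keep the agent ID.
        return state.agent_id
    if state.rebooted:
        # Agent info changed across a reboot: fall back to a new agent
        # instead of flapping.
        return new_agent_id()
    # Assumption: a mismatch without a reboot still fails recovery.
    raise RecoveryError("agent info mismatch")
```

With this model, an unchanged agent keeps its ID across a reboot, while a resource change combined with a reboot yields a fresh ID rather than a recovery failure.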



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-06-14 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048741#comment-16048741
 ] 

Deshi Xiao commented on MESOS-6223:
---

[~megha.sharma] The patch needs to be refactored.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-05-16 Thread Megha Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013251#comment-16013251
 ] 

Megha Sharma commented on MESOS-6223:
-

Review Request
https://reviews.apache.org/r/56895/



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-05-16 Thread Megha Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013053#comment-16013053
 ] 

Megha Sharma commented on MESOS-6223:
-

[~xds2000] Sorry, I was on vacation last week. I am going to address Vinod's 
comments as a priority and try to close this ASAP.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-05-05 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999209#comment-15999209
 ] 

Deshi Xiao commented on MESOS-6223:
---

[~megha.sharma] Could you please refresh the patch based on Vinod Kone's suggestions?



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-28 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988379#comment-15988379
 ] 

Deshi Xiao commented on MESOS-6223:
---

Thanks Neil.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-26 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985294#comment-15985294
 ] 

Neil Conway commented on MESOS-6223:


[~xds2000] Folks, my apologies for not reviewing this more promptly. 
[~vinodkone] has kindly agreed to take over shepherding this ticket.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-26 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984224#comment-15984224
 ] 

Deshi Xiao commented on MESOS-6223:
---

[~neilc] Any progress on this?



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Megha Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971344#comment-15971344
 ] 

Megha Sharma commented on MESOS-6223:
-

[~neilc] I'm on it and looking into the test failure. I should have the patch 
ready very soon.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971303#comment-15971303
 ] 

Neil Conway commented on MESOS-6223:


[~xds2000] -- There is a known test failure that AFAIK hasn't been resolved yet 
(details are on ReviewBoard). I'm waiting for that to be addressed before I dig 
into these changes more deeply -- but I'd like to get this change wrapped up 
and shipped pretty soon.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970939#comment-15970939
 ] 

Deshi Xiao commented on MESOS-6223:
---

[~neilc] Do you have any update on this patch? 
https://reviews.apache.org/r/56895/



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-03-04 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895772#comment-15895772
 ] 

Deshi Xiao commented on MESOS-6223:
---

Any update on the review patch?



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-03-01 Thread Tim Harper (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890850#comment-15890850
 ] 

Tim Harper commented on MESOS-6223:
---

This should help fix an issue we are seeing with tasks and reserved resources 
in Marathon:

https://github.com/mesosphere/marathon/issues/5284

In Marathon's case, when a residential (has reserved resources) task becomes 
unreachable because the node is rebooting, we never receive a terminal state for 
the task, even though the host reboots and comes back online. We believe this is 
because during reconciliation we send the old agent ID and the task ID, and 
Mesos continually reports an unknown status. If the agent in question kept 
the same agent ID, then an explicit reconciliation of that agent ID plus the 
task ID should, I think, result in a status update that signals a definite 
terminal state.
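
To make the hypothesis above concrete, here is a toy Python model of explicit reconciliation. It is only a sketch of the reasoning in this comment, not the master's real logic: the `reconcile` function and the `registered` mapping are hypothetical, and using TASK_LOST as the terminal status for a known agent with an unknown task is an assumption.

```python
from typing import Dict, Set


def reconcile(registered: Dict[str, Set[str]], agent_id: str, task_id: str) -> str:
    """Return the status a framework would see for (agent_id, task_id).

    `registered` maps each registered agent ID to its running task IDs.
    """
    tasks = registered.get(agent_id)
    if tasks is None:
        # The master no longer knows this agent ID, e.g. because the agent
        # came back from the reboot under a new ID.
        return "TASK_UNKNOWN"
    if task_id in tasks:
        return "TASK_RUNNING"
    # Known agent, but the task did not survive the reboot: terminal status
    # (assumed to be TASK_LOST here).
    return "TASK_LOST"


# Current behavior: the agent re-registers under a new ID after the reboot.
before = {"agent-new": set()}
# Proposed behavior: the agent keeps its ID after the reboot.
after = {"agent-old": set()}

print(reconcile(before, "agent-old", "task-1"))
print(reconcile(after, "agent-old", "task-1"))
```

In the first call the framework only ever gets an unknown status for the old agent ID; in the second it gets a terminal status and can relaunch the task.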



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-02-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888719#comment-15888719
 ] 

Joseph Wu commented on MESOS-6223:
--

See this ticket for a degenerate case where the agent will struggle to recover: 
https://issues.apache.org/jira/browse/MESOS-6285



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-02-28 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888650#comment-15888650
 ] 

Neil Conway commented on MESOS-6223:


When we implement this, we should make sure that we benchmark the performance 
impact on agent recovery, in particular when there is frequent task churn on 
the agent. For example, when an agent has 10k-100k completed tasks and a few (< 
20) running/live tasks; when the agent reboots, we should benchmark how long it 
takes for the agent to complete recovery. This is the situation that motivated 
the introduction of the "boot id" shortcut in the first place. (cc 
[~megha.sharma] [~xujyan])



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-02-16 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870821#comment-15870821
 ] 

Yan Xu commented on MESOS-6223:
---

From my comment on the email thread:

{quote}
So one thing that was brought up during offline conversations was that if the 
host reboot is associated with hardware change (e.g., a new memory stick):

Currently: the agent would skip the recovery (and the chance of running into 
incompatible agent info) and register as a new agent.
With the change: the agent could run into incompatible agent info due to 
resource change and flap indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent 
run `rm -f /meta/slaves/latest` automatically upon recovery 
failure, but only after the host has rebooted. This way the agent can restart as 
a new agent without operator intervention. 
{quote}

Of course, even if we do this to maintain the current behavior, it remains true 
that relying on reboot as a signal for hardware change is not reliable, but the 
fix for that should be MESOS-1739.
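
The mitigation in the quote above could look roughly like the following Python sketch. It is illustrative only: `cleanup_after_failed_recovery` is a hypothetical helper, the prefix of the path is not shown in the comment and is therefore passed in as a `root` parameter, and the two boolean inputs stand in for whatever the agent actually uses to detect a reboot and a recovery failure.

```python
import os


def cleanup_after_failed_recovery(root: str, rebooted: bool,
                                  recovery_failed: bool) -> bool:
    """Remove the `latest` symlink only if recovery failed after a reboot.

    Returns True if the cleanup step ran, False if it was skipped.
    """
    if not (rebooted and recovery_failed):
        # Without a reboot, keep today's behavior: leave the symlink alone
        # and let the operator decide.
        return False
    latest = os.path.join(root, "meta", "slaves", "latest")
    try:
        # Equivalent of `rm -f`: removes the symlink itself, not its target.
        os.remove(latest)
    except FileNotFoundError:
        pass
    return True
```

Only the symlink is removed, so the previous agent's directory stays on disk; the next start simply finds no latest agent to recover and registers as a new one.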



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-11-10 Thread Megha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655590#comment-15655590
 ] 

Megha commented on MESOS-6223:
--

[~neilc]
Here I am analyzing the impact of allowing the agent to recover after a reboot in the 
context of partition awareness. As I understand it, this introduces no new transition 
that isn't already happening with partition awareness. Do you think there 
could be a risk involved in allowing recovery after reboots?

1. If there are no partition-aware frameworks on the agent: while rebooting, the 
agent could either be disconnected or fail the master's health check 
timeout. The executors don't re-register because they exited due to the 
reboot. The agent re-registers and starts sending status updates for 
unacknowledged updates. From the framework's point of view the transition is simply 
TASK_STARTING/TASK_RUNNING -> TASK_LOST.

2. If there are tasks from partition-aware frameworks on the agent: 
a. The transition is the same as above if the agent is disconnected.
b. If the agent is marked unreachable while it is rebooting, then from the 
framework's point of view the tasks transition from TASK_UNREACHABLE -> 
TASK_GONE when the agent re-registers and sends status updates. Because 
unreachable agents are stored in the registry, the master remembers them across 
failovers; if the agent doesn't come back, frameworks will receive a 
TASK_UNREACHABLE update upon reconciliation unless the registry has been purged.
c. If the agent is marked gone, the master sends TASK_GONE. If such 
an agent doesn't come back, future framework reconciliations will result in a 
TASK_UNKNOWN status update: there is no registry of gone agents, so they 
aren't remembered across master failovers. If the agent eventually comes 
back, the task could transition from TASK_UNKNOWN back to TASK_GONE.
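
The cases above can be summarized as a small lookup, sketched here in Python. This only encodes the transitions as described in this comment; the `observed_statuses` function and its parameters are hypothetical, and combinations the comment does not cover raise an error rather than guess.

```python
from typing import List


def observed_statuses(partition_aware: bool, agent_state: str,
                      agent_returns: bool) -> List[str]:
    """Statuses a framework would observe for a task on a rebooted agent.

    `agent_state` is one of "disconnected", "unreachable" or "gone".
    """
    if not partition_aware or agent_state == "disconnected":
        # Cases 1 and 2a: only described for an agent that re-registers.
        if not agent_returns:
            raise ValueError("case not covered in the comment")
        return ["TASK_RUNNING", "TASK_LOST"]
    if agent_state == "unreachable":
        # Case 2b: unreachable agents are remembered in the registry.
        if agent_returns:
            return ["TASK_UNREACHABLE", "TASK_GONE"]
        return ["TASK_UNREACHABLE"]
    if agent_state == "gone":
        # Case 2c: gone agents are not remembered across master failovers.
        if agent_returns:
            return ["TASK_GONE", "TASK_UNKNOWN", "TASK_GONE"]
        return ["TASK_GONE", "TASK_UNKNOWN"]
    raise ValueError("unknown agent state: " + agent_state)
```

For example, `observed_statuses(True, "unreachable", True)` gives the TASK_UNREACHABLE -> TASK_GONE transition from case 2b.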




[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-19 Thread Megha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590108#comment-15590108
 ] 

Megha commented on MESOS-6223:
--

Recovery of the agent after a reboot is required to support restarting 
Restartable tasks when the executor dies as a result of an agent host reboot. 
Here's the detailed design doc for Restartable Tasks:

https://docs.google.com/document/d/1YS_EBUNLkzpSru0dwn_hPUIeTATiWckSaosXSIaHUCo/edit?usp=sharing




[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-07 Thread Megha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1500#comment-1500
 ] 

Megha commented on MESOS-6223:
--

This JIRA came up as a prerequisite for supporting task restart after a reboot. 
There are definitely use cases where you would need a persistent agent ID, 
because resources like persistent volumes are not tied to the lifecycle of the 
ephemeral agent and exist even after the agent is gone. However, to 
support task restart on the rebooted host, we need the previous agent 
ID or session ID (from MESOS-5368) to recover, figure out which tasks to 
restart, and eventually restart them. So I believe agent or session 
recovery after a reboot is needed. I believe short-circuiting recovery 
after a reboot is an optimization based on the fact that no tasks/executors are 
running after the agent's host reboots, which will change with MESOS-3545.



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553432#comment-15553432
 ] 

Yan Xu commented on MESOS-6223:
---

[~neilc] [~vinodkone] I can think of ways we can implement restarting tasks 
post-reboot (MESOS-3545, will have design doc out soon) via either the approach 
in this ticket or in MESOS-5368, but this one feels simpler. Treating reboot as a 
special case sounds to me like an optimization that will no longer hold true once 
tasks are restarted. Then the questions are:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed?

1) Sounds like no.

For 2), on the master the only error case where we disallow an agent from 
re-registering but do allow it to register is [when the agent's IP or 
hostname has 
changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228].
 I can imagine we'd want to force the agent to get rid of its 
{{work_dir//slave_id}} but keep the checkpointed resources etc.?

To summarize, seems like we can keep both this ticket and MESOS-5368, but 
change MESOS-5368 to not change the session ID in the reboot case?

Thoughts?



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552427#comment-15552427
 ] 

Neil Conway commented on MESOS-6223:


cc [~vinodkone]



[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2016-10-06 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552425#comment-15552425
 ] 

Neil Conway commented on MESOS-6223:


Another way to go here would be to introduce a new type of "persistent agent 
ID", as discussed in MESOS-5368 -- that would essentially be an ID for a given 
{{work_dir}}, whereas the existing Agent ID would remain closer to a "session 
ID".
