[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2016-08-26 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439996#comment-15439996
 ] 

Vinod Kone commented on MESOS-4049:
---

Author: Neil Conway 
Date:   Fri Aug 26 14:48:47 2016 -0700

Made a few minor tweaks to comments.

Review: https://reviews.apache.org/r/50704/

commit 0b90cccaca0069a2e2fff54d1424d205659346a3
Author: Neil Conway 
Date:   Fri Aug 26 14:48:39 2016 -0700

Removed a no-longer-relevant test.

The behavior this test is trying to validate (slaves receive a
`ShutdownMessage` if they attempt to reregister after failing health
checks) will be changed shortly. Moreover, the new behavior is already
covered by other test cases.

Review: https://reviews.apache.org/r/50703/

commit 93016d37bf8833d7a78ada9c4ec59a374419ba35
Author: Neil Conway 
Date:   Fri Aug 26 14:48:16 2016 -0700

Renamed metrics from "slave_shutdowns" to "slave_unreachable".

The master will shortly be changed to no longer shut down unhealthy
agents, so the previous metric name is no longer accurate. The old
metric names have been kept for backwards compatibility, but they
are no longer updated (i.e., they will always be set to zero).

Review: https://reviews.apache.org/r/50702/

commit af496f3a80da9a8e7961fb62f839aacf1658222e
Author: Neil Conway 
Date:   Fri Aug 26 14:48:07 2016 -0700

Added registrar operations for marking agents (un-)reachable.

Review: https://reviews.apache.org/r/50701/

commit 540591407729ae9eaf81f68cb025b181782c5b99
Author: Neil Conway 
Date:   Fri Aug 26 14:48:03 2016 -0700

Added a list of "unreachable" agents to the registry.

These are agents that have failed health checks.

Review: https://reviews.apache.org/r/50700/
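
As a rough illustration of the two registry commits above, here is a minimal Python sketch of a registry that tracks unreachable agents alongside admitted ones. The names (`Registry`, `admitted`, `mark_unreachable`, `mark_reachable`) and the move-between-sets behavior are assumptions made for illustration; they are not taken from the Mesos sources or from these commits.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    # Hypothetical model: agents the master currently considers registered.
    admitted: set = field(default_factory=set)
    # Hypothetical model: agents that have failed health checks.
    unreachable: set = field(default_factory=set)


def mark_unreachable(registry: Registry, agent_id: str) -> bool:
    """Move an admitted agent to the unreachable list; False if not admitted."""
    if agent_id not in registry.admitted:
        return False
    registry.admitted.remove(agent_id)
    registry.unreachable.add(agent_id)
    return True


def mark_reachable(registry: Registry, agent_id: str) -> bool:
    """Move an unreachable agent back to admitted; False if it was not unreachable."""
    if agent_id not in registry.unreachable:
        return False
    registry.unreachable.remove(agent_id)
    registry.admitted.add(agent_id)
    return True
```

Each operation reports whether it changed the registry, so a caller could skip writing an unchanged registry back to storage.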

commit c3268cad3621a6373ff331d882393b2ada064f4b
Author: Neil Conway 
Date:   Fri Aug 26 14:47:53 2016 -0700

Added new TaskState values and PARTITION_AWARE capability.

The new values are TASK_DROPPED, TASK_UNREACHABLE, TASK_GONE,
TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. They are intended to replace the existing
TASK_LOST state by offering more fine-grained information on the
current state of a task. These states will only be sent to frameworks
that opt into this new behavior via the PARTITION_AWARE capability.

Note that this commit doesn't add a master metric for the TASK_UNKNOWN
status, because this is a "default" status reported when the master has
no knowledge of a particular task/agent ID. Hence the number of
"unknown" tasks at any given time is not a well-defined metric.

Review: https://reviews.apache.org/r/50699/
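
The opt-in described in the commit message above can be sketched roughly as follows. This is a hedged Python illustration, not the master's actual code: the function `state_for_framework` and the `FINE_GRAINED` set are hypothetical, and the fallback to TASK_LOST for frameworks without the capability is an assumption inferred from the statement that the new states are only sent to PARTITION_AWARE frameworks.

```python
from enum import Enum


class TaskState(Enum):
    TASK_RUNNING = "TASK_RUNNING"
    TASK_LOST = "TASK_LOST"
    TASK_DROPPED = "TASK_DROPPED"
    TASK_UNREACHABLE = "TASK_UNREACHABLE"
    TASK_GONE = "TASK_GONE"
    TASK_GONE_BY_OPERATOR = "TASK_GONE_BY_OPERATOR"
    TASK_UNKNOWN = "TASK_UNKNOWN"


PARTITION_AWARE = "PARTITION_AWARE"

# The five states named in the commit message.
FINE_GRAINED = frozenset({
    TaskState.TASK_DROPPED,
    TaskState.TASK_UNREACHABLE,
    TaskState.TASK_GONE,
    TaskState.TASK_GONE_BY_OPERATOR,
    TaskState.TASK_UNKNOWN,
})


def state_for_framework(state: TaskState, capabilities) -> TaskState:
    """Return the state to report to a framework with the given capabilities.

    Assumption: frameworks that did not opt in keep seeing TASK_LOST.
    """
    if state in FINE_GRAINED and PARTITION_AWARE not in capabilities:
        return TaskState.TASK_LOST
    return state
```

For example, `state_for_framework(TaskState.TASK_UNREACHABLE, {PARTITION_AWARE})` returns `TaskState.TASK_UNREACHABLE`, while the same call with an empty capability set returns `TaskState.TASK_LOST`.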


> Allow user to control behavior of partitioned agents/tasks
> --
>
> Key: MESOS-4049
> URL: https://issues.apache.org/jira/browse/MESOS-4049
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, slave
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits 
> for a period of time (see MESOS-4048) before deciding that the agent is dead. 
> Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the 
> tasks running on the agent, and instructs the agent to shut down.
> Although this behavior is desirable for some/many users, it is not ideal for 
> everyone. For example:
> * Some users might want to aggressively start a new replacement task (e.g., 
> after one or two ping timeouts are missed); then when the old copy of the 
> task comes back, they might want to make an intelligent decision about how to 
> reconcile this situation (e.g., kill old, kill new, allow both to continue 
> running).
> * Some frameworks might want different behavior from other frameworks, or to 
> treat some tasks differently from other tasks. For example, if a task has a 
> huge amount of state that would need to be regenerated to spin up another 
> instance, the user might want to wait longer before starting a new task to 
> increase the chance that the old task will reappear.
> To do this, we'd need to change task state so that a task can go from 
> {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from 
> that state back to {{RUNNING}} (or perhaps we could keep the current 
> "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} 
> could also transition to {{LOST}}). The agent would also keep its old 
> {{slaveId}} when it reconnects.
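
The task state transitions proposed in the description above can be sketched as a small transition table. This is only an illustration of the proposal as worded in the ticket: the state names {{RUNNING}}, {{UNKNOWN}}, and {{LOST}} come from the description, while the helper names and the `mark_lost_after_timeout` flag are hypothetical.

```python
def allowed_transitions(mark_lost_after_timeout: bool) -> set:
    """Transitions proposed in the ticket, as (from_state, to_state) pairs."""
    transitions = {
        ("RUNNING", "UNKNOWN"),  # agent is partitioned away from the master
        ("UNKNOWN", "RUNNING"),  # agent reconnects, keeping its old slaveId
    }
    if mark_lost_after_timeout:
        # Optional: keep the current "mark-lost-after-timeout" behavior.
        transitions.add(("UNKNOWN", "LOST"))
    return transitions


def can_transition(src: str, dst: str, mark_lost_after_timeout: bool = False) -> bool:
    """Check whether a single state change is allowed under the proposal."""
    return (src, dst) in allowed_transitions(mark_lost_after_timeout)
```

With the flag off, `can_transition("UNKNOWN", "LOST")` is `False`; with `mark_lost_after_timeout=True` it is `True`.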



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2016-08-05 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410286#comment-15410286
 ] 

Vinod Kone commented on MESOS-4049:
---

commit 5220f77582a14d4cdd0a907ba8af6e9db87d8ab7
Author: Neil Conway 
Date:   Fri Aug 5 16:41:41 2016 -0700

Future-proofed some slave removal tests.

These tests relied on the implementation detail that when an agent is
removed from the list of registered agents, the master sends a
ShutdownSlaveMessage to the agent. That will change in the future
(MESOS-4049). To prepare for this future planned behavior, adjust these
tests to be more robust by instead checking for the invocation of the
`slaveLost` scheduler callback.

Review: https://reviews.apache.org/r/50422/

commit 8a0b17a11560f482628e890094e83400fa805a80
Author: Neil Conway 
Date:   Fri Aug 5 16:41:35 2016 -0700

Cleaned up comments in fault tolerance tests.

Review: https://reviews.apache.org/r/50418/

commit 5de96fa4b3e603553dbae3f06aff6621b268a7be
Author: Neil Conway 
Date:   Fri Aug 5 16:41:28 2016 -0700

Improved consistency of test code for partitioning an agent.

Removed unnecessary `Clock::settle` calls: `Clock::settle` should
typically only be used when a test case does not have an easy way to
wait for a _specific_ event to occur. In this case, `Clock::settle` was
unnecessary because the test code immediately proceeded to `AWAIT_READY`
for a more specific event.

Also fixed up some whitespace.

Review: https://reviews.apache.org/r/50417/

commit 60dbd347b409c788776760a8270965d943b6806e
Author: Neil Conway 
Date:   Fri Aug 5 16:41:18 2016 -0700

Added more assertions to master code.

Review: https://reviews.apache.org/r/50416/

commit 29925658291be60bda7af7f83225d743e8d24870
Author: Neil Conway 
Date:   Fri Aug 5 16:41:10 2016 -0700

Added more expectations to TASK_LOST test cases.

Check the reason and source of TASK_LOST status updates, replaced
ASSERT_ with EXPECT_ in various places where EXPECT_ is more
appropriate.

Review: https://reviews.apache.org/r/50235/




[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2016-06-20 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15339563#comment-15339563
 ] 

Neil Conway commented on MESOS-4049:


Initial design doc for this feature: 
https://docs.google.com/document/d/1AYoF5HZPRdQN2TsRpPOliGC6oHen6aHVc0FBOo30rLQ



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2016-02-18 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153490#comment-15153490
 ] 

Yan Xu commented on MESOS-4049:
---

Looks like use case one "Some users might want to aggressively start a new 
replacement task" would require MESOS-4048 but we can start with this ticket 
relying on {{max_slave_ping_timeouts * slave_ping_timeout}}.
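
To make the product above concrete, here is a small Python sketch of the arithmetic. The function name is hypothetical, and the values of 15 seconds and 5 timeouts are illustrative assumptions; they are not stated anywhere in this thread.

```python
from datetime import timedelta


def time_until_marked_lost(slave_ping_timeout: timedelta,
                           max_slave_ping_timeouts: int) -> timedelta:
    """Total time the master waits before giving up on a silent agent."""
    return slave_ping_timeout * max_slave_ping_timeouts


# Illustrative values only: 15-second ping timeout, 5 allowed timeouts.
print(time_until_marked_lost(timedelta(seconds=15), 5))  # prints 0:01:15
```

Under these assumed values a framework would hear about a partitioned agent after 75 seconds, which is the granularity this ticket would start with until MESOS-4048 allows something more aggressive.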



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036599#comment-15036599
 ] 

Vinod Kone commented on MESOS-4049:
---

+100



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037152#comment-15037152
 ] 

Klaus Ma commented on MESOS-4049:
-

I like the {{replacement task}} feature :). Just to confirm: in this JIRA, 
Mesos only provides a new state for the connection glitch ({{UNKNOWN}} or 
{{WANDERING}}); the "replacement task" is handled by the framework.



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037154#comment-15037154
 ] 

Neil Conway commented on MESOS-4049:


Yes.



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037282#comment-15037282
 ] 

Klaus Ma commented on MESOS-4049:
-

And at which point would you like to report the new state to the framework? On 
the first failed ping, or at a configurable point, e.g. after some number of 
failed pings (< max_slave_ping_timeouts)?



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037207#comment-15037207
 ] 

Guangya Liu commented on MESOS-4049:


[~neilc] Got it, thanks! Then I think we may need to consider the case where an 
UNKNOWN or WANDERING task gets killed. Should we mark it as ZOMBIE and, when the 
host comes back, just mark the ZOMBIE task as TASK_FINISHED?



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037116#comment-15037116
 ] 

Neil Conway commented on MESOS-4049:


I'm not sure {{ZOMBIE}} accurately describes the intended behavior -- for 
example, in Unix a zombie process cannot come back to life. A zombie process is 
definitely dead (it just hasn't been properly cleaned up), whereas in this case 
the true state of the task is not known (to the master/framework).



[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

2015-12-02 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037109#comment-15037109
 ] 

Guangya Liu commented on MESOS-4049:


It would be great to add such an intelligent feature. BTW: it might align better 
with the Linux concept if we name the transitional task state "ZOMBIE".
