[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439996#comment-15439996 ]

Vinod Kone commented on MESOS-4049:
-----------------------------------

Author: Neil Conway
Date:   Fri Aug 26 14:48:47 2016 -0700

    Made a few minor tweaks to comments.

    Review: https://reviews.apache.org/r/50704/

commit 0b90cccaca0069a2e2fff54d1424d205659346a3
Author: Neil Conway
Date:   Fri Aug 26 14:48:39 2016 -0700

    Removed a no-longer-relevant test.

    The behavior this test is trying to validate (slaves receive a
    `ShutdownMessage` if they attempt to reregister after failing health
    checks) will be changed shortly. Moreover, the new behavior is already
    covered by other test cases.

    Review: https://reviews.apache.org/r/50703/

commit 93016d37bf8833d7a78ada9c4ec59a374419ba35
Author: Neil Conway
Date:   Fri Aug 26 14:48:16 2016 -0700

    Renamed metrics from "slave_shutdowns" to "slave_unreachable".

    The master will shortly be changed to no longer shutdown unhealthy
    agents, so the previous metric name is no longer accurate. The old
    metric names have been kept for backwards compatibility, but they are
    no longer updated (i.e., they will always be set to zero).

    Review: https://reviews.apache.org/r/50702/

commit af496f3a80da9a8e7961fb62f839aacf1658222e
Author: Neil Conway
Date:   Fri Aug 26 14:48:07 2016 -0700

    Added registrar operations for marking agents (un-)reachable.

    Review: https://reviews.apache.org/r/50701/

commit 540591407729ae9eaf81f68cb025b181782c5b99
Author: Neil Conway
Date:   Fri Aug 26 14:48:03 2016 -0700

    Added a list of "unreachable" agents to the registry.

    These are agents that have failed health checks.

    Review: https://reviews.apache.org/r/50700/

commit c3268cad3621a6373ff331d882393b2ada064f4b
Author: Neil Conway
Date:   Fri Aug 26 14:47:53 2016 -0700

    Added new TaskState values and PARTITION_AWARE capability.

    TASK_DROPPED, TASK_UNREACHABLE, TASK_GONE, TASK_GONE_BY_OPERATOR, and
    TASK_UNKNOWN.

    These values are intended to replace the existing TASK_LOST state by
    offering more fine-grained information on the current state of a task.
    These states will only be sent to frameworks that opt into this new
    behavior via the PARTITION_AWARE capability.

    Note that this commit doesn't add a master metric for the TASK_UNKNOWN
    status, because this is a "default" status reported when the master has
    no knowledge of a particular task/agent ID. Hence the number of
    "unknown" tasks at any given time is not a well-defined metric.

    Review: https://reviews.apache.org/r/50699/

> Allow user to control behavior of partitioned agents/tasks
> ----------------------------------------------------------
>
>                 Key: MESOS-4049
>                 URL: https://issues.apache.org/jira/browse/MESOS-4049
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits
> for a period of time (see MESOS-4048) before deciding that the agent is dead.
> Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the
> tasks running on the agent, and instructs the agent to shutdown.
> Although this behavior is desirable for some/many users, it is not ideal for
> everyone. For example:
> * Some users might want to aggressively start a new replacement task (e.g.,
> after one or two ping timeouts are missed); then when the old copy of the
> task comes back, they might want to make an intelligent decision about how to
> reconcile this situation (e.g., kill old, kill new, allow both to continue
> running).
> * Some frameworks might want different behavior from other frameworks, or to
> treat some tasks differently from other tasks. For example, if a task has a
> huge amount of state that would need to be regenerated to spin up another
> instance, the user might want to wait longer before starting a new task to
> increase the chance that the old task will reappear.
> To do this, we'd need to change task state so that a task can go from
> {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from
> that state back to {{RUNNING}} (or perhaps we could keep the current
> "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}}
> could also transition to {{LOST}}). The agent would also keep its old
> {{slaveId}} when it reconnects.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
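The opt-in rule from the commit that adds the PARTITION_AWARE capability can be sketched roughly as follows. This is a minimal illustration in Python, not Mesos code: `state_for_framework` and `FINE_GRAINED_STATES` are hypothetical names, and the fallback to TASK_LOST for frameworks that have not opted in is an assumption drawn from the statement that the new values are meant to replace TASK_LOST. Only the state names and the capability name come from the commit message.

```python
from enum import Enum


class TaskState(Enum):
    # Existing coarse-grained state.
    TASK_LOST = "TASK_LOST"
    # New fine-grained states named in the commit message.
    TASK_DROPPED = "TASK_DROPPED"
    TASK_UNREACHABLE = "TASK_UNREACHABLE"
    TASK_GONE = "TASK_GONE"
    TASK_GONE_BY_OPERATOR = "TASK_GONE_BY_OPERATOR"
    TASK_UNKNOWN = "TASK_UNKNOWN"


# Hypothetical helper set: the states intended to stand in for TASK_LOST.
FINE_GRAINED_STATES = frozenset({
    TaskState.TASK_DROPPED,
    TaskState.TASK_UNREACHABLE,
    TaskState.TASK_GONE,
    TaskState.TASK_GONE_BY_OPERATOR,
    TaskState.TASK_UNKNOWN,
})


def state_for_framework(state, capabilities):
    """Return the state a framework would be sent (illustrative only).

    Assumption: a framework without PARTITION_AWARE keeps receiving
    TASK_LOST wherever a fine-grained state would otherwise apply.
    """
    if state in FINE_GRAINED_STATES and "PARTITION_AWARE" not in capabilities:
        return TaskState.TASK_LOST
    return state
```

A framework that lists `"PARTITION_AWARE"` in its capabilities would see `TASK_UNREACHABLE` unchanged, while one that does not would see `TASK_LOST` in this sketch.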
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15410286#comment-15410286 ]

Vinod Kone commented on MESOS-4049:
-----------------------------------

commit 5220f77582a14d4cdd0a907ba8af6e9db87d8ab7
Author: Neil Conway
Date:   Fri Aug 5 16:41:41 2016 -0700

    Future-proofed some slave removal tests.

    These tests relied on the implementation detail that when an agent is
    removed from the list of registered agents, the master sends a
    ShutdownSlaveMessage to the agent. That will change in the future
    (MESOS-4049). To prepare for this future planned behavior, adjust these
    tests to be more robust by instead checking for the invocation of the
    `slaveLost` scheduler callback.

    Review: https://reviews.apache.org/r/50422/

commit 8a0b17a11560f482628e890094e83400fa805a80
Author: Neil Conway
Date:   Fri Aug 5 16:41:35 2016 -0700

    Cleaned up comments in fault tolerance tests.

    Review: https://reviews.apache.org/r/50418/

commit 5de96fa4b3e603553dbae3f06aff6621b268a7be
Author: Neil Conway
Date:   Fri Aug 5 16:41:28 2016 -0700

    Improved consistency of test code for partitioning an agent.

    Removed unnecessary `Clock::settle` calls: `Clock::settle` should
    typically only be used when a test case does not have an easy way to
    wait for a _specific_ event to occur. In this case, `Clock::settle` was
    unnecessary because the test code immediately proceeded to
    `AWAIT_READY` for a more specific event.

    Also fixed up some whitespace.

    Review: https://reviews.apache.org/r/50417/

commit 60dbd347b409c788776760a8270965d943b6806e
Author: Neil Conway
Date:   Fri Aug 5 16:41:18 2016 -0700

    Added more assertions to master code.

    Review: https://reviews.apache.org/r/50416/

commit 29925658291be60bda7af7f83225d743e8d24870
Author: Neil Conway
Date:   Fri Aug 5 16:41:10 2016 -0700

    Added more expectations to TASK_LOST test cases.

    Check the reason and source of TASK_LOST status updates, replaced
    ASSERT_ with EXPECT_ in various places where EXPECT_ is more
    appropriate.

    Review: https://reviews.apache.org/r/50235/
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15339563#comment-15339563 ]

Neil Conway commented on MESOS-4049:

Initial design doc for this feature: https://docs.google.com/document/d/1AYoF5HZPRdQN2TsRpPOliGC6oHen6aHVc0FBOo30rLQ
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153490#comment-15153490 ]

Yan Xu commented on MESOS-4049:
-------------------------------

It looks like use case one ("Some users might want to aggressively start a new replacement task") would require MESOS-4048, but we can start with this ticket by relying on {{max_slave_ping_timeouts * slave_ping_timeout}}.
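As a worked example of the product {{max_slave_ping_timeouts * slave_ping_timeout}} mentioned above, the following Python sketch computes how long an agent could stay silent before the master gives up on it. The function name `unreachable_window` is hypothetical, and the inputs (15 seconds, 5 timeouts) are example values chosen for illustration rather than values taken from this thread.

```python
from datetime import timedelta


def unreachable_window(slave_ping_timeout, max_slave_ping_timeouts):
    """Total silence tolerated before the agent is considered failed.

    Hypothetical helper: it only multiplies the two settings named in
    the comment above.
    """
    return slave_ping_timeout * max_slave_ping_timeouts


# Example values (assumed for illustration).
window = unreachable_window(timedelta(seconds=15), 5)
print(window.total_seconds())  # 75.0
```

With these inputs a framework would not hear about the partition for 75 seconds, which is why starting a replacement after only "one or two" missed pings depends on the finer-grained timeout control tracked in MESOS-4048.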
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036599#comment-15036599 ]

Vinod Kone commented on MESOS-4049:
-----------------------------------

+100
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037152#comment-15037152 ]

Klaus Ma commented on MESOS-4049:
---------------------------------

I like the {{replacement task}} feature :). Just want to confirm: in this JIRA, Mesos only provides a new state for a connection glitch ({{UNKNOWN}} or {{WANDERING}}); the "replacement task" is handled by the framework.
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037154#comment-15037154 ]

Neil Conway commented on MESOS-4049:

Yes.
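Since the replacement decision is left to the framework, a framework-side policy might look roughly like the Python sketch below. Everything here is hypothetical and is not part of any Mesos API: the names `TaskPolicy`, `Action`, `Reconcile`, `on_unknown`, and `on_old_task_returned` are invented for illustration. The sketch only encodes the two ideas from the issue description: a per-task wait before starting a replacement, and a choice among "kill old, kill new, allow both to continue running" when the old copy reappears.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    LAUNCH_REPLACEMENT = "launch_replacement"


class Reconcile(Enum):
    KILL_OLD = "kill_old"
    KILL_NEW = "kill_new"
    KEEP_BOTH = "keep_both"


@dataclass
class TaskPolicy:
    # How long the framework tolerates the unknown state before replacing.
    grace_seconds: float
    # What to do if the old copy reappears after a replacement was started.
    on_return: Reconcile


def on_unknown(policy, seconds_unknown):
    """Decide whether to start a replacement for a task in the new state."""
    if seconds_unknown >= policy.grace_seconds:
        return Action.LAUNCH_REPLACEMENT
    return Action.WAIT


def on_old_task_returned(policy, replacement_running):
    """Return the reconcile decision, or None if there is nothing to reconcile."""
    return policy.on_return if replacement_running else None


# A cheap stateless task is replaced quickly; a task with a lot of state waits longer.
stateless = TaskPolicy(grace_seconds=30.0, on_return=Reconcile.KILL_OLD)
stateful = TaskPolicy(grace_seconds=600.0, on_return=Reconcile.KILL_NEW)
```

Keeping the policy per task, rather than per framework, is one way to let a framework "treat some tasks differently from other tasks", as the issue description puts it.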
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037282#comment-15037282 ]

Klaus Ma commented on MESOS-4049:
---------------------------------

And at which point would you like to report the new state to the framework? As soon as a ping fails, or at a configurable point, e.g., after some number of failed pings (< max_slave_ping_timeouts)?
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037207#comment-15037207 ]

Guangya Liu commented on MESOS-4049:

[~neilc] Got it, thanks! Then I think we may need to consider the case where an UNKNOWN or WANDERING task gets killed. Shall we mark it as ZOMBIE and, when the host comes back, just mark the ZOMBIE task as TASK_FINISHED?
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037116#comment-15037116 ]

Neil Conway commented on MESOS-4049:

I'm not sure {{ZOMBIE}} accurately describes the intended behavior -- for example, in Unix a zombie process cannot come back to life. A zombie process is definitely dead (it just hasn't been properly cleaned up), whereas in this case the true state of the task is not known (to the master/framework).
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037109#comment-15037109 ]

Guangya Liu commented on MESOS-4049:

It would be great to add such an intelligent feature. BTW: it might be more aligned with the Linux concept if we name the transitional task state "ZOMBIE".