[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037282#comment-15037282 ]
Klaus Ma commented on MESOS-4049:
---------------------------------

And at which point would you like to report the new state to the framework? When a ping fails, or at a configurable point, e.g. after some number of failed pings (< max_slave_ping_times)?

> Allow user to control behavior of partitioned agents/tasks
> ----------------------------------------------------------
>
>                 Key: MESOS-4049
>                 URL: https://issues.apache.org/jira/browse/MESOS-4049
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>              Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits
> for a period of time (see MESOS-4048) before deciding that the agent is dead.
> Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the
> tasks running on the agent, and instructs the agent to shut down.
> Although this behavior is desirable for some/many users, it is not ideal for
> everyone. For example:
> * Some users might want to aggressively start a new replacement task (e.g.,
> after one or two ping timeouts are missed); then when the old copy of the
> task comes back, they might want to make an intelligent decision about how to
> reconcile this situation (e.g., kill old, kill new, allow both to continue
> running).
> * Some frameworks might want different behavior from other frameworks, or to
> treat some tasks differently from other tasks. For example, if a task has a
> huge amount of state that would need to be regenerated to spin up another
> instance, the user might want to wait longer before starting a new task to
> increase the chance that the old task will reappear.
> To do this, we'd need to change task state so that a task can go from
> {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from
> that state back to {{RUNNING}} (or perhaps we could keep the current
> "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}}
> could also transition to {{LOST}}).
> The agent would also keep its old {{slaveId}} when it reconnects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
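
For illustration only, the task-state transitions proposed in the issue description, combined with the "report after N failed pings" idea from the comment, could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Mesos code: the class `PartitionTracker`, its parameters `report_after` and `max_pings`, and the methods `ping_missed` and `agent_reconnected` are all hypothetical names, and {{UNKNOWN}} is just the placeholder state name used in the issue.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"  # placeholder name from the issue; not an existing state
    LOST = "LOST"


# Transitions described in the issue text: RUNNING -> UNKNOWN,
# UNKNOWN -> RUNNING, and optionally UNKNOWN -> LOST.
TRANSITIONS = {
    TaskState.RUNNING: {TaskState.UNKNOWN},
    TaskState.UNKNOWN: {TaskState.RUNNING, TaskState.LOST},
    TaskState.LOST: set(),
}


class PartitionTracker:
    """Hypothetical per-task tracker; illustrative only, not part of Mesos."""

    def __init__(self, report_after, max_pings):
        # report_after: failed pings before the new state is reported.
        # max_pings: failed pings before the task is marked lost.
        if not 0 < report_after < max_pings:
            raise ValueError("need 0 < report_after < max_pings")
        self.report_after = report_after
        self.max_pings = max_pings
        self.missed = 0
        self.state = TaskState.RUNNING

    def _move(self, new_state):
        # Reject any transition the issue text does not describe.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def ping_missed(self):
        """Record one failed ping and return the resulting state."""
        if self.state is TaskState.LOST:
            return self.state
        self.missed += 1
        if self.state is TaskState.RUNNING and self.missed >= self.report_after:
            self._move(TaskState.UNKNOWN)
        elif self.state is TaskState.UNKNOWN and self.missed >= self.max_pings:
            self._move(TaskState.LOST)
        return self.state

    def agent_reconnected(self):
        """Agent is reachable again: UNKNOWN goes back to RUNNING."""
        self.missed = 0
        if self.state is TaskState.UNKNOWN:
            self._move(TaskState.RUNNING)
        return self.state
```

With `PartitionTracker(report_after=2, max_pings=5)`, the second failed ping moves the task to UNKNOWN, a reconnect before the fifth returns it to RUNNING, and five consecutive failures end in LOST. Whether the threshold should be per framework or per task is left open here, as it is in the issue.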