Neil Conway created MESOS-4049:
----------------------------------

             Summary: Allow user to control behavior of partitioned agents/tasks
                 Key: MESOS-4049
                 URL: https://issues.apache.org/jira/browse/MESOS-4049
             Project: Mesos
          Issue Type: Improvement
          Components: master, slave
            Reporter: Neil Conway


At present, if an agent is partitioned away from the master, the master waits 
for a period of time (see MESOS-4048) before deciding that the agent is dead. 
It then marks the agent as lost, sends {{TASK_LOST}} messages for all the tasks 
running on the agent, and instructs the agent to shut down.

Although this behavior is desirable for many users, it is not ideal for 
everyone. For example:

* Some users might want to aggressively start a new replacement task (e.g., 
after one or two ping timeouts are missed); then when the old copy of the task 
comes back, they might want to make an intelligent decision about how to 
reconcile this situation (e.g., kill old, kill new, allow both to continue 
running).
* Some frameworks might want different behavior from other frameworks, or to 
treat some tasks differently from other tasks. For example, if a task has a 
huge amount of state that would need to be regenerated to spin up another 
instance, the user might want to wait longer before starting a new task to 
increase the chance that the old task will reappear.
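
As a rough illustration of the two cases above, a per-task policy might look 
something like the sketch below. Nothing here is an existing Mesos API: 
{{PartitionPolicy}}, its fields, and {{should_replace}} are hypothetical names 
chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionPolicy:
    """Hypothetical per-task policy; not part of any Mesos API."""

    # Number of missed pings to tolerate before starting a replacement task.
    missed_pings_before_replace: int
    # What to do if the old copy reappears: "kill_old", "kill_new" or
    # "keep_both" (the three options mentioned in the first bullet).
    on_reappear: str = "kill_old"


def should_replace(policy: PartitionPolicy, missed_pings: int) -> bool:
    """Return True once enough pings have been missed to justify a new task."""
    return missed_pings >= policy.missed_pings_before_replace


# A cheap, stateless task: replace it aggressively.
aggressive = PartitionPolicy(missed_pings_before_replace=1)
# A task with a large amount of state: wait much longer for it to reappear.
patient = PartitionPolicy(missed_pings_before_replace=20, on_reappear="kill_new")
```

With these assumed values, {{should_replace(aggressive, 1)}} is true while 
{{should_replace(patient, 5)}} is still false, so the two tasks would be 
treated differently during the same partition.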

To do this, we'd need to change task state so that a task can go from 
{{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from 
that state back to {{RUNNING}} (or perhaps we could keep the current 
"mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} 
could also transition to {{LOST}}). The agent would also keep its old 
{{slaveId}} when it reconnects.
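
The transitions described above can be sketched as a small table. This is 
only an illustration of the proposal, not Mesos code: the {{UNKNOWN}} state 
name is not settled, and {{TaskState}}, {{ALLOWED}} and {{transition}} are 
hypothetical names. Only the transitions named in this ticket are modelled.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"  # proposed new state; could equally be WANDERING
    LOST = "LOST"


# Transitions named in the proposal. UNKNOWN -> LOST is the optional
# "mark-lost-after-timeout" path; UNKNOWN -> RUNNING is the reconnect path.
ALLOWED = {
    TaskState.RUNNING: {TaskState.UNKNOWN},
    TaskState.UNKNOWN: {TaskState.RUNNING, TaskState.LOST},
    TaskState.LOST: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

For example, a task on a partitioned agent would move {{RUNNING}} to 
{{UNKNOWN}}, and then back to {{RUNNING}} if the agent reconnects, or on to 
{{LOST}} if the timeout option is kept and expires.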



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
