[ 
https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037116#comment-15037116
 ] 

Neil Conway commented on MESOS-4049:
------------------------------------

I'm not sure {{ZOMBIE}} accurately describes the intended behavior -- for 
example, in Unix a zombie process cannot come back to life. A zombie process is 
definitely dead (it just hasn't been properly cleaned up), whereas in this case 
the true state of the task is not known (to the master/framework).

> Allow user to control behavior of partitioned agents/tasks
> ----------------------------------------------------------
>
>                 Key: MESOS-4049
>                 URL: https://issues.apache.org/jira/browse/MESOS-4049
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>              Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits 
> for a period of time (see MESOS-4048) before deciding that the agent is dead. 
> Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the 
> tasks running on the agent, and instructs the agent to shut down.
> Although this behavior is desirable for some/many users, it is not ideal for 
> everyone. For example:
> * Some users might want to aggressively start a new replacement task (e.g., 
> after one or two ping timeouts are missed); then when the old copy of the 
> task comes back, they might want to make an intelligent decision about how to 
> reconcile this situation (e.g., kill old, kill new, allow both to continue 
> running).
> * Some frameworks might want different behavior from other frameworks, or to 
> treat some tasks differently from other tasks. For example, if a task has a 
> huge amount of state that would need to be regenerated to spin up another 
> instance, the user might want to wait longer before starting a new task to 
> increase the chance that the old task will reappear.
> To do this, we'd need to change task state so that a task can go from 
> {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from 
> that state back to {{RUNNING}} (or perhaps we could keep the current 
> "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} 
> could also transition to {{LOST}}). The agent would also keep its old 
> {{slaveId}} when it reconnects.
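
The state transitions proposed in the description can be sketched as a small table of allowed moves. This is only an illustration, not Mesos code: the names {{TaskState}}, {{ALLOWED}}, and {{transition}} are hypothetical, {{UNKNOWN}} is just one of the candidate names floated above, and the {{UNKNOWN}} to {{LOST}} edge assumes the "mark-lost-after-timeout" option is kept.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"  # hypothetical name; the description also suggests WANDERING
    LOST = "LOST"


# Allowed transitions as proposed in the description (an assumption, not Mesos code).
# Today's direct RUNNING -> LOST path is deliberately left out of this sketch.
ALLOWED = {
    TaskState.RUNNING: {TaskState.UNKNOWN},
    TaskState.UNKNOWN: {TaskState.RUNNING, TaskState.LOST},
    TaskState.LOST: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


# Agent is partitioned away, then reconnects with its old slaveId.
state = transition(TaskState.RUNNING, TaskState.UNKNOWN)
state = transition(state, TaskState.RUNNING)
```

Unlike a Unix zombie, a task in the intermediate state can return to {{RUNNING}}, which is why the sketch treats it as "unknown" rather than dead.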



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
