[ https://issues.apache.org/jira/browse/MESOS-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-783:
----------------------------------

    Issue Type: Sub-task  (was: Bug)
        Parent: MESOS-764

> Master::killTask must not answer with TASK_LOST when the task is unknown.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-783
>                 URL: https://issues.apache.org/jira/browse/MESOS-783
>             Project: Mesos
>          Issue Type: Sub-task
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.15.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Critical
>              Labels: twitter
>             Fix For: 0.18.0
>
>
> When the Master is asked to kill a task, and it knows the framework but
> cannot find the TaskID, it replies with TASK_LOST.
> This is normally fine, but consider a Master that has just failed over:
>   --> Master fails over.
>   --> Framework F re-registers.
>   --> Slave with Task T in TASK_RUNNING has not yet re-registered.
>   --> Master::killTask(F, T) cannot find T and replies with TASK_LOST.
>   --> Slave re-registers with Task T in TASK_RUNNING.
>   --> The framework has now been told that T is LOST, yet T is still RUNNING.
> The simple fix is to not reply in such cases and to rely on a later
> reconciliation request.
> With a stateful master (MESOS-764), we can reliably reply with TASK_LOST if
> the slave is not in the Registrar; otherwise we must remain silent, because
> the slave may still re-register with the correct state of the task. Ideally
> we would hold the kill task message and send it to the slave once it
> re-registers, but this is somewhat complicated to implement, and
> reconciliation can cover this case.
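
The decision rule described above can be sketched as a small function. This is only an illustration in Python, not the Mesos C++ implementation: the function name, the return values, and the assumption that the caller can supply the slave ID for an unknown task are all hypothetical.

```python
from typing import Dict, Optional, Set

TASK_LOST = "TASK_LOST"
KILL_FORWARDED = "KILL_FORWARDED"  # hypothetical marker: kill was passed on to the slave


def kill_task_reply(
    task_id: str,
    slave_id: Optional[str],
    known_tasks: Dict[str, str],
    registrar_slaves: Optional[Set[str]],
) -> Optional[str]:
    """Decide what the master tells the framework for a kill request.

    known_tasks maps task ID -> slave ID for tasks the master currently knows.
    registrar_slaves is the set of slaves recorded in the Registrar, or None
    when there is no stateful master (assumption made for this sketch).
    Returns None when the master should stay silent.
    """
    if task_id in known_tasks:
        # Normal path: the task is known, so the kill goes to its slave.
        return KILL_FORWARDED

    if registrar_slaves is None or slave_id is None:
        # Without persistent state we cannot tell "lost" from
        # "slave has not re-registered yet", so say nothing and
        # leave it to a later reconciliation request.
        return None

    if slave_id not in registrar_slaves:
        # The slave is gone for good, so TASK_LOST is reliable.
        return TASK_LOST

    # The slave is in the Registrar and may still re-register with the
    # task in TASK_RUNNING; replying TASK_LOST here would be wrong.
    return None


# Failover scenario from the description: T runs on slave S1, which is in
# the Registrar but has not re-registered with the new master yet.
print(kill_task_reply("T", "S1", {}, {"S1"}))
# After S1 re-registers and reports T.
print(kill_task_reply("T", "S1", {"T": "S1"}, {"S1"}))
```

In this sketch the master stays silent in the first call and only acts once the slave has reported the task, which is the behaviour the issue asks for.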



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
