[ https://issues.apache.org/jira/browse/MESOS-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-783:
----------------------------------

    Issue Type: Sub-task  (was: Bug)
        Parent: MESOS-764

> Master::killTask must not answer with TASK_LOST when the task is unknown.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-783
>                 URL: https://issues.apache.org/jira/browse/MESOS-783
>             Project: Mesos
>          Issue Type: Sub-task
>    Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.15.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Critical
>              Labels: twitter
>             Fix For: 0.18.0
>
>
> When the Master is asked to kill a task, knows of the framework, but
> cannot locate the TaskID, it replies with TASK_LOST.
> This is normally OK; however, consider a failed-over Master:
> --> Master fails over.
> --> Framework F re-registers.
> --> Slave with Task T in TASK_RUNNING has not yet re-registered.
> --> Master::killTask(F, T) cannot find T and replies with TASK_LOST.
> --> Slave re-registers with Task T in TASK_RUNNING.
> --> The framework has now been told the task is LOST, but the task is
>     left RUNNING.
> The simple fix is to not reply in such cases and to rely on a later
> reconciliation request.
> With a stateful master (MESOS-764), we can reliably reply with
> TASK_LOST if the slave is not in the Registrar; otherwise we must remain
> silent, because the slave may still re-register with the correct state of
> the task. Ideally we would hold the kill task message for the slave and
> send it once the slave re-registers, but this is somewhat complicated to
> implement, and reconciliation can cover this case.
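
The reply rule described in the issue can be sketched as a small decision function. This is only an illustration under stated assumptions: `KillReply`, `KillContext`, and `decideKillReply` are hypothetical names and are not part of the Mesos code base, the framework is assumed to be known to the master already, and the master is assumed to be able to tell whether the task's slave is in the Registrar.

```cpp
// Hypothetical outcome of a kill request; these names are not Mesos APIs.
enum class KillReply {
  ForwardKillToSlave,  // Task is known: pass the kill on to its slave.
  ReplyTaskLost,       // Safe to tell the framework the task is lost.
  StaySilent           // Send nothing; rely on a later reconciliation.
};

// Hypothetical summary of what the master knows when the kill arrives.
// The framework itself is assumed to be registered with the master.
struct KillContext {
  bool taskKnown;         // The master can locate the TaskID.
  bool hasRegistrar;      // A stateful master (MESOS-764) is available.
  bool slaveInRegistrar;  // Only meaningful when hasRegistrar is true.
};

KillReply decideKillReply(const KillContext& ctx) {
  if (ctx.taskKnown) {
    return KillReply::ForwardKillToSlave;
  }

  // Unknown task: TASK_LOST is only reliable when the Registrar confirms
  // that the slave is gone and therefore cannot re-register with the task.
  if (ctx.hasRegistrar && !ctx.slaveInRegistrar) {
    return KillReply::ReplyTaskLost;
  }

  // Otherwise the slave may still re-register with the task running,
  // so the master stays silent instead of guessing.
  return KillReply::StaySilent;
}
```

In the failover scenario above, the task is unknown and the slave has not yet re-registered, so this sketch stays silent rather than sending TASK_LOST, and the framework learns the real task state from reconciliation.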