[ https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154012#comment-14154012 ]
Benjamin Mahler commented on MESOS-1696:
----------------------------------------

https://reviews.apache.org/r/26202/
https://reviews.apache.org/r/26206/
https://reviews.apache.org/r/26207/
https://reviews.apache.org/r/26208/

> Improve reconciliation between master and slave.
> ------------------------------------------------
>
>                 Key: MESOS-1696
>                 URL: https://issues.apache.org/jira/browse/MESOS-1696
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> As we update the Master to keep tasks in memory until they are both terminal
> and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as
> follows:
> {code}
> Master           Slave
>  {}               {}
> {Tn}              {}  // Master receives Task T, non-terminal. Forwards to slave.
> {Tn}             {Tn} // Slave receives Task T, non-terminal.
> {Tn}             {Tt} // Task becomes terminal on slave. Update forwarded.
> {Tt}             {Tt} // Master receives update, forwards to framework.
>  {}              {Tt} // Master receives ack, forwards to slave.
>  {}               {}  // Slave receives ack.
> {code}
> In the current form of reconciliation, the slave sends to the master all
> tasks that are not both terminal and acknowledged. At any point in the above
> lifecycle, the slave's re-registration message can reach the master.
> Note the following properties:
> *(1)* The master may have a non-terminal task, not present in the slave's
> re-registration message.
> *(2)* The master may have a non-terminal task, present in the slave's
> re-registration message but in a different state.
> *(3)* The slave's re-registration message may contain a terminal
> unacknowledged task unknown to the master.
> In the current master / slave
> [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
> code, the master assumes that case (1) is because a launch task message was
> dropped, and it sends TASK_LOST. We've seen above that (1) can happen even
> when the task reaches the slave correctly, so this can lead to inconsistency!
> After chatting with [~vinodkone], we're considering updating the
> reconciliation to occur as follows:
> → Slave sends all tasks that are not both terminal and acknowledged, during
> re-registration. This is the same as before.
> → If the master sees tasks that are missing in the slave, the master sends
> the slave the list of tasks that need to be reconciled. This can be
> piggy-backed on the re-registration message.
> → The slave will send TASK_LOST if the task is not known to it. Preferably
> this is sent with retries, unless we update socket closure on the slave to
> force a re-registration.
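
For illustration, the proposed flow can be sketched roughly as below. This is only a toy model in Python, not Mesos code: the function names, the dictionary representation of tasks, and the state strings are all hypothetical, and retries are not modelled. It shows why letting the slave decide on TASK_LOST avoids the inconsistency in case (1): the slave answers from its state at the time the reconcile request arrives, not from the possibly stale re-registration message.

```python
NON_TERMINAL = "non-terminal"
TERMINAL = "terminal"


def tasks_to_reconcile(master_tasks, reregistration_tasks):
    """Master side (hypothetical helper): non-terminal tasks the master
    knows about that are absent from the slave's re-registration
    message, i.e. case (1)."""
    return sorted(
        task_id
        for task_id, state in master_tasks.items()
        if state == NON_TERMINAL and task_id not in reregistration_tasks
    )


def slave_reconcile_reply(slave_tasks, requested_ids):
    """Slave side (hypothetical helper): report TASK_LOST only for tasks
    the slave does not know when the reconcile request arrives."""
    return {
        task_id: "TASK_LOST"
        for task_id in requested_ids
        if task_id not in slave_tasks
    }


# Race from the lifecycle table: the slave re-registers while the launch
# of T is still in flight, so its message does not mention T.
master = {"T": NON_TERMINAL}
reregistration = {}
pending = tasks_to_reconcile(master, reregistration)

# The launch arrived before the reconcile request: nothing is reported lost.
launch_arrived = slave_reconcile_reply({"T": NON_TERMINAL}, pending)

# The launch really was dropped: the slave itself reports TASK_LOST.
launch_dropped = slave_reconcile_reply({}, pending)

print(pending, launch_arrived, launch_dropped)
```

In the current scheme the master would have sent TASK_LOST in both situations, because it only sees the re-registration message; in this sketch only the second situation produces TASK_LOST.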