Benjamin Bannier created MESOS-10018:
----------------------------------------
Summary: Duplicate tasks if agent partitioned during maintenance down period
Key: MESOS-10018
URL: https://issues.apache.org/jira/browse/MESOS-10018
Project: Mesos
Issue Type: Bug
Reporter: Benjamin Bannier
When the master starts maintenance for a node, it
(1) sends a {{ShutdownMessage}} to the agent, and
(2) removes the agent, which transitions all of its tasks to {{TASK_LOST}} and
moves them to the completed task set.
If the {{ShutdownMessage}} isn't fully processed on the agent (e.g., the message
is dropped between (1) and (2), or the agent process is killed before the
executor has shut down), the agent could come back with the lost task still
running. It would report the task when registering with the master, which would
add it to the list of active tasks. As a result, the same task could be both
completed and active.
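
The sequence can be illustrated with a small toy model of the master's bookkeeping. This is only a sketch and not Mesos code: {{ToyMaster}}, its methods, and the task ID {{task-1}} are hypothetical names made up for illustration, and the model assumes the master does not consult the completed set when an agent registers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's task bookkeeping."""

    active: dict = field(default_factory=dict)     # task ID -> current state
    completed: dict = field(default_factory=dict)  # task ID -> terminal state

    def start_maintenance(self, agent_tasks):
        # Step (1), sending the shutdown message, is not modeled here: the
        # scenario assumes the agent never finishes processing it.
        # Step (2): remove the agent, marking its tasks lost and completed.
        for task_id in agent_tasks:
            self.active.pop(task_id, None)
            self.completed[task_id] = "TASK_LOST"

    def register_agent(self, running_tasks):
        # Assumption: tasks reported by the agent are added as active
        # without checking whether they were already completed.
        for task_id in running_tasks:
            self.active[task_id] = "TASK_RUNNING"

    def duplicates(self):
        # Task IDs that are tracked as both active and completed.
        return sorted(self.active.keys() & self.completed.keys())


master = ToyMaster(active={"task-1": "TASK_RUNNING"})
master.start_maintenance(["task-1"])  # task-1 becomes TASK_LOST, completed
master.register_agent(["task-1"])     # agent returns with task-1 still running
print(master.duplicates())
```

In this model {{task-1}} ends up in both sets, which mirrors the inconsistency described above.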
--
This message was sent by Atlassian Jira (v8.3.4#803005)