[ https://issues.apache.org/jira/browse/MESOS-1817 ]

Vinod Kone reassigned MESOS-1817:
---------------------------------

    Assignee: Vinod Kone

> Completed tasks remain in TASK_RUNNING when framework is disconnected
> ----------------------------------------------------------------------
>
>                 Key: MESOS-1817
>                 URL: https://issues.apache.org/jira/browse/MESOS-1817
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Niklas Quarfot Nielsen
>            Assignee: Vinod Kone
>
> We have run into a problem that causes tasks which complete while a
> framework is disconnected (and has a failover timeout set) to remain in a
> running state even though the tasks have actually finished. This hogs the
> cluster and gives users an inconsistent view of the cluster state: going
> to the slave, the task is finished; going to the master, the task is still
> in a non-terminal state. When the scheduler reattaches or the failover
> timeout expires, the tasks finish correctly. The scheduler in question
> uses a long failover timeout, but may, on the other hand, never reattach.
>
> Here is a test framework we have been able to reproduce the issue with:
> https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (1 second sleep), and when the
> framework instance is killed, the master reports the tasks as running even
> after several minutes:
> http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> When clicking on one of the slaves where, for example, task 49 runs, the
> slave knows that it completed:
> http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
>
> Here is the log of a mesos-local instance where I reproduced it:
> https://gist.github.com/nqn/f7ee20601199d70787c0
> (Here, tasks 10 to 19 are stuck in the running state.) There is a lot of
> output, so here is a filtered log for task 10:
> https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
>
> The problem turns out to be an issue with the ack cycle of status updates:
> if the framework disconnects (with a failover timeout set), the status
> update manager on the slave will keep retrying the front of the status
> update stream to the master (which in turn forwards it to the framework).
> If the first status update after the disconnect is terminal, things work
> out fine: the master picks the terminal state up, removes the task, and
> releases the resources. If, on the other hand, a non-terminal status
> update is at the front of the stream, the master will never learn that the
> task finished (or failed) before the framework reconnects.
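>
> To make the failure mode concrete, here is a minimal sketch of the retry
> behavior described above (plain, self-contained C++ rather than Mesos
> code; the types and names are made up for illustration):
>
>     // Each task has a stream of unacknowledged status updates; on every
>     // retry timeout, only the front of the stream is (re)sent to the
>     // master.
>     #include <deque>
>     #include <iostream>
>
>     enum class TaskState { RUNNING, FINISHED };
>
>     const char* name(TaskState s) {
>       return s == TaskState::FINISHED ? "TASK_FINISHED" : "TASK_RUNNING";
>     }
>
>     int main() {
>       // TASK_RUNNING was in flight when the framework disconnected, so
>       // it was never acknowledged; TASK_FINISHED is queued behind it.
>       std::deque<TaskState> pending = {TaskState::RUNNING,
>                                        TaskState::FINISHED};
>
>       // Every retry resends the front: the master keeps seeing
>       // TASK_RUNNING and never learns that the task actually finished.
>       for (int retry = 0; retry < 3; ++retry) {
>         std::cout << "retry " << retry << ": forwarding "
>                   << name(pending.front()) << std::endl;
>       }
>     }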
>
> During a discussion on the dev mailing list
> (http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
> we enumerated a couple of options for solving this problem.
>
> First off, having two ack cycles (one between masters and slaves, and one
> between masters and frameworks) would be ideal: we would be able to replay
> the statuses in order while keeping the master state current. However,
> this requires us to persist the master state in replicated storage.
>
> As a first pass, we can make sure that tasks caught in a running state
> don't hog the cluster when they have completed while the framework is
> disconnected.
>
> Here is a proof-of-concept to work out of:
> https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/
>
> A new (optional) field has been added to the internal status update
> message:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
> This makes it possible for the status update manager to set the field when
> the latest status in the stream is terminal (see the sketch at the end of
> this message):
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
>
> I added a test which should highlight the issue as well:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
>
> I would love some input on the approach before moving on. There are rough
> edges in the PoC which (of course) should be addressed before bringing it
> up for review.
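>
> As a rough sketch of that idea (again plain C++ rather than the PoC's
> actual protobuf/C++ changes; the field and function names below are
> invented for illustration): the slave keeps resending the front of the
> stream as before, but additionally attaches the latest state when it is
> terminal, so the master can mark the task completed and release its
> resources without waiting for the framework:
>
>     #include <deque>
>     #include <optional>
>
>     enum class TaskState { RUNNING, FINISHED, FAILED };
>
>     bool isTerminal(TaskState s) {
>       return s == TaskState::FINISHED || s == TaskState::FAILED;
>     }
>
>     // Stand-in for the internal status update message;
>     // 'latest_terminal_state' plays the role of the new optional field.
>     struct StatusUpdateMessage {
>       TaskState update;                                // front of stream
>       std::optional<TaskState> latest_terminal_state;  // new field
>     };
>
>     // Assumes a non-empty stream of unacknowledged updates, oldest
>     // first.
>     StatusUpdateMessage forward(const std::deque<TaskState>& pending) {
>       StatusUpdateMessage message;
>       message.update = pending.front();  // unchanged retry behavior
>
>       // If the most recent status is terminal, say so, letting the
>       // master complete the task even though the front update is not.
>       if (isTerminal(pending.back())) {
>         message.latest_terminal_state = pending.back();
>       }
>       return message;
>     }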