[ https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone reassigned MESOS-1817:
---------------------------------

    Assignee: Vinod Kone

> Completed tasks remain in TASK_RUNNING when framework is disconnected
> ---------------------------------------------------------------------
>
>                 Key: MESOS-1817
>                 URL: https://issues.apache.org/jira/browse/MESOS-1817
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Niklas Quarfot Nielsen
>            Assignee: Vinod Kone
>
> We have run into a problem that causes tasks which complete while a 
> framework is disconnected (with a failover timeout set) to remain in a 
> running state even though the tasks actually finished. This hogs the 
> cluster and gives users an inconsistent view of the cluster state. Going 
> to the slave, the task is finished; going to the master, the task is still 
> in a non-terminal state. When the scheduler reattaches or the failover 
> timeout expires, the tasks finish correctly. The current workflow of this 
> scheduler has a long failover timeout, but it may, on the other hand, 
> never reattach.
> Here is a test framework with which we have been able to reproduce the 
> issue: https://gist.github.com/nqn/9b9b1de9123a6e836f54
> It launches many short-lived tasks (a one-second sleep each), and after 
> killing the framework instance, the master still reports the tasks as 
> running even after several minutes: 
> http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
> Clicking on one of the slaves where, for example, task 49 ran shows that 
> the slave knows it completed: 
> http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
> Here is the log of a mesos-local instance where I reproduced the issue: 
> https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10 to 19 are 
> stuck in the running state).
> There is a lot of output, so here is a filtered log for task 10: 
> https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
> The problem turns out to be an issue with the ack-cycle of status updates: 
> if the framework disconnects (with a failover timeout set), the status 
> update manager on the slave will keep trying to send the front of the 
> status update stream to the master (which in turn forwards it to the 
> framework). If the first status update after the disconnect is terminal, 
> things work out fine: the master picks the terminal state up, removes the 
> task, and releases the resources.
> If, on the other hand, a non-terminal status sits at the front of the 
> stream, the master will never learn that the task finished (or failed) 
> before the framework reconnects.
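> To make the failure mode concrete, here is a minimal standalone sketch 
> (illustrative types only, not the actual Mesos code) of why a non-terminal 
> update at the front of the stream blocks the terminal one queued behind it:
>
>     // Sketch: the slave's status update manager retries only the front
>     // of a task's status update stream until it is acknowledged. With
>     // the framework disconnected, no ack ever arrives, so a terminal
>     // update queued behind a non-terminal one never reaches the master.
>     #include <deque>
>     #include <iostream>
>
>     enum class TaskState { TASK_RUNNING, TASK_FINISHED };
>
>     struct StatusUpdate {
>       TaskState state;
>     };
>
>     int main() {
>       // TASK_RUNNING was queued before the framework disconnected; the
>       // terminal TASK_FINISHED arrived afterwards.
>       std::deque<StatusUpdate> stream = {
>           {TaskState::TASK_RUNNING}, {TaskState::TASK_FINISHED}};
>
>       bool frameworkConnected = false;
>
>       // Only the front is (re)sent; nothing behind it moves until an ack.
>       const StatusUpdate& front = stream.front();
>       std::cout << "forwarding "
>                 << (front.state == TaskState::TASK_RUNNING
>                         ? "TASK_RUNNING" : "TASK_FINISHED")
>                 << " to master\n";
>
>       if (!frameworkConnected) {
>         // No framework => no ack => the front is never popped, and the
>         // master keeps showing the task as running.
>         std::cout << "no ack; TASK_FINISHED stays buffered on the slave\n";
>       }
>     }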
> During a discussion on the dev mailing list 
> (http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
>  we enumerated a couple of options to solve this problem.
> First off, having two ack-cycles (one between masters and slaves and one 
> between masters and frameworks) would be ideal: we would be able to replay 
> the statuses in order while keeping the master state current. However, 
> this requires us to persist the master state in replicated storage.
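> As a rough illustration of that option, here is a self-contained sketch 
> (hypothetical types and member names, not a proposed Mesos API) of how the 
> two cycles could decouple the slave's stream from the framework's 
> acknowledgements:
>
>     // Sketch: with two ack-cycles, the master acks the slave as soon as
>     // the update is persisted, and replays updates to the framework in
>     // order, so master state stays current while the framework is away.
>     #include <iostream>
>     #include <queue>
>     #include <string>
>     #include <vector>
>
>     struct StatusUpdate {
>       std::string task;
>       std::string state;
>     };
>
>     struct Master {
>       std::vector<StatusUpdate> replicatedLog;  // assumed durable storage
>       std::queue<StatusUpdate> pendingFrameworkAcks;
>
>       // Cycle 1: master <-> slave. The return value stands in for the
>       // ack to the slave, which can then pop its stream and move on.
>       bool onSlaveUpdate(const StatusUpdate& update) {
>         replicatedLog.push_back(update);
>         pendingFrameworkAcks.push(update);
>         return true;
>       }
>
>       // Cycle 2: master <-> framework, driven by the framework's own
>       // acknowledgements (replayed in order after a reconnect).
>       void onFrameworkAck() {
>         if (!pendingFrameworkAcks.empty()) {
>           pendingFrameworkAcks.pop();
>         }
>       }
>     };
>
>     int main() {
>       Master master;
>       master.onSlaveUpdate({"task-10", "TASK_RUNNING"});
>       master.onSlaveUpdate({"task-10", "TASK_FINISHED"});
>       // The slave's stream drains regardless of the framework; the
>       // master already knows the task finished and can release its
>       // resources.
>       std::cout << master.replicatedLog.back().state << std::endl;
>     }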
> As a first pass, we can make sure that tasks caught in a running state 
> don't hog the cluster when they complete while the framework is 
> disconnected.
> Here is a proof of concept to work from: 
> https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/
> A new (optional) field has been added to the internal status update message:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68
> This makes it possible for the status update manager to set the field if 
> the latest status was terminal: 
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
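> The idea, in a simplified standalone form (field and helper names here are 
> illustrative, not the exact ones in the branch): when forwarding the front 
> of the stream, also attach the stream's latest state, so the master can 
> release resources for tasks whose latest status is terminal even though 
> the forwarded update is not:
>
>     #include <deque>
>     #include <iostream>
>     #include <string>
>
>     struct StatusUpdate {
>       std::string state;        // e.g. "TASK_RUNNING"
>       std::string latestState;  // stands in for the new optional field
>     };
>
>     bool isTerminal(const std::string& state) {
>       return state == "TASK_FINISHED" || state == "TASK_FAILED" ||
>              state == "TASK_KILLED" || state == "TASK_LOST";
>     }
>
>     int main() {
>       // A non-terminal update sits at the front of the stream, but the
>       // task has in fact finished.
>       std::deque<StatusUpdate> stream = {{"TASK_RUNNING", ""},
>                                          {"TASK_FINISHED", ""}};
>
>       // Before forwarding the front, record the stream's latest state
>       // in the optional field when that state is terminal.
>       StatusUpdate forwarded = stream.front();
>       if (isTerminal(stream.back().state)) {
>         forwarded.latestState = stream.back().state;
>       }
>
>       // The master can free the task's resources based on latestState,
>       // while the full stream is still replayed to the framework later.
>       std::cout << "forwarding " << forwarded.state
>                 << " (latest state: " << forwarded.latestState << ")\n";
>     }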
> I added a test which should highlight the issue as well:
> https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478
> I would love some input on the approach before moving on.
> There are rough edges in the PoC which (of course) should be addressed 
> before bringing it up for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
