Hi guys,

We have run into a problem where tasks that complete while their framework is disconnected (and has a failover timeout set) remain in a running state, even though the tasks have actually finished.
Here is a test framework with which we have been able to reproduce the issue: https://gist.github.com/nqn/9b9b1de9123a6e836f54

It launches many short-lived tasks (each one sleeps for 1 second). When we kill the framework instance, the master still reports the tasks as running, even after several minutes: http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

If we click on one of the slaves, for example the one where task 49 ran, that slave knows the task has completed: http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

The tasks only finish once the framework reconnects, which it may never do. We see this on Mesos 0.20.0, and it also applies to HEAD (as of today).

Do you guys have any insight into what may be going on here? Is this by design or a bug?

Thanks,
Niklas