Hi guys,

We have run into a problem that causes tasks which complete while a
framework is disconnected (and has a failover timeout set) to remain in the
running state, even though the tasks have actually finished.

Here is a test framework we have been able to reproduce the issue with:
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep). After we kill the
framework instance, the master still reports the tasks as running, even
after several minutes:
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

Clicking through to one of the slaves, for example the one where task 49
ran, shows that the slave knows the task completed:
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

The master only shows the tasks as finished once the framework reconnects
(which it may never do). This is on Mesos 0.20.0, but it also applies to
HEAD (as of today). Do you guys have any insight into what may be going on
here? Is this by design or a bug?

Thanks,
Niklas
