Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
Okay - that only solves half of the problem for us: users will still see their frameworks as running even though they completed, but it is a first step. Let's continue the discussion in a JIRA ticket; I'll create one shortly. Thanks for helping out!

Niklas
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
To ensure that the architecture of Mesos remains a scalable one, we want to persist state in the leaves of the system as much as possible. This is why the master has never persisted tasks, task states, or status updates. Note that status updates can currently contain arbitrarily large amounts of data (an example of the impact: https://issues.apache.org/jira/browse/MESOS-1746).

Adam, I think solution (2) you listed above, of including a terminal indicator in the _private_ message between slave and master, would easily allow us to release the resources at the correct time. We would still hold the correct task state in the master, and would maintain the status update stream invariants for frameworks (guaranteed ordered delivery). This would be simpler to implement with my recent change here, because you no longer have to remove the task to release the resources: https://reviews.apache.org/r/25568/

Longer term, adding pipelining of status updates would be a nice improvement (similar to what you listed in (1) above). But as Vinod said, it will require care to ensure we maintain the stream invariants for frameworks (i.e. we would probably need to send multiple updates in one message).

How does this sound?
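The terminal-indicator idea (option 2) can be sketched in a few lines of Python. All names here are illustrative, not the actual Mesos C++ classes or protobuf fields: the slave still forwards only the oldest unacknowledged update, but a hypothetical `latest_state` field rides along in the private slave-to-master message, so the master can release resources early while the visible task state keeps following the ordered update stream.

```python
# Illustrative sketch of option (2): forward the oldest unacked update,
# annotated with the newest known state, so the master can free resources
# on a terminal state without breaking ordered delivery to the framework.
TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def make_message(pending):
    """Slave side: oldest unacked update plus the newest known state."""
    return {"update": pending[0], "latest_state": pending[-1]}

class Master:
    def __init__(self):
        self.task_state = None
        self.resources_released = False

    def receive(self, msg):
        self.task_state = msg["update"]        # stream semantics unchanged
        if msg["latest_state"] in TERMINAL:
            self.resources_released = True     # recover resources early

master = Master()
# Framework disconnected: TASK_RUNNING unacked, TASK_FINISHED queued behind it.
master.receive(make_message(["TASK_RUNNING", "TASK_FINISHED"]))
assert master.task_state == "TASK_RUNNING"  # last forwarded state still shown
assert master.resources_released            # but resources already freed
```

Note how this reproduces the behavior discussed below: the master keeps showing the non-terminal state of the stream, yet learns from the annotation that the task is actually done.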
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
Thanks for your input Ben! (Comments inlined)

On 15 September 2014 12:35, Benjamin Mahler benjamin.mah...@gmail.com wrote:

> Adam, I think solution (2) you listed above of including a terminal indicator in the _private_ message between slave and master would easily allow us to release the resources at the correct time. We would still hold the correct task state in the master, and would maintain the status update stream invariants for frameworks (guaranteed ordered delivery).

We should still hold the correct task state; meaning the actual state of the task on the slave? Then the auxiliary field should represent the last known status, which may not necessarily be terminal. For example, a staging update followed by a running update while the framework is disconnected will still show as staging - or am I missing something?

> Longer term, adding pipelining of status updates would be a nice improvement (similar to what you listed in (1) above). But as Vinod said, it will require care to ensure we maintain the stream invariants for frameworks (i.e. probably need to send multiple updates in 1 message). How does this sound?

Sounds great - we would love to help out with (2). Would you be up for shepherding such a change?
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen nik...@mesosphere.io wrote:

> We should still hold the correct task state; meaning the actual state of the task on the slave?

Correct, as in, just maintaining the existing task state semantics in the master (for correctness of reconciliation / status update ordering).

> Then the auxiliary field should represent last known status, which may not necessarily be terminal. For example, a staging update followed by a running update while the framework is disconnected will show as staging still - or am I missing something?

It would still show as staging; the running update would never be sent to the master, since the slave is not receiving acknowledgements. But if there were a terminal update, then the slave would set the auxiliary field in the StatusUpdateMessage and the master would know to release the resources. Plus, we get backwards compatibility.

> Sounds great - we would love to help out with (2). Would you be up for shepherding such a change?

Yep! The changes needed should be based off of https://reviews.apache.org/r/25568/ since it changes the resource releasing in the master.
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
The semantics of these changes would have an impact on the upcoming task reconciliation. @BenM: Can you chime in here on how this fits into the task reconciliation work that you've been leading?

On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon a...@mesosphere.io wrote:

> I agree with Niklas that if the executor has sent a terminal status update to the slave, then the task is done and the master should be able to recover those resources. Only sending the oldest status update to the master, especially in the case of framework failover, prevents these resources from being recovered in a timely manner. I see a couple of options for getting around this, each with their own disadvantages.
>
> 1) Send the entire status update stream to the master. Once the master sees the terminal status update, it will removeTask and recover the resources. Future resends of the update will be forwarded to the scheduler, but the master will ignore (with warning and invalid_update++ metrics) the subsequent updates as far as its own state for the removed task is concerned.
> Disadvantage: Potentially sends a lot of status update messages until the scheduler reregisters and acknowledges the updates.
> Disadvantage 2: Updates could be sent to the scheduler out of order if some updates are dropped between the slave and master.
>
> 2) Send only the oldest status update to the master, but with an annotation of the final/terminal state of the task, if any. That way the master can call removeTask to update its internal state for the task (and update the UI) and recover the resources for the task. While the scheduler is still down, the oldest update will continue to be resent and forwarded, but the master will ignore the update (with a warning as above) as far as its own internal state is concerned. When the scheduler reregisters, the update stream will be forwarded and acknowledged one-at-a-time as before, guaranteeing status update ordering to the scheduler.
> Disadvantage: Seems a bit hacky to tack a terminal state onto a running update.
> Disadvantage 2: The state endpoint won't show all the status updates until the entire stream actually gets forwarded and acknowledged.
>
> Thoughts?
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
Definitely relevant. If the master could be trusted to persist all the task status updates, then they could be queued up at the master instead of the slave, once the master has acknowledged their receipt. Then the master could have the most up-to-date task state and could recover the resources as soon as it receives a terminal update, even if the framework is disconnected and unable to receive/ack the status updates. Then, once the framework reconnects, the master would be responsible for sending its queued status updates. We would still need a queue on the slave side, but only for updates that the master has not persisted and acked, primarily during the scenario when the slave is disconnected from the master.
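The master-side persistence idea above can be sketched roughly as follows. This is a hedged Python sketch under the stated assumption that the master has durable storage for updates; the class and method names are hypothetical, not Mesos APIs. The master acks the slave as soon as an update is persisted, frees resources on a terminal update, and drains the ordered queue to the framework when it reregisters.

```python
# Sketch of a master that persists status updates, acks the slave
# immediately, and recovers resources on a terminal update even while
# the framework is disconnected. Names are illustrative only.
from collections import deque

TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

class PersistingMaster:
    def __init__(self):
        self.persisted = deque()        # durable, ordered update stream
        self.resources_released = False

    def receive_from_slave(self, update):
        self.persisted.append(update)   # persist first, then ack the slave
        if update in TERMINAL:
            self.resources_released = True
        return "ACK"                    # slave may now drop its copy

    def drain_to_framework(self):
        """On framework reregistration, forward the queued stream in order."""
        out = list(self.persisted)
        self.persisted.clear()
        return out

m = PersistingMaster()
assert m.receive_from_slave("TASK_RUNNING") == "ACK"
assert m.receive_from_slave("TASK_FINISHED") == "ACK"
assert m.resources_released                 # freed before framework returns
assert m.drain_to_framework() == ["TASK_RUNNING", "TASK_FINISHED"]
```

The slave-side queue mentioned above would sit in front of `receive_from_slave`, holding only updates the master has not yet acked.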
Completed tasks remains in TASK_RUNNING when framework is disconnected
Hi guys,

We have run into a problem that causes tasks which complete, while a framework is disconnected and has a failover time, to remain in a running state even though the tasks actually finish. Here is a test framework we have been able to reproduce the issue with: https://gist.github.com/nqn/9b9b1de9123a6e836f54

It launches many short-lived tasks (1 second sleep) and, when the framework instance is killed, the master reports the tasks as running even after several minutes: http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 runs, the slave knows that it completed: http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

The tasks only finish when the framework connects again (which it may never do). This is on Mesos 0.20.0, but it also applies to HEAD (as of today). Do you guys have any insights into what may be going on here? Is this by design or a bug?

Thanks,
Niklas
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (here, tasks 10 to 19 are stuck in the running state). There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

At first glance, it looks like the task can't be found when trying to forward the finished update, because the running update never got acknowledged before the framework disconnected. I may be missing something here.

Niklas
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
What you observed is expected because of the way the slave (specifically, the status update manager) operates. The status update manager only sends the next update for a task once the previous update (if it exists) has been acked. In your case, since TASK_RUNNING was not acked by the framework, the master doesn't know about the TASK_FINISHED update that is queued up by the status update manager. If the framework never comes back, i.e., the failover timeout elapses, the master shuts down the framework, which releases those resources.
Re: Completed tasks remains in TASK_RUNNING when framework is disconnected
The main reason is to keep the status update manager simple. It also makes it easy to enforce the order of updates to the master/framework in this model. If we allowed multiple updates for a task to be in flight, it would be really hard (impossible?) to ensure that we never deliver out-of-order updates in edge cases (failover, network partitions, etc.).

On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen nik...@mesosphere.io wrote:

Hey Vinod - thanks for chiming in!

Is there a particular reason for only having one status update in flight? To put it another way, isn't that behavior too strict, given that the master could present the most recent known state if the status update manager sent more than just the front of the stream? With very long failover timeouts, waiting for those tasks to disappear seems tedious and hogs the cluster.

Niklas
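Vinod's ordering concern can be illustrated with a toy lossy channel (illustrative only, not Mesos code): with several updates in flight at once, a single dropped-and-retried message reorders the stream, whereas with one update in flight a drop only delays it.

```python
def deliver(in_flight_updates, dropped):
    """Toy channel: updates sent concurrently; dropped ones are retried
    and arrive after everything else, so arrival order != send order."""
    arrived = [u for i, u in enumerate(in_flight_updates) if i not in dropped]
    retried = [u for i, u in enumerate(in_flight_updates) if i in dropped]
    return arrived + retried

# Two updates in flight, the first one dropped: the framework sees
# TASK_FINISHED before TASK_RUNNING -- an out-of-order delivery.
print(deliver(["TASK_RUNNING", "TASK_FINISHED"], dropped={0}))

# With only one update in flight at a time (the current design), a drop
# just delays the stream; the next update isn't sent until the ack.
print(deliver(["TASK_RUNNING"], dropped={0}))
```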
I agree with Niklas that if the executor has sent a terminal status update to the slave, then the task is done and the master should be able to recover those resources. Only sending the oldest status update to the master, especially in the case of framework failover, prevents these resources from being recovered in a timely manner. I see a couple of options for getting around this, each with its own disadvantages.

1) Send the entire status update stream to the master. Once the master sees the terminal status update, it calls removeTask and recovers the resources. Future resends of the updates will still be forwarded to the scheduler, but the master will ignore the subsequent updates (with a warning and an invalid_update++ metric) as far as its own state for the removed task is concerned.
Disadvantage: potentially sends a lot of status update messages until the scheduler reregisters and acknowledges the updates.
Disadvantage 2: updates could reach the scheduler out of order if some updates are dropped between the slave and master.

2) Send only the oldest status update to the master, but annotated with the final/terminal state of the task, if any. That way the master can call removeTask to update its internal state for the task (and the UI) and recover the task's resources. While the scheduler is still down, the oldest update will continue to be resent and forwarded, but the master will ignore it (with a warning, as above) as far as its own internal state is concerned. When the scheduler reregisters, the update stream will be forwarded and acknowledged one at a time as before, preserving status update ordering for the scheduler.
Disadvantage: it seems a bit hacky to tack a terminal state onto a running update.
Disadvantage 2: the state endpoint won't show all the status updates until the entire stream actually gets forwarded and acknowledged.

Thoughts?