Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-16 Thread Niklas Nielsen
Okay - that only solves half of the problem for us: users will still see
their frameworks as running even though they completed, but it is a first
step.

Let's continue the discussion in a JIRA ticket; I'll create one shortly.

Thanks for helping out!
Niklas



Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-15 Thread Benjamin Mahler
To ensure that the architecture of mesos remains a scalable one, we want to
persist state in the leaves of the system as much as possible. This is why
the master has never persisted tasks, task states, or status updates. Note
that status updates can contain arbitrarily large amounts of data at the
current time (example impact of this:
https://issues.apache.org/jira/browse/MESOS-1746).

Adam, I think solution (2) you listed above of including a terminal
indicator in the _private_ message between slave and master would easily
allow us to release the resources at the correct time. We would still hold
the correct task state in the master, and would maintain the status update
stream invariants for frameworks (guaranteed ordered delivery). This would
be simpler to implement with my recent change here, because you no longer
have to remove the task to release the resources:
https://reviews.apache.org/r/25568/

Longer term, adding pipelining of status updates would be a nice
improvement (similar to what you listed in (1) above). But as Vinod said,
it will require care to ensure we maintain the stream invariants for
frameworks (i.e. probably need to send multiple updates in 1 message).

How does this sound?
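The ordering concern behind "send multiple updates in 1 message" can be sketched as a toy model in Python (Mesos itself is C++, and none of these function names come from its codebase; this is only an illustration of the invariant):

```python
def send_individually(updates, drop=None):
    """Naive pipelining: each queued update goes in its own message. If a
    message is dropped and only its retransmission arrives, the scheduler
    observes the stream out of order."""
    drop = drop or set()
    arrived = [u for i, u in enumerate(updates) if i not in drop]
    arrived += [updates[i] for i in sorted(drop)]  # late retransmissions
    return arrived

def send_batched(updates):
    """Pipelining via one message carrying the ordered batch: the whole
    batch arrives (and is retried) as a unit, so order is preserved."""
    return list(updates)

stream = ["TASK_STAGING", "TASK_RUNNING", "TASK_FINISHED"]

# Dropping the second message breaks ordering under naive pipelining...
assert send_individually(stream, drop={1}) == [
    "TASK_STAGING", "TASK_FINISHED", "TASK_RUNNING"]
# ...but a single multi-update message keeps the stream invariant.
assert send_batched(stream) == stream
```

This is why pipelining "will require care": per-update messages reintroduce exactly the reordering that the one-in-flight design rules out.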


Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-15 Thread Niklas Nielsen
Thanks for your input Ben! (Comments inlined)

On 15 September 2014 12:35, Benjamin Mahler benjamin.mah...@gmail.com
wrote:

 To ensure that the architecture of mesos remains a scalable one, we want to
 persist state in the leaves of the system as much as possible. This is why
 the master has never persisted tasks, task states, or status updates. Note
 that status updates can contain arbitrarily large amounts of data at the
 current time (example impact of this:
 https://issues.apache.org/jira/browse/MESOS-1746).

 Adam, I think solution (2) you listed above of including a terminal
 indicator in the _private_ message between slave and master would easily
 allow us to release the resources at the correct time. We would still hold
 the correct task state in the master, and would maintain the status update
 stream invariants for frameworks (guaranteed ordered delivery). This would
 be simpler to implement with my recent change here, because you no longer
 have to remove the task to release the resources:
 https://reviews.apache.org/r/25568/


We should still hold the correct task state; meaning the actual state of
the task on the slave?
Then the auxiliary field should represent last known status, which may
not necessarily be terminal.
For example, a staging update followed by a running update while the
framework is disconnected will show as staging still - or am I missing
something?




 Longer term, adding pipelining of status updates would be a nice
 improvement (similar to what you listed in (1) above). But as Vinod said,
 it will require care to ensure we maintain the stream invariants for
 frameworks (i.e. probably need to send multiple updates in 1 message).

 How does this sound?


Sounds great - we would love to help out with (2). Would you be up for
shepherding such a change?




Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-15 Thread Benjamin Mahler
On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen nik...@mesosphere.io
wrote:

 Thanks for your input Ben! (Comments inlined)

 On 15 September 2014 12:35, Benjamin Mahler benjamin.mah...@gmail.com
 wrote:

  To ensure that the architecture of mesos remains a scalable one, we want to
  persist state in the leaves of the system as much as possible. This is why
  the master has never persisted tasks, task states, or status updates. Note
  that status updates can contain arbitrarily large amounts of data at the
  current time (example impact of this:
  https://issues.apache.org/jira/browse/MESOS-1746).

  Adam, I think solution (2) you listed above of including a terminal
  indicator in the _private_ message between slave and master would easily
  allow us to release the resources at the correct time. We would still hold
  the correct task state in the master, and would maintain the status update
  stream invariants for frameworks (guaranteed ordered delivery). This would
  be simpler to implement with my recent change here, because you no longer
  have to remove the task to release the resources:
  https://reviews.apache.org/r/25568/


 We should still hold the correct task state; meaning the actual state of
 the task on the slave?


Correct, as in, just maintaining the existing task state semantics in the
master (for correctness of reconciliation / status update ordering).


 Then the auxiliary field should represent last known status, which may
 not necessarily be terminal.
 For example, a staging update followed by a running update while the
 framework is disconnected will show as staging still - or am I missing
 something?


It would still show as staging; the running update would never be sent to
the master since the slave is not receiving acknowledgements. But if there
were a terminal task, then the slave would be setting the auxiliary field
in the StatusUpdateMessage and the master will know to release the
resources.

+ backwards compatibility
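The behavior described above can be sketched as a toy model in Python (Mesos is C++; `latest_terminal_state` is an invented name for the auxiliary field under discussion, not an actual StatusUpdateMessage field):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StatusUpdateMessage:
    """Toy version of the private slave->master message. The hypothetical
    'latest_terminal_state' field is set by the slave when a terminal
    update is queued behind unacknowledged ones."""
    task_id: str
    state: str                           # front of the stream
    latest_terminal_state: Optional[str] = None

class Master:
    def __init__(self):
        self.task_state = {}             # task_id -> last state seen
        self.released = set()            # tasks whose resources were freed

    def handle(self, msg):
        # Task state keeps its existing semantics: the front of the stream
        # (needed for reconciliation / status update ordering).
        self.task_state[msg.task_id] = msg.state
        # But resources are released as soon as the slave reports that a
        # terminal update exists, without removing the task.
        if msg.latest_terminal_state is not None:
            self.released.add(msg.task_id)

master = Master()
master.handle(StatusUpdateMessage("t1", "TASK_STAGING", "TASK_FINISHED"))
assert master.task_state["t1"] == "TASK_STAGING"   # still shows staging
assert "t1" in master.released                     # but resources freed
```

An old master that ignores the unknown field simply keeps today's behavior, which is the backwards-compatibility point above.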




 
 
  Longer term, adding pipelining of status updates would be a nice
  improvement (similar to what you listed in (1) above). But as Vinod said,
  it will require care to ensure we maintain the stream invariants for
  frameworks (i.e. probably need to send multiple updates in 1 message).
 
  How does this sound?
 

 Sounds great - we would love to help out with (2). Would you be up for
 shepherding such a change?


Yep!

The changes needed should be based off of
https://reviews.apache.org/r/25568/ since it changes the resource releasing
in the master.




 

Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-11 Thread Vinod Kone
The semantics of these changes would have an impact on the upcoming task
reconciliation.

@BenM: Can you chime in here on how this fits into the task reconciliation
work that you've been leading?

On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon a...@mesosphere.io wrote:

 I agree with Niklas that if the executor has sent a terminal status update
 to the slave, then the task is done and the master should be able to
 recover those resources. Only sending the oldest status update to the
 master, especially in the case of framework failover, prevents these
 resources from being recovered in a timely manner. I see a couple of
 options for getting around this, each with their own disadvantages.
 1) Send the entire status update stream to the master. Once the master sees
 the terminal status update, it will removeTask and recover the resources.
 Future resends of the update will be forwarded to the scheduler, but the
 master will ignore (with warning and invalid_update++ metrics) the
 subsequent updates as far as its own state for the removed task is
 concerned. Disadvantage: Potentially sends a lot of status update messages
 until the scheduler reregisters and acknowledges the updates.
 Disadvantage2: Updates could be sent to the scheduler out of order if some
 updates are dropped between the slave and master.
 2) Send only the oldest status update to the master, but with an annotation
 of the final/terminal state of the task, if any. That way the master can
 call removeTask to update its internal state for the task (and update the
 UI) and recover the resources for the task. While the scheduler is still
 down, the oldest update will continue to be resent and forwarded, but the
 master will ignore the update (with a warning as above) as far as its own
 internal state is concerned. When the scheduler reregisters, the update
 stream will be forwarded and acknowledged one-at-a-time as before,
 guaranteeing status update ordering to the scheduler. Disadvantage: Seems a
 bit hacky to tack a terminal state onto a running update. Disadvantage2:
 State endpoint won't show all the status updates until the entire stream
 actually gets forwarded+acknowledged.
 Thoughts?
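Option (1) above can be sketched as a toy model (Python rather than Mesos's C++; the `invalid_update` counter mirrors the proposed metric, everything else is invented for illustration):

```python
class MasterOpt1:
    """Toy model of option (1): the master consumes the full update
    stream, removes the task on the first terminal update, and merely
    forwards (while counting) any later resends for the removed task."""
    TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

    def __init__(self):
        self.tasks = {"t1": "TASK_STAGING"}  # one pre-registered task
        self.invalid_update = 0              # metric from the proposal

    def on_update(self, task_id, state):
        if task_id not in self.tasks:
            self.invalid_update += 1         # ignore for master's own state
            return "forwarded"               # but still pass to scheduler
        self.tasks[task_id] = state
        if state in self.TERMINAL:
            del self.tasks[task_id]          # removeTask + recover resources
        return "forwarded"

m = MasterOpt1()
m.on_update("t1", "TASK_RUNNING")
m.on_update("t1", "TASK_FINISHED")
assert "t1" not in m.tasks           # removed on terminal update
m.on_update("t1", "TASK_FINISHED")   # resend after removal
assert m.invalid_update == 1
```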


  On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone vinodk...@gmail.com wrote:

   The main reason is to keep the status update manager simple. Also, it is
   very easy to enforce the order of updates to the master/framework in this
   model. If we allow multiple updates for a task to be in flight, it's
   really hard (impossible?) to ensure that we are not delivering
   out-of-order updates even in edge cases (failover, network partitions,
   etc.).

   On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen nik...@mesosphere.io
   wrote:

    Hey Vinod - thanks for chiming in!

    Is there a particular reason for only having one status in flight? Or,
    to put it another way, isn't that behavior too strict, given that the
    master state could present the most recent known state if the status
    update manager tried to send more than the front of the stream?
    With very long timeouts, just waiting for those to disappear seems a bit
    tedious and hogs the cluster.

    Niklas
  

Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-11 Thread Adam Bordelon
Definitely relevant. If the master could be trusted to persist all the task
status updates, then they could be queued up at the master instead of the
slave once the master has acknowledged its receipt. Then the master could
have the most up-to-date task state and can recover the resources as soon
as it receives a terminal update, even if the framework is disconnected and
unable to receive/ack the status updates. Then, once the framework
reconnects, the master will be responsible for sending its queued status
updates. We will still need a queue on the slave side, but only for updates
that the master has not persisted and ack'ed, primarily during the scenario
when the slave is disconnected from the master.
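Adam's proposed hand-off can be sketched as a toy model in Python (an illustration only, not Mesos's C++ implementation; all class and method names are invented):

```python
class MasterQueue:
    """Toy master that persists updates, acks the slave, and recovers
    resources as soon as it sees a terminal update."""
    TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

    def __init__(self):
        self.queued = []                 # persisted, awaiting framework acks
        self.resources_released = False

    def persist(self, state):
        self.queued.append(state)
        if state in self.TERMINAL:       # recover resources immediately,
            self.resources_released = True  # even if the framework is away

class Slave:
    """Toy slave: queues updates only until the master persists and acks
    them, e.g. while partitioned from the master."""
    def __init__(self, master):
        self.master = master
        self.unpersisted = []            # updates the master has not acked

    def update(self, state):
        self.unpersisted.append(state)

    def flush(self, connected=True):
        if not connected:
            return                       # keep queuing while disconnected
        for state in list(self.unpersisted):
            self.master.persist(state)
            self.unpersisted.remove(state)  # master acked: drop our copy

master = MasterQueue()
slave = Slave(master)
slave.update("TASK_RUNNING")
slave.update("TASK_FINISHED")
slave.flush()
assert slave.unpersisted == []                 # slave-side queue drained
assert master.resources_released               # terminal update reached master
assert master.queued == ["TASK_RUNNING", "TASK_FINISHED"]
```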

On Thu, Sep 11, 2014 at 10:17 AM, Vinod Kone vinodk...@gmail.com wrote:

 The semantics of these changes would have an impact on the upcoming task
 reconciliation.

 @BenM: Can you chime in here on how this fits into the task reconciliation
 work that you've been leading?


Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-10 Thread Niklas Nielsen
Hi guys,

We have run into a problem that causes tasks which complete, when a
framework is disconnected and has a fail-over time, to remain in a running
state even though the tasks actually finish.

Here is a test framework we have been able to reproduce the issue with:
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep) and when killing the
framework instance, the master reports the tasks as running even after
several minutes:
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 runs, the
slave knows that it completed:
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

The tasks only finish when the framework connects again (which it may never
do). This is on Mesos 0.20.0, but also applies to HEAD (as of today).
Do you guys have any insights into what may be going on here? Is this
by-design or a bug?

Thanks,
Niklas


Re: Completed tasks remain in TASK_RUNNING when the framework is disconnected

2014-09-10 Thread Niklas Nielsen
Here is the log of a mesos-local instance where I reproduced it:
https://gist.github.com/nqn/f7ee20601199d70787c0 (Here tasks 10 to 19 are
stuck in running state).
There is a lot of output, so here is a filtered log for task 10:
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

At first glance, it looks like the task can't be found when trying to
forward the finish update because the running update never got acknowledged
before the framework disconnected. I may be missing something here.

Niklas




Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Vinod Kone
What you observed is expected because of the way the slave (specifically,
the status update manager) operates.

The status update manager only sends the next update for a task if a
previous update (if it exists) has been acked.

In your case, since the TASK_RUNNING update was not acked by the framework,
the master doesn't know about the TASK_FINISHED update that is queued up by
the status update manager.

If the framework never comes back, i.e., the failover timeout elapses, the
master shuts down the framework, which releases those resources.
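
The ack-gating rule described above can be sketched as follows. This is a
minimal illustration, not the actual Mesos status update manager; the type
and member names (`UpdateStream`, `pending`, `frontInFlight`) are invented
for the example.

```cpp
#include <optional>
#include <queue>
#include <string>

// Toy model of a per-task status update stream: only the front of the
// stream is ever in flight, and later updates stay queued until the
// front has been acknowledged by the framework.
struct UpdateStream {
  std::queue<std::string> pending;   // e.g. TASK_RUNNING, TASK_FINISHED
  bool frontInFlight = false;

  // Returns the next update to forward, or nothing if we are still
  // waiting for an ack of the current front.
  std::optional<std::string> next() {
    if (pending.empty() || frontInFlight) return std::nullopt;
    frontInFlight = true;
    return pending.front();
  }

  // Acknowledging the front releases the next update for forwarding.
  void acknowledge() {
    frontInFlight = false;
    pending.pop();
  }
};
```

With TASK_RUNNING and TASK_FINISHED queued, `next()` yields TASK_RUNNING
and then nothing until `acknowledge()` is called, which is exactly why a
disconnected framework leaves the master seeing the task as running.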

On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen nik...@mesosphere.io
wrote:

 Here is the log of a mesos-local instance where I reproduced it:
 https://gist.github.com/nqn/f7ee20601199d70787c0 (Here task 10 to 19 are
 stuck in running state).
 There is a lot of output, so here is a filtered log for task 10:
 https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

 At first glance, it looks like the task can't be found when trying to
 forward the finish update because the running update never got acknowledged
 before the framework disconnected. I may be missing something here.

 Niklas


 On 10 September 2014 16:09, Niklas Nielsen nik...@mesosphere.io wrote:

  Hi guys,
 
  We have run into a problem that cause tasks which completes, when a
  framework is disconnected and has a fail-over time, to remain in a
 running
  state even though the tasks actually finishes.
 
  Here is a test framework we have been able to reproduce the issue with:
  https://gist.github.com/nqn/9b9b1de9123a6e836f54
  It launches many short-lived tasks (1 second sleep) and when killing the
  framework instance, the master reports the tasks as running even after
  several minutes:
 
 http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
 
  When clicking on one of the slaves where, for example, task 49 runs; the
  slave knows that it completed:
 
 http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
 
  The tasks only finish when the framework connects again (which it may
  never do). This is on Mesos 0.20.0, but also applies to HEAD (as of
 today).
  Do you guys have any insights into what may be going on here? Is this
  by-design or a bug?
 
  Thanks,
  Niklas
 



Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Vinod Kone
The main reason is to keep the status update manager simple. Also, it is very
easy to enforce the order of updates to the master/framework in this model.
If we allowed multiple updates for a task to be in flight, it would be really
hard (impossible?) to ensure that we are not delivering out-of-order updates,
even in edge cases (failover, network partitions, etc.).
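
The reordering hazard is easy to see with a toy example (illustrative only,
not Mesos code): if two updates are in flight at once and the first send is
dropped, its retry lands after the second update, so the framework observes
TASK_FINISHED before TASK_RUNNING.

```cpp
#include <string>
#include <vector>

// Simulates two updates in flight over a lossy channel. If the first
// send of TASK_RUNNING is dropped, its retry arrives only after
// TASK_FINISHED has already been delivered.
std::vector<std::string> deliver(bool dropFirstSend) {
  std::vector<std::string> seenByFramework;
  if (!dropFirstSend) seenByFramework.push_back("TASK_RUNNING");
  seenByFramework.push_back("TASK_FINISHED");
  if (dropFirstSend) seenByFramework.push_back("TASK_RUNNING");  // late retry
  return seenByFramework;
}
```

Gating on acks, as the status update manager does, avoids this by never
having more than one update outstanding per task.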

On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen nik...@mesosphere.io
wrote:

 Hey Vinod - thanks for chiming in!

 Is there a particular reason for only having one status update in flight? Or,
 to put it another way, isn't that behavior too strict, given that the master
 state could present the most recent known state if the status update
 manager sent more than just the front of the stream?
 Given very long failover timeouts, just waiting for those tasks to disappear
 seems a bit tedious and hogs the cluster.

 Niklas

 On 10 September 2014 17:18, Vinod Kone vinodk...@gmail.com wrote:

  What you observed is expected because of the way the slave (specifically,
  the status update manager) operates.
 
  The status update manager only sends the next update for a task if a
  previous update (if it exists) has been acked.
 
  In your case, since TASK_RUNNING was not acked by the framework, master
  doesn't know about the TASK_FINISHED update that is queued up by the
 status
  update manager.
 
  If the framework never comes back, i.e., failover timeout elapses, master
  shuts down the framework, which releases those resources.
 
  On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen nik...@mesosphere.io
  wrote:
 
   Here is the log of a mesos-local instance where I reproduced it:
   https://gist.github.com/nqn/f7ee20601199d70787c0 (Here task 10 to 19
 are
   stuck in running state).
   There is a lot of output, so here is a filtered log for task 10:
   https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
  
   At first glance, it looks like the task can't be found when trying to
   forward the finish update because the running update never got
  acknowledged
   before the framework disconnected. I may be missing something here.
  
   Niklas
  
  
   On 10 September 2014 16:09, Niklas Nielsen nik...@mesosphere.io
 wrote:
  
Hi guys,
   
We have run into a problem that cause tasks which completes, when a
framework is disconnected and has a fail-over time, to remain in a
   running
state even though the tasks actually finishes.
   
Here is a test framework we have been able to reproduce the issue
 with:
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep) and when killing
  the
framework instance, the master reports the tasks as running even
 after
several minutes:
   
  
 
 http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png
   
When clicking on one of the slaves where, for example, task 49 runs;
  the
slave knows that it completed:
   
  
 
 http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png
   
The tasks only finish when the framework connects again (which it may
never do). This is on Mesos 0.20.0, but also applies to HEAD (as of
   today).
Do you guys have any insights into what may be going on here? Is this
by-design or a bug?
   
Thanks,
Niklas
   
  
 



Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Adam Bordelon
I agree with Niklas that if the executor has sent a terminal status update
to the slave, then the task is done and the master should be able to
recover those resources. Only sending the oldest status update to the
master, especially in the case of framework failover, prevents these
resources from being recovered in a timely manner. I see a couple of
options for getting around this, each with its own disadvantages.

1) Send the entire status update stream to the master. Once the master sees
the terminal status update, it will removeTask and recover the resources.
Future resends of the update will still be forwarded to the scheduler, but
the master will ignore the subsequent updates (with a warning and an
invalid_update++ metric) as far as its own state for the removed task is
concerned. Disadvantage: potentially sends a lot of status update messages
until the scheduler reregisters and acknowledges the updates.
Disadvantage 2: updates could be delivered to the scheduler out of order if
some updates are dropped between the slave and master.

2) Send only the oldest status update to the master, but with an annotation
of the final/terminal state of the task, if any. That way the master can
call removeTask to update its internal state for the task (and the UI) and
recover the task's resources. While the scheduler is still down, the oldest
update will continue to be resent and forwarded, but the master will ignore
it (with a warning, as above) as far as its own internal state is
concerned. When the scheduler reregisters, the update stream will be
forwarded and acknowledged one at a time as before, guaranteeing status
update ordering to the scheduler. Disadvantage: it seems a bit hacky to
tack a terminal state onto a running update. Disadvantage 2: the state
endpoint won't show all the status updates until the entire stream actually
gets forwarded and acknowledged.

Thoughts?
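
Option (2) could look roughly like this. The fields and helpers here
(`latestState`, `shouldReleaseResources`) are hypothetical, not the real
Mesos protobufs or master code: the slave forwards only the oldest
unacknowledged update, annotated with the task's latest state, and the
master releases resources as soon as either state is terminal.

```cpp
#include <optional>
#include <string>

// Hypothetical shape of an annotated update: the oldest unacked state
// plus, optionally, the latest (possibly terminal) state known to the slave.
struct StatusUpdate {
  std::string state;                       // e.g. TASK_RUNNING
  std::optional<std::string> latestState;  // e.g. TASK_FINISHED, if terminal
};

bool isTerminal(const std::string& s) {
  return s == "TASK_FINISHED" || s == "TASK_FAILED" ||
         s == "TASK_KILLED" || s == "TASK_LOST";
}

// Master-side decision: recover the task's resources if either the
// forwarded state or the annotated latest state is terminal, even though
// the update stream itself is still being drained one ack at a time.
bool shouldReleaseResources(const StatusUpdate& u) {
  if (isTerminal(u.state)) return true;
  return u.latestState && isTerminal(*u.latestState);
}
```

This keeps the ordered, one-at-a-time stream to the scheduler untouched
while letting the master act on the terminal annotation immediately.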


On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone vinodk...@gmail.com wrote:

 The main reason is to keep status update manager simple. Also, it is very
 easy to enforce the order of updates to the master/framework in this model.
 If we allow multiple updates for a task to be in flight, it's really hard
 (impossible?) to ensure that we are not delivering out-of-order updates
 even in edge cases (failover, network partitions etc).

 On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen nik...@mesosphere.io
 wrote:

  Hey Vinod - thanks for chiming in!
 
  Is there a particular reason for only having one status in flight? Or to
  put it in another way, isn't that too strict behavior taken that the
 master
  state could present the most recent known state if the status update
  manager tried to send more than the front of the stream?
  Taken very long timeouts, just waiting for those to disappear seems a bit
  tedious and hogs the cluster.
 
  Niklas
 
  On 10 September 2014 17:18, Vinod Kone vinodk...@gmail.com wrote:
 
   What you observed is expected because of the way the slave
 (specifically,
   the status update manager) operates.
  
   The status update manager only sends the next update for a task if a
   previous update (if it exists) has been acked.
  
   In your case, since TASK_RUNNING was not acked by the framework, master
   doesn't know about the TASK_FINISHED update that is queued up by the
  status
   update manager.
  
   If the framework never comes back, i.e., failover timeout elapses,
 master
   shuts down the framework, which releases those resources.
  
   On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen nik...@mesosphere.io
   wrote:
  
Here is the log of a mesos-local instance where I reproduced it:
https://gist.github.com/nqn/f7ee20601199d70787c0 (Here task 10 to 19
  are
stuck in running state).
There is a lot of output, so here is a filtered log for task 10:
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
   
At first glance, it looks like the task can't be found when trying to
forward the finish update because the running update never got
   acknowledged
before the framework disconnected. I may be missing something here.
   
Niklas
   
   
On 10 September 2014 16:09, Niklas Nielsen nik...@mesosphere.io
  wrote:
   
 Hi guys,

 We have run into a problem that cause tasks which completes, when a
 framework is disconnected and has a fail-over time, to remain in a
running
 state even though the tasks actually finishes.

 Here is a test framework we have been able to reproduce the issue
  with:
 https://gist.github.com/nqn/9b9b1de9123a6e836f54
 It launches many short-lived tasks (1 second sleep) and when
 killing
   the
 framework instance, the master reports the tasks as running even
  after
 several minutes:

   
  
 
 http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

 When clicking on one of the slaves where, for example, task 49
 runs;
   the
 slave knows that it completed: