The old code doesn't look like it was able to send TASK_LOST updates in
such a situation either. The failed-over master simply doesn't have enough
information to do it: it never heard from the agent, and the agent is the
only thing that could tell it about its tasks. Could it be that the doc
refers to TASK_LOST updates produced in response to explicit reconciliation
requests?
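
For reference, an explicit reconciliation request over the v1 scheduler
HTTP API looks roughly like the sketch below (just a sketch, not Aurora's
actual code; the master address, framework ID, stream ID and task/agent
IDs are placeholders). The master answers on the event stream with the
latest known state, or TASK_LOST (TASK_UNREACHABLE for partition-aware
frameworks) for tasks it doesn't know about; an empty task list would be
an implicit reconciliation.

import requests

MASTER = "http://mesos-master.example.com:5050"  # placeholder master address

call = {
    "framework_id": {"value": "framework-id-from-subscribe"},
    "type": "RECONCILE",
    "reconcile": {
        # Explicit reconciliation: ask about specific tasks.
        # An empty list here would make it an implicit reconciliation.
        "tasks": [
            {"task_id": {"value": "some-task-id"},
             "agent_id": {"value": "some-agent-id"}},
        ]
    },
}

resp = requests.post(
    MASTER + "/api/v1/scheduler",
    json=call,
    headers={
        # Stream ID from the SUBSCRIBE response of the same framework.
        "Mesos-Stream-Id": "stream-id-from-subscribe",
    },
)
resp.raise_for_status()  # 202 Accepted; updates arrive on the event stream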

BTW, the doc seems to be a bit outdated. It mentions shutting down agents
that try to re-register after being removed due to failed health checks,
which is no longer true. Plus there's nothing about partition awareness.

On Mon, Jul 17, 2017 at 7:23 PM, David McLaughlin <dmclaugh...@apache.org>
wrote:

> Not sending TASK_LOST is a breaking change compared to previous behavior.
> From the docs here:
>
> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
>
> > When it is time to remove an agent, the master removes the agent from the
> > list of registered agents in the master’s durable state
> > <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> > (this will survive master failover). The master sends a slaveLost callback
> > to every registered scheduler driver; it also sends TASK_LOST status
> > updates for every task that was running on the removed agent.
>
>
> And then from the section on agent reregistration:
>
> > If an agent does not reregister with the new master within a timeout
> > (controlled by the --agent_reregister_timeout configuration flag), *the
> > master marks the agent as failed and follows the same steps described
> > above*. However, there is one difference: by default, agents are *allowed
> > to reconnect* following master failover, even after the
> > agent_reregister_timeout has fired. This means that frameworks might see
> > a TASK_LOST update for a task but then later discover that the task is
> > running (because the agent where it was running was allowed to reconnect).
> >
>
>
> Clearly the idea was that frameworks would see TASK_LOST every time the
> agent is marked as lost.
>
> This behavior appears to have been broken by this commit:
>
> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>
> Reconciliation is still required because message delivery is best-effort,
> but the fundamental difference is now frameworks *rely* on reconciliation
> for basic operation. We have plans to eventually adopt partition-awareness
> into Aurora, but IMO this change in behavior was an oversight when trying
> to maintain backwards compatibility and can be (harmlessly) fixed in Mesos.
>
> Cheers,
> David
>
> On 2017-07-17 09:20 (-0700), Ilya Pronin <i...@twopensource.com> wrote:
> > Hi,
> >
> > AFAIK the absence of TASK_LOST statuses is expected. Master registry
> > persists information only about agents. Tasks are recovered from
> > re-registering agents. Because of that the failed over master can't send
> > TASK_LOST for tasks that were running on the agent that didn't
> > re-register, it simply doesn't know about them. The only thing the master
> > can do in this situation is send LostSlaveMessage that will tell the
> > scheduler that tasks on this agent are LOST/UNREACHABLE.
> >
> > The situation where the agent came back after reregistration timeout
> > doesn't sound good. The only way for the framework to learn about tasks
> > that are still running on such agent is either from status updates or via
> > implicit reconciliation. Perhaps, the master could send updates for tasks
> > it learned about when such agent is readmitted?
> >
> > On Sun, Jul 16, 2017 at 5:54 AM, Meghdoot bhattacharya
> > <meghdoo...@yahoo.com.invalid> wrote:
> >
> > > This looks like a serious bug unless we are missing something. Hoping
> > > for clarifications.
> > >
> > > Thx
> > >
> > > > On Jul 14, 2017, at 3:52 PM, Renan DelValle <rd...@binghamton.edu>
> > > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We're using Mesos 1.1.0 and have observed some unexpected behavior
> > > > with regards to Agent reregistration on our cluster.
> > > >
> > > > When a health check failure happens, our framework (in this case
> > > > Apache Aurora) receives an Agent Lost message along with TASK_LOST
> > > > messages for each of the tasks that was currently running on the
> > > > agent that failed the health check (not responding after
> > > > *max_agent_ping_timeouts*).
> > > >
> > > > We expected the same behavior to take place when an Agent does not
> > > > register before the *agent_reregister_timeout* is up. However, while
> > > > our framework did receive an Agent Lost message after 10 minutes had
> > > > passed (default agent_reregister_timeout value) since leader
> > > > election, it did not receive any messages concerning the tasks that
> > > > were running on that node.
> > > >
> > > > This can create a scenario where, if the Agent goes away permanently,
> > > > we have tasks that are unaccounted for and won't be restarted on
> > > > another Agent until an explicit reconciliation is done.
> > > >
> > > > On the other hand, if the Agent does come back after the reregister
> > > > timeout, and the framework has replaced the missing instances, the
> > > > instances that were previously running will continue to run until an
> > > > implicit reconciliation is done.
> > > >
> > > > I understand some behavior may have changed with partition aware
> > > > frameworks, so I'm trying to understand if this is the expected
> > > > behavior.
> > > >
> > > > For what is worth, Aurora is not a partition aware framework.
> > > >
> > > > Any help would be appreciated,
> > > >
> > > > Thanks!
> > > > -Renan
> > >
> > >
> > --
> > Ilya Pronin
> >
>

-- 
Ilya Pronin
