Re: Agent reregistration timeout, no TASK_LOST messages

Yan Xu Mon, 17 Jul 2017 15:14:01 -0700

On Mon, Jul 17, 2017 at 9:34 AM, Neil Conway <neil.con...@gmail.com> wrote:


> On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin <ipro...@twopensource.com>
> wrote:
>
> > AFAIK the absence of TASK_LOST statuses is expected. Master registry
> > persists information only about agents. Tasks are recovered from
> > re-registering agents. Because of that the failed over master can't send
> > TASK_LOST for tasks that were running on the agent that didn't
> > re-register, it simply doesn't know about them. The only thing the
> > master can do in this situation is send LostSlaveMessage that will tell
> > the scheduler that tasks on this agent are LOST/UNREACHABLE.
> >
>
> +1.
>
> > The situation where the agent came back after reregistration timeout
> > doesn't sound good. The only way for the framework to learn about tasks
> > that are still running on such agent is either from status updates or via
> > implicit reconciliation. Perhaps, the master could send updates for tasks
> > it learned about when such agent is readmitted?
> >
>
> I agree this would be a good idea:
> https://issues.apache.org/jira/browse/MESOS-6406
>
> I haven't had a chance to implement it yet, but if someone is interested, I
> think this would be a pretty nicely scoped project.
>

The master should probably send updates about non-partition-aware framework
tasks as well, especially in light of MESOS-7215, for which we are going to
stop killing tasks in all cases.
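
To make the behaviour discussed above concrete, here is a minimal toy model
in Python. It is only a sketch, not Mesos code: ToyMaster, its fields, and
its methods are hypothetical names invented for illustration, and the
TASK_RUNNING update sent on readmission is the proposed behaviour from
MESOS-6406, not something the master is known to do today. The model assumes
the only state that survives failover is the set of agent IDs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a failed-over master (not a Mesos API)."""

    # Agent IDs recovered from the registry; the only persisted state.
    recovered_agents: set
    # Tasks become known only when an agent re-registers and reports them.
    tasks: dict = field(default_factory=dict)
    # Agents that missed the re-registration timeout.
    unreachable: set = field(default_factory=set)

    def reregister(self, agent_id, task_ids):
        """Record the agent's tasks; return (task_id, state) updates to send."""
        readmitted = agent_id in self.unreachable
        self.unreachable.discard(agent_id)
        self.recovered_agents.discard(agent_id)
        self.tasks[agent_id] = list(task_ids)
        # Assumed/proposed: on readmission, forward an update for every task
        # the master has just learned about.
        if readmitted:
            return [(t, "TASK_RUNNING") for t in task_ids]
        return []

    def reregistration_timeout(self):
        """Return agent IDs to announce as lost.

        No per-task updates are possible here: the master never learned
        which tasks were running on these agents.
        """
        lost = sorted(self.recovered_agents)
        self.unreachable.update(lost)
        self.recovered_agents.clear()
        return lost
```

In this model, an agent that misses the timeout is reported only at the agent
level, and the scheduler hears about its tasks only if that agent is later
readmitted or if the scheduler reconciles on its own.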


>
> Neil
>
