On Mon, Jul 17, 2017 at 9:34 AM, Neil Conway <[email protected]> wrote:
> On Mon, Jul 17, 2017 at 9:20 AM, Ilya Pronin <[email protected]> wrote:
>
> > AFAIK the absence of TASK_LOST statuses is expected. Master registry
> > persists information only about agents. Tasks are recovered from
> > re-registering agents. Because of that the failed over master can't send
> > TASK_LOST for tasks that were running on the agent that didn't re-register,
> > it simply doesn't know about them. The only thing the master can do in this
> > situation is send LostSlaveMessage that will tell the scheduler that tasks
> > on this agent are LOST/UNREACHABLE.
>
> +1.
>
> > The situation where the agent came back after reregistration timeout
> > doesn't sound good. The only way for the framework to learn about tasks
> > that are still running on such agent is either from status updates or via
> > implicit reconciliation. Perhaps, the master could send updates for tasks
> > it learned about when such agent is readmitted?
>
> I agree this would be a good idea:
> https://issues.apache.org/jira/browse/MESOS-6406
>
> I haven't had a chance to implement it yet, but if someone is interested, I
> think this would be a pretty nicely scoped project.

The master should probably send updates about non-partition-aware framework
tasks as well. Especially in light of MESOS-7215 for which we are going to
stop killing tasks in all cases.

> Neil
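
For anyone following along, here is a toy sketch of the failover behaviour discussed above. It is only a model, not Mesos code: the class, method, and message names (ToyMaster, reregister, "LOST_AGENT", "TASK_UPDATE") are hypothetical, "LOST_AGENT" stands in for LostSlaveMessage, and the update sent on late re-registration is the behaviour proposed in MESOS-6406, not something the master is confirmed to do today.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy model of a failed-over master; not the Mesos implementation."""

    # Agent IDs persisted in the registry. Tasks are NOT persisted.
    registry_agents: set
    # agent ID -> task IDs, learned only when an agent re-registers.
    tasks: dict = field(default_factory=dict)
    unreachable: set = field(default_factory=set)
    # Messages "sent" to the scheduler, recorded for inspection.
    outbox: list = field(default_factory=list)

    def reregister(self, agent_id, running_tasks):
        """An agent re-registers and reports the tasks it is running."""
        late = agent_id in self.unreachable
        self.unreachable.discard(agent_id)
        self.tasks[agent_id] = list(running_tasks)
        if late:
            # Proposed behaviour (assumption): tell the scheduler about
            # tasks learned from an agent readmitted after the timeout.
            for task_id in running_tasks:
                self.outbox.append(("TASK_UPDATE", agent_id, task_id))

    def reregistration_timeout(self):
        """Mark agents that never re-registered as unreachable."""
        for agent_id in sorted(self.registry_agents):
            if agent_id not in self.tasks:
                # The master knows the agent but not its tasks, so it can
                # only send an agent-level message, no per-task TASK_LOST.
                self.unreachable.add(agent_id)
                self.outbox.append(("LOST_AGENT", agent_id))
```

In this model, an agent that misses the timeout produces a single agent-level message, and its tasks only become visible to the scheduler once it re-registers.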
