This looks like a serious bug unless we are missing something. Hoping for clarifications.
Thx

> On Jul 14, 2017, at 3:52 PM, Renan DelValle <[email protected]> wrote:
>
> Hi all,
>
> We're using Mesos 1.1.0 and have observed some unexpected behavior with
> regard to Agent reregistration on our cluster.
>
> When a health check failure happens, our framework (in this case Apache
> Aurora) receives an Agent Lost message along with TASK_LOST messages for
> each of the tasks that were running on the agent that failed the
> health check (not responding after *max_agent_ping_timeouts*).
>
> We expected the same behavior to take place when an Agent does not reregister
> before the *agent_reregister_timeout* is up. However, while our framework
> did receive an Agent Lost message after 10 minutes had passed (default
> agent_reregister_timeout value) since leader election, it did not receive
> any messages concerning the tasks that were running on that node.
>
> This can create a scenario where, if the Agent goes away permanently, we
> have tasks that are unaccounted for and won't be restarted on another Agent
> until an explicit reconciliation is done.
>
> On the other hand, if the Agent does come back after the reregister
> timeout, and the framework has replaced the missing instances, the
> instances that were previously running will continue to run until an
> implicit reconciliation is done.
>
> I understand some behavior may have changed with partition aware
> frameworks, so I'm trying to understand if this is the expected behavior.
>
> For what it's worth, Aurora is not a partition aware framework.
>
> Any help would be appreciated,
>
> Thanks!
> -Renan
