This looks like a serious bug unless we are missing something. Hoping for 
clarifications.

Thx

> On Jul 14, 2017, at 3:52 PM, Renan DelValle <[email protected]> wrote:
> 
> Hi all,
> 
> We're using Mesos 1.1.0 and have observed some unexpected behavior with
> regard to Agent reregistration on our cluster.
> 
> When a health check failure happens, our framework (in this case Apache
> Aurora) receives an Agent Lost message along with a TASK_LOST message for
> each task that was running on the agent that failed the health check
> (not responding after *max_agent_ping_timeouts*).
> 
> We expected the same behavior to take place when an Agent does not
> reregister before the *agent_reregister_timeout* is up. However, while our
> framework did receive an Agent Lost message once 10 minutes (the default
> *agent_reregister_timeout* value) had passed since leader election, it did
> not receive any messages concerning the tasks that were running on that node.
> 
> This can create a scenario where, if the Agent goes away permanently, we
> have tasks that are unaccounted for and won't be restarted on another Agent
> until an explicit reconciliation is done.
> 
> On the other hand, if the Agent does come back after the reregister
> timeout, and the framework has replaced the missing instances, the
> instances that were previously running will continue to run until an
> implicit reconciliation is done.
> 
> I understand some behavior may have changed with partition-aware
> frameworks, so I'm trying to understand whether this is the expected behavior.
> 
> For what it's worth, Aurora is not a partition-aware framework.
> 
> Any help would be appreciated,
> 
> Thanks!
> -Renan
