Hi All,

We are working on a fix for a potential issue, MESOS-7215 
<https://issues.apache.org/jira/browse/MESOS-7215>, with partition awareness. 
It occurs when an unreachable agent that has tasks for non-partition-aware 
frameworks attempts to re-register with the master. Before support for 
partition-aware frameworks was introduced in Mesos 1.1.0 (MESOS-5344 
<https://issues.apache.org/jira/browse/MESOS-5344>), an agent that had been 
partitioned from the master and attempted to re-register was shut down, and 
all the tasks on the agent were terminated. With this feature, partitioned 
agents were no longer shut down by the master when they re-registered; 
however, to keep the old behavior, the tasks on these agents were still shut 
down if the corresponding framework had not opted in to partition awareness.

One possible solution to the issue described in MESOS-7215 
<https://issues.apache.org/jira/browse/MESOS-7215> is to change the master’s 
behavior so that it does not kill the tasks of non-partition-aware frameworks 
when an unreachable agent re-registers with the master. When an agent becomes 
unreachable, i.e. it fails max_agent_ping_timeouts consecutive health check 
pings from the master, the master sends TASK_LOST status updates for all the 
tasks on this agent that were launched by non-partition-aware frameworks. So, 
if such tasks are no longer killed by the master, then upon agent 
re-registration the frameworks will see non-terminal status updates for tasks 
for which they have already received a TASK_LOST.

We do not expect this change to break any schedulers: the same sequence of 
updates could already occur in the past with a non-strict registry, and 
schedulers are expected to be resilient enough to handle this scenario.
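
To make the scenario concrete, here is a minimal sketch in Python of how a 
scheduler might notice a task that reports a non-terminal state after it was 
already reported as TASK_LOST. TaskTracker, on_status_update and the revived 
set are hypothetical names used only for illustration; they are not part of 
any Mesos API, and the state names are plain strings standing in for the 
corresponding TaskState values.

```python
# Hypothetical scheduler-side bookkeeping; not part of any Mesos API.
# States that a non-partition-aware framework treats as terminal.
TERMINAL_STATES = {
    "TASK_FINISHED",
    "TASK_FAILED",
    "TASK_KILLED",
    "TASK_ERROR",
    "TASK_LOST",
}


class TaskTracker:
    """Tracks the last known state of each task by task ID."""

    def __init__(self):
        self.states = {}      # task_id -> last recorded state
        self.revived = set()  # tasks seen alive again after a terminal update

    def on_status_update(self, task_id, state):
        """Record an update; return True while the task counts as revived."""
        previous = self.states.get(task_id)
        if previous in TERMINAL_STATES and state not in TERMINAL_STATES:
            # The task was reported terminal (e.g. TASK_LOST) but is still
            # running, as can happen when its agent re-registers.
            self.revived.add(task_id)
        elif state in TERMINAL_STATES:
            # A terminal update ends any earlier "revived" status.
            self.revived.discard(task_id)
        self.states[task_id] = state
        return task_id in self.revived


tracker = TaskTracker()
tracker.on_status_update("task-1", "TASK_RUNNING")
tracker.on_status_update("task-1", "TASK_LOST")     # agent became unreachable
if tracker.on_status_update("task-1", "TASK_RUNNING"):
    # The scheduler decides here: adopt the task again, or kill it if a
    # replacement was already launched.
    print("task-1 is running again after TASK_LOST")
```

What the scheduler does with a revived task is its own policy decision; the 
sketch only shows that the situation can be detected from the update stream.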

We would like feedback from the community on the proposed solution to ensure 
that this change doesn’t break schedulers or cause any side effects for them. 
We look forward to any feedback or comments.

Many Thanks
Megha

