Hi All,

We are working on fixing a potential issue, MESOS-7215 <https://issues.apache.org/jira/browse/MESOS-7215>, with partition awareness. It occurs when an unreachable agent running tasks for non-partition-aware frameworks attempts to re-register with the master. Before support for partition-aware frameworks was introduced in Mesos 1.1.0 (MESOS-5344 <https://issues.apache.org/jira/browse/MESOS-5344>), an agent that had been partitioned from the master and then attempted to re-register was shut down, and all the tasks on the agent were terminated. With this feature, partitioned agents are no longer shut down by the master when they re-register. To keep the old behavior, however, the tasks on these agents are still shut down if the corresponding framework didn’t opt in to partition awareness.
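To make the behavior described above concrete, here is a small toy model in Python. It is only an illustration of the two behaviors (before and after Mesos 1.1.0), not actual Mesos code; the function name, the `TaskAction` enum, and the shape of the `tasks` argument are all made up for this sketch.

```python
from enum import Enum


class TaskAction(Enum):
    TERMINATE = "terminate"  # task is shut down when the agent re-registers
    KEEP = "keep"            # task keeps running


def on_unreachable_agent_reregister(tasks, partition_awareness_supported):
    """Toy model of the re-registration behavior; hypothetical, not Mesos code.

    `tasks` maps a task ID to True if its framework opted in to
    partition awareness, and to False otherwise.
    Returns a pair (agent_shut_down, {task_id: TaskAction}).
    """
    if not partition_awareness_supported:
        # Before Mesos 1.1.0: the agent is shut down, so every task is terminated.
        return True, {task_id: TaskAction.TERMINATE for task_id in tasks}

    # Mesos 1.1.0 and later: the agent survives, but tasks of frameworks
    # that did not opt in are still shut down to preserve the old behavior.
    actions = {
        task_id: TaskAction.KEEP if partition_aware else TaskAction.TERMINATE
        for task_id, partition_aware in tasks.items()
    }
    return False, actions
```

For example, with `tasks = {"a": True, "b": False}` and partition awareness supported, the agent is kept, task "a" keeps running, and task "b" is terminated.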
One possible solution to the issue described in MESOS-7215 <https://issues.apache.org/jira/browse/MESOS-7215> is to change the master’s behavior so that it does not kill the tasks of non-partition-aware frameworks when an unreachable agent re-registers with the master.

When an agent becomes unreachable, i.e. it fails the master’s health check ping max_agent_ping_timeouts times, the master sends TASK_LOST status updates for all the tasks on this agent that were launched by non-partition-aware frameworks. So, if such tasks are no longer killed by the master, then upon agent re-registration the frameworks will see non-terminal status updates for tasks for which they have already received a TASK_LOST. We expect this change not to break any schedulers, since the same sequence could already happen in the past with a non-strict registry, and schedulers are expected to be resilient enough to handle this scenario.

We would like feedback from the community on the proposed solution, to ensure that this change doesn’t break schedulers or cause any side effects for them. Looking forward to any feedback/comments.

Many Thanks,
Megha
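P.S. As a rough illustration of the scenario we are asking about, here is a minimal Python sketch of scheduler-side bookkeeping that notices a non-terminal update arriving after a TASK_LOST. The `TaskTracker` class, its method, and the returned labels are hypothetical and not part of any Mesos API; only the task state names come from Mesos, and the set of terminal states below is deliberately limited to a few well-known ones.

```python
# States treated as terminal in this sketch (a partial list, chosen for illustration).
TERMINAL_STATES = {
    "TASK_FINISHED",
    "TASK_FAILED",
    "TASK_KILLED",
    "TASK_LOST",
    "TASK_ERROR",
}


class TaskTracker:
    """Hypothetical scheduler bookkeeping; not a Mesos API."""

    def __init__(self):
        self.states = {}  # task ID -> last state seen

    def on_status_update(self, task_id, state):
        """Record the update and report whether a lost task came back."""
        previous = self.states.get(task_id)
        self.states[task_id] = state
        if previous == "TASK_LOST" and state not in TERMINAL_STATES:
            # The task was reported lost, but the agent re-registered and
            # the task is still alive. The scheduler must decide whether
            # to adopt it again or to kill it (for example, because a
            # replacement was already launched).
            return "resurrected"
        return "normal"
```

A scheduler that instead forgets a task entirely on TASK_LOST would see the later update as one for an unknown task, which is the other case we would like to hear about.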