If I understood correctly, the proposal is to not kill the tasks of non-partition-aware frameworks when a partitioned agent re-registers? That seems like a pretty big change for frameworks that are not partition aware and expect the old killing semantics.
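Concretely, a scheduler that treats TASK_LOST as terminal would now have to tolerate the same task "coming back" with a non-terminal update once the agent re-registers. A rough sketch of what that handling might look like against the v0 C++ scheduler API follows; handleUpdate and the lostTasks bookkeeping are made-up names for illustration, not anything in libmesos:

#include <set>
#include <string>

#include <mesos/mesos.hpp>      // TaskStatus, TaskState
#include <mesos/scheduler.hpp>  // SchedulerDriver

// Sketch: called from a scheduler's statusUpdate() callback. 'lostTasks'
// is the scheduler's own record of tasks it already saw TASK_LOST for.
void handleUpdate(mesos::SchedulerDriver* driver,
                  const mesos::TaskStatus& status,
                  std::set<std::string>* lostTasks)
{
  const std::string& taskId = status.task_id().value();

  if (status.state() == mesos::TASK_LOST) {
    lostTasks->insert(taskId);
    return;
  }

  if (lostTasks->count(taskId) > 0) {
    // The task "came back" after TASK_LOST: under the proposal the
    // master no longer kills it when the agent re-registers, so the
    // scheduler has to decide what to do -- here we kill it, assuming a
    // replacement was already launched when TASK_LOST arrived.
    driver->killTask(status.task_id());
    lostTasks->erase(taskId);
    return;
  }

  // ... normal handling for tasks that were never marked lost ...
}

Whether killing the resurrected task is even the right default is exactly the kind of decision such schedulers would now have to make themselves.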
It seems like we should just directly fix the issue; do you have a sense of
what the difficulty is there? Is it the re-use of the existing framework
shutdown message to kill the tasks that makes this problematic?

On Fri, May 26, 2017 at 3:19 PM, Megha Sharma <mshar...@apple.com> wrote:

> Hi All,
>
> We are working on fixing a potential issue, MESOS-7215
> <https://issues.apache.org/jira/browse/MESOS-7215>, with partition
> awareness that arises when an unreachable agent with tasks belonging to
> non-partition-aware frameworks attempts to re-register with the master.
> Before support for partition-aware frameworks was introduced in Mesos
> 1.1.0 (MESOS-5344 <https://issues.apache.org/jira/browse/MESOS-5344>),
> if an agent partitioned from the master attempted to re-register, it
> would be shut down and all the tasks on it terminated. With this
> feature, partitioned agents are no longer shut down by the master when
> they re-register, but to keep the old behavior the tasks on these
> agents are still shut down if the corresponding framework didn’t opt in
> to partition awareness.
>
> One possible solution to the issue described in MESOS-7215
> <https://issues.apache.org/jira/browse/MESOS-7215> is to change the
> master’s behavior to not kill the tasks of non-partition-aware
> frameworks when an unreachable agent re-registers with the master. When
> an agent goes unreachable, i.e. fails the master’s health-check pings
> for max_agent_ping_timeouts, the master sends TASK_LOST status updates
> for all the tasks on that agent that were launched by
> non-partition-aware frameworks. So, if such tasks are no longer killed
> by the master, then upon agent re-registration the frameworks will see
> non-terminal status updates for tasks for which they already received a
> TASK_LOST. This change will hopefully not break any schedulers, since
> the same situation could already arise with a non-strict registry, and
> schedulers are expected to be resilient enough to handle it.
>
> For the proposed solution we wanted to get feedback from the community
> to ensure that this change doesn’t break or cause any side effects for
> schedulers. Looking forward to any feedback/comments.
>
> Many Thanks
> Megha
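For reference, the partition-awareness opt-in mentioned above is just a capability on FrameworkInfo. A minimal sketch in C++ (the helper and framework name are hypothetical; the accessors are the protobuf-generated ones from mesos.proto):

#include <mesos/mesos.hpp>  // FrameworkInfo

// Sketch: mark a framework as partition aware, so the master keeps its
// tasks running (reported as TASK_UNREACHABLE rather than TASK_LOST)
// when their agent is partitioned away.
mesos::FrameworkInfo makePartitionAwareFrameworkInfo()
{
  mesos::FrameworkInfo framework;
  framework.set_user("");  // empty: have Mesos fill in the current user
  framework.set_name("my-framework");
  framework.add_capabilities()->set_type(
      mesos::FrameworkInfo::Capability::PARTITION_AWARE);
  return framework;
}

Frameworks that opt in receive TASK_UNREACHABLE instead of TASK_LOST and keep their tasks across the partition, so none of the above applies to them.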