If I understood correctly, the proposal is to not kill the tasks of
non-partition-aware frameworks? That seems like a pretty big change for
frameworks that are not partition aware and expect the old killing
semantics.

It seems like we should just fix the issue directly; do you have a sense
of what the difficulty is there? Is it the re-use of the existing
framework shutdown message to kill the tasks that makes this problematic?
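
For context, here is a toy sketch of the coupling in question, with purely
illustrative names (these are not the real Mesos internal messages):

    #include <iostream>
    #include <string>
    #include <vector>

    // Purely illustrative stand-ins; not the actual Mesos messages.
    struct Agent {
      std::string id;
      std::vector<std::string> taskIds;
    };

    // If killing tasks is only expressible as "shut down the framework
    // on this agent", the master cannot kill a stale task without also
    // tearing down the framework's executors there.
    void shutdownFrameworkOnAgent(Agent& agent, const std::string& frameworkId) {
      std::cout << "Shutting down framework " << frameworkId
                << " and all of its tasks/executors on agent " << agent.id << "\n";
    }

    // A more targeted primitive would kill exactly the tasks the master
    // sent TASK_LOST for, leaving the rest of the framework's state alone.
    void killTask(Agent& agent, const std::string& taskId) {
      std::cout << "Killing task " << taskId << " on agent " << agent.id << "\n";
    }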

On Fri, May 26, 2017 at 3:19 PM, Megha Sharma <mshar...@apple.com> wrote:

> Hi All,
>
> We are working on fixing a potential issue with partition awareness,
> MESOS-7215 <https://issues.apache.org/jira/browse/MESOS-7215>, which
> occurs when an unreachable agent with tasks from non-partition-aware
> frameworks attempts to re-register with the master. Before support for
> partition-aware frameworks was introduced in Mesos 1.1.0 (MESOS-5344
> <https://issues.apache.org/jira/browse/MESOS-5344>), if an agent that
> had been partitioned from the master attempted to re-register, it would
> be shut down and all the tasks on the agent would be terminated. With
> this feature, partitioned agents are no longer shut down by the master
> when they re-register, but to keep the old behavior the tasks on these
> agents are still shut down if the corresponding framework didn't opt in
> to partition awareness.
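>
> To make the two behaviors concrete, here is a rough, hypothetical sketch
> of the master's re-registration decision (the types and names are
> illustrative, not the actual Mesos code):
>
>     #include <iostream>
>     #include <string>
>     #include <vector>
>
>     // Illustrative stand-ins for the real Mesos types.
>     struct Framework {
>       std::string id;
>       bool partitionAware;  // Did the framework opt in to partition awareness?
>     };
>
>     struct Task {
>       std::string id;
>       Framework* framework;
>     };
>
>     // What the master does today when a previously unreachable agent
>     // re-registers: the agent itself is no longer shut down...
>     void onAgentReregistered(const std::vector<Task*>& agentTasks) {
>       for (Task* task : agentTasks) {
>         if (!task->framework->partitionAware) {
>           // ...but tasks of non-partition-aware frameworks are still
>           // killed to mimic the pre-1.1.0 semantics. The proposal is
>           // to skip this kill as well.
>           std::cout << "Killing task " << task->id << "\n";
>         }
>       }
>     }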
>
> One of the possible solutions to the issue described in MESOS-7215
> <https://issues.apache.org/jira/browse/MESOS-7215> is to change the
> master's behavior to not kill the tasks of non-partition-aware
> frameworks when an unreachable agent re-registers with the master. When
> an agent goes unreachable, i.e. fails the master's health-check pings
> for max_agent_ping_timeouts, the master sends TASK_LOST status updates
> for all the tasks on that agent that were launched by
> non-partition-aware frameworks. So, if such tasks are no longer killed
> by the master, then upon agent re-registration the frameworks will see
> non-terminal status updates for tasks for which they already received a
> TASK_LOST; a sketch of such a scheduler-side handler follows below.
> This change will hopefully not break any schedulers, since the same
> situation could already arise with a non-strict registry, and schedulers
> are expected to be resilient enough to handle this scenario.
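>
> As a rough illustration (the handler signature is hypothetical, not the
> actual scheduler API), a resilient scheduler might handle this as:
>
>     #include <iostream>
>     #include <set>
>     #include <string>
>
>     // Hypothetical task states, mirroring Mesos' TaskState values.
>     enum class TaskState { TASK_RUNNING, TASK_LOST, TASK_FINISHED };
>
>     // Tasks for which the scheduler has already seen TASK_LOST.
>     std::set<std::string> lostTasks;
>
>     // Tolerate a non-terminal update arriving after TASK_LOST, i.e.
>     // the task "came back" when its agent re-registered.
>     void statusUpdate(const std::string& taskId, TaskState state) {
>       if (state == TaskState::TASK_LOST) {
>         lostTasks.insert(taskId);
>         return;
>       }
>       if (lostTasks.count(taskId) > 0) {
>         // The task was presumed lost but is still running: reconcile
>         // by adopting it again or explicitly killing it, rather than
>         // treating the update as an error.
>         std::cout << "Task " << taskId << " resurrected; reconciling\n";
>         lostTasks.erase(taskId);
>       }
>     }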
>
> For the proposed solution, we wanted to get feedback from the community
> to ensure that this change doesn't break or cause any side effects for
> schedulers. Looking forward to any feedback/comments.
>
> Many Thanks
> Megha
>
