Hi Ben,

The argument for changing the semantics is that correct frameworks
should _always_ have accounted for the possibility that TASK_LOST
tasks would go back to running (due to the non-strict registry
semantics). The proposed change would just increase the probability of
this behavior occurring. From a certain POV, this change would
actually make it easier to write correct frameworks because the
TASK_LOST scenario will be less of a corner case :)
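
To make the expectation on frameworks concrete, here is a minimal sketch of scheduler-side bookkeeping that tolerates a task coming back after TASK_LOST. This is not the Mesos scheduler API: `TaskTracker`, `on_status_update`, `needs_replacement`, and the `DEFINITIVE_STATES` set are hypothetical names chosen for illustration, and treating TASK_LOST as provisional is an assumption of the sketch.

```python
# Hypothetical scheduler-side bookkeeping; not the Mesos scheduler API.
# Assumption: these states are final, whereas TASK_LOST is only provisional
# because the task may still be running on a partitioned agent.
DEFINITIVE_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_ERROR"}


class TaskTracker:
    def __init__(self):
        self.states = {}  # task_id -> last accepted state

    def on_status_update(self, task_id, state):
        """Record a status update and return the state now on record."""
        previous = self.states.get(task_id)
        if previous in DEFINITIVE_STATES:
            # The task really ended; ignore any later updates.
            return previous
        # Accept the update even if the previous state was TASK_LOST:
        # the agent may have re-registered with the task still running.
        self.states[task_id] = state
        return state

    def needs_replacement(self, task_id):
        """A task is a candidate for relaunch only while it is still lost."""
        return self.states.get(task_id) == "TASK_LOST"


tracker = TaskTracker()
tracker.on_status_update("t1", "TASK_RUNNING")
tracker.on_status_update("t1", "TASK_LOST")     # agent became unreachable
tracker.on_status_update("t1", "TASK_RUNNING")  # agent re-registered
```

After the last update, the tracker no longer considers `t1` in need of replacement; a scheduler that had already launched a replacement would have to decide which copy to keep.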

Implementing the task-killing behavior is a bit tricky, because the
task might continue to run on the agent for a considerable period of
time. During that time, we can either:

(a) omit the being-killed task from the master's memory (current
behavior). That means that any resources used by the task appear to be
unused, so there might be a concurrent task launch that attempts to
use them and fails.

(b) track the being-killed task in the master's memory. This ensures
the task's resources are not re-offered until the task is actually
terminated. The concern here is that this "being-killed" task is in a
weird state -- what task status should it have? When it finally dies,
we don't want to report a terminal status update back to frameworks
(for backward compatibility).
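
The difference between (a) and (b) can be illustrated with a toy model of the master's resource accounting. This is only a sketch under stated assumptions, not the master's actual implementation: `ToyMaster`, its methods, and the single `cpus` resource are hypothetical.

```python
# Toy model of the two options; all names here are hypothetical and do not
# correspond to the real Mesos master code.
class ToyMaster:
    def __init__(self, total_cpus, track_being_killed):
        self.total_cpus = total_cpus
        # False models option (a), True models option (b).
        self.track_being_killed = track_being_killed
        self.running = {}       # task_id -> cpus
        self.being_killed = {}  # task_id -> cpus, used only in option (b)

    def launch(self, task_id, cpus):
        self.running[task_id] = cpus

    def start_kill(self, task_id):
        """The master asks the agent to kill the task; the agent may take a while."""
        cpus = self.running.pop(task_id)
        if self.track_being_killed:
            self.being_killed[task_id] = cpus

    def kill_completed(self, task_id):
        """The agent confirms that the task has actually terminated."""
        self.being_killed.pop(task_id, None)

    def offerable_cpus(self):
        used = sum(self.running.values()) + sum(self.being_killed.values())
        return self.total_cpus - used


option_a = ToyMaster(total_cpus=4, track_being_killed=False)
option_a.launch("t1", 4)
option_a.start_kill("t1")
print(option_a.offerable_cpus())  # 4: offered although t1 may still hold them

option_b = ToyMaster(total_cpus=4, track_being_killed=True)
option_b.launch("t1", 4)
option_b.start_kill("t1")
print(option_b.offerable_cpus())  # 0: held back until the kill completes
```

The sketch leaves open the question raised in (b): what status a task in `being_killed` should report, and how to avoid sending a second terminal update once it dies.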

Neither approach seems ideal, so we are wondering whether we really
need to implement this backward-compatibility behavior in the first
place.

Neil

On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> If I understood correctly, the proposal is to not kill the tasks for
> non-partition aware frameworks? That seems like a pretty big change for
> frameworks that are not partition aware and expect the old killing
> semantics.
>
> It seems like we should just directly fix the issue. Do you have a sense of
> what the difficulty is there? Is it the re-use of the existing framework
> shutdown message to kill the tasks that makes this problematic?
>
> On Fri, May 26, 2017 at 3:19 PM, Megha Sharma <mshar...@apple.com> wrote:
>>
>> Hi All,
>>
>> We are working on fixing a potential issue (MESOS-7215) with partition
>> awareness that occurs when an unreachable agent, with tasks for
>> non-partition-aware frameworks, attempts to re-register with the master.
>> Before support for partition-aware frameworks was introduced in
>> Mesos 1.1.0 (MESOS-5344), if an agent partitioned from the master attempted
>> to re-register, it would be shut down and all the tasks on the agent
>> would be terminated. With this feature, partitioned agents are no
>> longer shut down by the master when they re-register, but to keep the old
>> behavior the tasks on these agents are still shut down if the corresponding
>> framework didn’t opt in to partition awareness.
>>
>> One possible solution to the issue described in MESOS-7215
>> is to change the master’s behavior so that it does not kill the tasks of
>> non-partition-aware frameworks when an unreachable agent re-registers with
>> the master. When an agent becomes unreachable, i.e. fails the master’s
>> health check pings for max_agent_ping_timeouts, the master sends TASK_LOST
>> status updates for all the tasks on this agent that were launched by
>> non-partition-aware frameworks. So, if such tasks are no longer killed by
>> the master, then upon agent re-registration the frameworks will see
>> non-terminal status updates for tasks for which they already received a
>> TASK_LOST.
>> This change will hopefully not break any schedulers, since this could have
>> happened in the past with the non-strict registry as well, and schedulers
>> are expected to be resilient enough to handle this scenario.
>>
>> For the proposed solution, we wanted to get feedback from the community to
>> ensure that this change doesn’t break or cause any side effects for the
>> schedulers. Looking forward to any feedback or comments.
>>
>> Many Thanks
>> Megha
>>
>>
>
