Re: RFC: Partition Awareness

2017-10-05 Thread James Peach

> On Jun 21, 2017, at 10:16 AM, Megha Sharma  wrote:
> 
> Thank you all for the feedback.
> To summarize, not killing tasks for non-partition-aware frameworks will
> make the schedulers see a higher volume of non-terminal updates for tasks
> for which they have already received a TASK_LOST, but nothing they are not
> already seeing today. So this shouldn’t be a breaking change for
> frameworks, and it will make the partition-awareness logic simpler. I will
> update MESOS-7215 with the details once the design is ready.

What happens for short-lived frameworks? That is, the lost task comes back, 
causing the master to track its framework as disconnected, but the framework is 
gone and will never return.
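For what it's worth, the corner case can be sketched with a toy model (plain Python, invented names, not Mesos internals): the resurrected task's framework has already torn down, so marking it "disconnected" strands the task forever:

```python
# Toy illustration of the short-lived-framework corner case. Names here
# (MasterState, framework_completed, etc.) are invented for the sketch.

class MasterState:
    def __init__(self):
        self.frameworks = {}    # framework_id -> lifecycle state
        self.orphan_tasks = []  # tasks whose framework will never return

    def framework_completed(self, fw_id):
        # A short-lived framework finishes and unregisters.
        self.frameworks[fw_id] = "completed"

    def on_agent_reregister(self, fw_id, task_id):
        # A partitioned agent comes back carrying a task for that framework.
        # If the master revives the framework as "disconnected", nothing
        # will ever reconnect to claim or kill the task.
        if self.frameworks.get(fw_id) == "completed":
            self.frameworks[fw_id] = "disconnected"
            self.orphan_tasks.append(task_id)

master = MasterState()
master.framework_completed("short-lived-fw")
master.on_agent_reregister("short-lived-fw", "task-42")
assert master.frameworks["short-lived-fw"] == "disconnected"
assert master.orphan_tasks == ["task-42"]
```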

J

Re: RFC: Partition Awareness

2017-06-21 Thread Megha Sharma
Thank you all for the feedback.
To summarize, not killing tasks for non-partition-aware frameworks will make
the schedulers see a higher volume of non-terminal updates for tasks for
which they have already received a TASK_LOST, but nothing they are not
already seeing today. So this shouldn’t be a breaking change for frameworks,
and it will make the partition-awareness logic simpler. I will update
MESOS-7215 with the details once the design is ready.

Thanks
Megha Sharma

On Jun 1, 2017, at 2:56 PM, Vinod Kone  wrote:

On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:

If I understood correctly, the proposal is to not kill the tasks for
non-partition aware frameworks? That seems like a pretty big change for
frameworks that are not partition aware and expect the old killing
semantics.


Adding to what Neil said, I think most (if not all) non-PA frameworks
would've already rescheduled the task after seeing a TASK_LOST. The
difference is that previously such tasks can come back to TASK_RUNNING iff
master fails over and non-strict registry (default) is used. Now, we are
saying tasks can come back to TASK_RUNNING irrespective of master failover.
The assumption/hope is that this shouldn't break existing frameworks
in a catastrophic way.

> On Jun 1, 2017, at 2:30 PM, Neil Conway  wrote:
> 
> Hi Ben,
> 
> The argument for changing the semantics is that correct frameworks
> should _always_ have accounted for the possibility that TASK_LOST
> tasks would go back to running (due to the non-strict registry
> semantics). The proposed change would just increase the probability of
> this behavior occurring. From a certain POV, this change would
> actually make it easier to write correct frameworks because the
> TASK_LOST scenario will be less of a corner case :)
> 
> Implementing the task-killing behavior is a bit tricky, because the
> task might continue to run on the agent for a considerable period of
> time. During that time, we can either:
> 
> (a) omit the being-killed task from the master's memory (current
> behavior). That means that any resources used by the task appear to be
> unused, so there might be a concurrent task launch that attempts to
> use them and fails.
> 
> (b) track the being-killed task in the master's memory. This ensures
> the task's resources are not re-offered until the task is actually
> terminated. The concern here is that this "being-killed" task is in a
> weird state -- what task status should it have? When it finally dies,
> we don't want to report a terminal status update back to frameworks
> (for backward compatibility).
> 
> Neither of those approaches seemed ideal, hence we are wondering
> whether we really need to implement this backward compatibility
> behavior in the first place.
> 
> Neil
> 
> On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:
>> If I understood correctly, the proposal is to not kill the tasks for
>> non-partition aware frameworks? That seems like a pretty big change for
>> frameworks that are not partition aware and expect the old killing
>> semantics.
>> 
>> It seems like we should just directly fix the issue, do you have a sense of
>> what the difficulty is there? Is it the re-use of the existing framework
>> shutdown message to kill the tasks that makes this problematic?
>> 
>> On Fri, May 26, 2017 at 3:19 PM, Megha Sharma  wrote:
>>> 
>>> Hi All,
>>> 
>>> We are working on fixing a potential issue, MESOS-7215, with partition
>>> awareness, which happens when an unreachable agent with tasks for
>>> non-partition-aware frameworks attempts to re-register with the master.
>>> Before the support for partition-aware frameworks, introduced in
>>> Mesos 1.1.0 (MESOS-5344), if an agent partitioned from the master
>>> attempted to re-register, it would be shut down and all the tasks on the
>>> agent would be terminated. With this feature, partitioned agents were no
>>> longer shut down by the master when they re-registered, but to keep the
>>> old behavior the tasks on these agents were still shut down if the
>>> corresponding framework didn’t opt in to partition awareness.
>>> 
>>> One possible solution to the issue mentioned in MESOS-7215 is to change
>>> the master’s behavior to not kill the tasks of non-partition-aware
>>> frameworks when an unreachable agent re-registers with the master. When
>>> an agent goes unreachable, i.e., fails the master’s health-check pings
>>> for max_agent_ping_timeouts, the master sends TASK_LOST status updates
>>> for all the tasks on this agent that were launched by non-partition-aware
>>> frameworks. So, if such tasks are no longer killed by the master, then
>>> upon agent re-registration the frameworks will see non-terminal status
>>> updates for tasks for which they already received a TASK_LOST.
>>> This change will hopefully not break any schedulers, since it could have
>>> happened in the past with the non-strict registry as well, and schedulers
>>> are expected to be resilient enough to handle this scenario.

Re: RFC: Partition Awareness

2017-06-01 Thread Vinod Kone
On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:

> If I understood correctly, the proposal is to not kill the tasks for
> non-partition aware frameworks? That seems like a pretty big change for
> frameworks that are not partition aware and expect the old killing
> semantics.
>

Adding to what Neil said, I think most (if not all) non-PA frameworks
would've already rescheduled the task after seeing a TASK_LOST. The
difference is that previously such tasks can come back to TASK_RUNNING iff
master fails over and non-strict registry (default) is used. Now, we are
saying tasks can come back to TASK_RUNNING irrespective of master failover.
The assumption/hope is that this shouldn't break existing frameworks
in a catastrophic way.
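As an illustration of that rescheduling pattern, here is a minimal, hypothetical sketch (plain Python, not the real scheduler API) of a non-PA scheduler that reschedules on TASK_LOST and then kills the duplicate if the "lost" task resurfaces:

```python
# Hypothetical scheduler bookkeeping -- not the Mesos scheduler API.
# A non-partition-aware scheduler reacts to TASK_LOST by rescheduling;
# if the "lost" task later resurfaces with a non-terminal update, a
# replacement is already running, so the duplicate should be killed.

TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_ERROR"}

class Scheduler:
    def __init__(self):
        self.replaced = set()    # task IDs rescheduled after TASK_LOST
        self.kill_requests = []  # kills we would send to the master

    def on_status_update(self, task_id, state):
        if state == "TASK_LOST":
            # Old-style reaction: assume the task is gone and reschedule.
            self.replaced.add(task_id)
            self.launch_replacement(task_id)
        elif state not in TERMINAL and task_id in self.replaced:
            # The "lost" task resurfaced after the agent re-registered;
            # kill the duplicate rather than run two copies.
            self.kill_requests.append(task_id)

    def launch_replacement(self, task_id):
        pass  # would accept an offer and launch a replacement task

sched = Scheduler()
sched.on_status_update("web-1", "TASK_LOST")
sched.on_status_update("web-1", "TASK_RUNNING")  # agent came back
assert sched.kill_requests == ["web-1"]
```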


Re: RFC: Partition Awareness

2017-06-01 Thread Neil Conway
Hi Ben,

The argument for changing the semantics is that correct frameworks
should _always_ have accounted for the possibility that TASK_LOST
tasks would go back to running (due to the non-strict registry
semantics). The proposed change would just increase the probability of
this behavior occurring. From a certain POV, this change would
actually make it easier to write correct frameworks because the
TASK_LOST scenario will be less of a corner case :)

Implementing the task-killing behavior is a bit tricky, because the
task might continue to run on the agent for a considerable period of
time. During that time, we can either:

(a) omit the being-killed task from the master's memory (current
behavior). That means that any resources used by the task appear to be
unused, so there might be a concurrent task launch that attempts to
use them and fails.

(b) track the being-killed task in the master's memory. This ensures
the task's resources are not re-offered until the task is actually
terminated. The concern here is that this "being-killed" task is in a
weird state -- what task status should it have? When it finally dies,
we don't want to report a terminal status update back to frameworks
(for backward compatibility).

Neither of those approaches seemed ideal, hence we are wondering
whether we really need to implement this backward compatibility
behavior in the first place.
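To make option (b) concrete, a toy model (invented names, plain Python, not the master's actual data structures) of how tracking the being-killed task keeps resource accounting honest without forwarding a terminal update:

```python
# Toy model of option (b): keep a task that is being killed in the
# master's memory so its resources are not re-offered until the agent
# reports it terminal. Names here are illustrative, not Mesos internals.

class MasterModel:
    def __init__(self, agent_cpus):
        self.agent_cpus = agent_cpus
        self.tasks = {}              # task_id -> {"cpus": n, "killing": bool}
        self.framework_updates = []  # terminal updates surfaced to frameworks

    def track(self, task_id, cpus):
        self.tasks[task_id] = {"cpus": cpus, "killing": False}

    def begin_kill(self, task_id):
        # Option (b): mark rather than forget, so offers stay correct.
        self.tasks[task_id]["killing"] = True

    def offerable_cpus(self):
        return self.agent_cpus - sum(t["cpus"] for t in self.tasks.values())

    def on_terminal(self, task_id):
        # Suppress the terminal update for a being-killed task: the
        # framework already saw TASK_LOST for it (backward compatibility).
        if not self.tasks.pop(task_id)["killing"]:
            self.framework_updates.append(task_id)

m = MasterModel(agent_cpus=4)
m.track("t1", cpus=2)
m.begin_kill("t1")
assert m.offerable_cpus() == 2   # resources withheld while the kill is in flight
m.on_terminal("t1")
assert m.offerable_cpus() == 4
assert m.framework_updates == []  # no terminal update forwarded
```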

Neil

On Thu, Jun 1, 2017 at 2:22 PM, Benjamin Mahler  wrote:
> If I understood correctly, the proposal is to not kill the tasks for
> non-partition aware frameworks? That seems like a pretty big change for
> frameworks that are not partition aware and expect the old killing
> semantics.
>
> It seems like we should just directly fix the issue, do you have a sense of
> what the difficulty is there? Is it the re-use of the existing framework
> shutdown message to kill the tasks that makes this problematic?
>
> On Fri, May 26, 2017 at 3:19 PM, Megha Sharma  wrote:
>>
>> Hi All,
>>
>> We are working on fixing a potential issue, MESOS-7215, with partition
>> awareness, which happens when an unreachable agent with tasks for
>> non-partition-aware frameworks attempts to re-register with the master.
>> Before the support for partition-aware frameworks, introduced in
>> Mesos 1.1.0 (MESOS-5344), if an agent partitioned from the master
>> attempted to re-register, it would be shut down and all the tasks on the
>> agent would be terminated. With this feature, partitioned agents were no
>> longer shut down by the master when they re-registered, but to keep the
>> old behavior the tasks on these agents were still shut down if the
>> corresponding framework didn’t opt in to partition awareness.
>>
>> One possible solution to the issue mentioned in MESOS-7215 is to change
>> the master’s behavior to not kill the tasks of non-partition-aware
>> frameworks when an unreachable agent re-registers with the master. When
>> an agent goes unreachable, i.e., fails the master’s health-check pings
>> for max_agent_ping_timeouts, the master sends TASK_LOST status updates
>> for all the tasks on this agent that were launched by non-partition-aware
>> frameworks. So, if such tasks are no longer killed by the master, then
>> upon agent re-registration the frameworks will see non-terminal status
>> updates for tasks for which they already received a TASK_LOST.
>> This change will hopefully not break any schedulers, since it could have
>> happened in the past with the non-strict registry as well, and schedulers
>> are expected to be resilient enough to handle this scenario.
>>
>> For the proposed solution we wanted to get feedback from the community to
>> ensure that this change doesn’t break or cause any side effects for the
>> schedulers. Looking forward to any feedback/comments.
>>
>> Many Thanks
>> Megha
>>
>>
>


Re: RFC: Partition Awareness

2017-06-01 Thread Benjamin Mahler
If I understood correctly, the proposal is to not kill the tasks for
non-partition aware frameworks? That seems like a pretty big change for
frameworks that are not partition aware and expect the old killing
semantics.

It seems like we should just directly fix the issue, do you have a sense of
what the difficulty is there? Is it the re-use of the existing framework
shutdown message to kill the tasks that makes this problematic?

On Fri, May 26, 2017 at 3:19 PM, Megha Sharma  wrote:

> Hi All,
>
> We are working on fixing a potential issue, MESOS-7215, with partition
> awareness, which happens when an unreachable agent with tasks for
> non-partition-aware frameworks attempts to re-register with the master.
> Before the support for partition-aware frameworks, introduced in
> Mesos 1.1.0 (MESOS-5344), if an agent partitioned from the master
> attempted to re-register, it would be shut down and all the tasks on the
> agent would be terminated. With this feature, partitioned agents were no
> longer shut down by the master when they re-registered, but to keep the
> old behavior the tasks on these agents were still shut down if the
> corresponding framework didn’t opt in to partition awareness.
>
> One possible solution to the issue mentioned in MESOS-7215 is to change
> the master’s behavior to not kill the tasks of non-partition-aware
> frameworks when an unreachable agent re-registers with the master. When
> an agent goes unreachable, i.e., fails the master’s health-check pings
> for max_agent_ping_timeouts, the master sends TASK_LOST status updates
> for all the tasks on this agent that were launched by non-partition-aware
> frameworks. So, if such tasks are no longer killed by the master, then
> upon agent re-registration the frameworks will see non-terminal status
> updates for tasks for which they already received a TASK_LOST.
> This change will hopefully not break any schedulers, since it could have
> happened in the past with the non-strict registry as well, and schedulers
> are expected to be resilient enough to handle this scenario.
>
> For the proposed solution we wanted to get feedback from the community to
> ensure that this change doesn’t break or cause any side effects for the
> schedulers. Looking forward to any feedback/comments.
>
> Many Thanks
> Megha
>
>
>


RFC: Partition Awareness

2017-05-26 Thread Megha Sharma
Hi All,

We are working on fixing a potential issue, MESOS-7215, with partition
awareness, which happens when an unreachable agent with tasks for
non-partition-aware frameworks attempts to re-register with the master.
Before the support for partition-aware frameworks, introduced in Mesos 1.1.0
(MESOS-5344), if an agent partitioned from the master attempted to
re-register, it would be shut down and all the tasks on the agent would be
terminated. With this feature, partitioned agents were no longer shut down
by the master when they re-registered, but to keep the old behavior the
tasks on these agents were still shut down if the corresponding framework
didn’t opt in to partition awareness.

One possible solution to the issue mentioned in MESOS-7215 is to change the
master’s behavior to not kill the tasks of non-partition-aware frameworks
when an unreachable agent re-registers with the master. When an agent goes
unreachable, i.e., fails the master’s health-check pings for
max_agent_ping_timeouts, the master sends TASK_LOST status updates for all
the tasks on this agent that were launched by non-partition-aware
frameworks. So, if such tasks are no longer killed by the master, then upon
agent re-registration the frameworks will see non-terminal status updates
for tasks for which they already received a TASK_LOST.
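For a rough sense of the timing involved, the unreachable transition is driven by two master flags. The numbers below assume the defaults at the time (--agent_ping_timeout of 15secs and --max_agent_ping_timeouts of 5), so the figure is illustrative:

```python
# Back-of-the-envelope for when the master marks an agent unreachable.
# Assumes the default master flags; actual deployments may differ.

agent_ping_timeout_secs = 15   # --agent_ping_timeout default
max_agent_ping_timeouts = 5    # --max_agent_ping_timeouts default

# After this many seconds without a pong, the agent is marked unreachable
# and TASK_LOST is sent for tasks of non-partition-aware frameworks:
unreachable_after = agent_ping_timeout_secs * max_agent_ping_timeouts
assert unreachable_after == 75
```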
This change will hopefully not break any schedulers, since it could have
happened in the past with the non-strict registry as well, and schedulers
are expected to be resilient enough to handle this scenario.

For the proposed solution we wanted to get feedback from the community to 
ensure that this change doesn’t break or cause any side effects for the 
schedulers. Looking forward to any feedback/comments.

Many Thanks
Megha