[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899618#comment-15899618
 ] 

Neil Conway commented on MESOS-7215:
------------------------------------

Updated the summary to reflect what I believe is the root issue here: we _are_ 
shutting down the framework on the agent (which isn't wrong), but the shutdown 
of the framework on that agent interferes with attempts to launch new framework 
tasks on the same agent.

Sending {{KillTaskMessage}} instead makes sense to me. [~xujyan], if you have 
cycles to take a look, I'd be happy to shepherd; otherwise I'll fix it 
myself.

> Race condition on re-registration of non-partition-aware frameworks
> -------------------------------------------------------------------
>
>                 Key: MESOS-7215
>                 URL: https://issues.apache.org/jira/browse/MESOS-7215
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>            Priority: Critical
>
> Prior to the partition-awareness work (MESOS-5344), when an agent 
> reregistered after having been removed, the master only sent 
> {{ShutdownFrameworkMessage}}s to the agent for frameworks that it knew had 
> been torn down. 
> With the new logic in MESOS-5344, Mesos now sends 
> {{ShutdownFrameworkMessage}}s to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered).
> This is problematic. Offers from this agent can still go to the same 
> framework, which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down, of course, so from the 
> perspective of the master and the scheduler the task stays in STAGING 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of {{ShutdownFrameworkMessage}} ambiguous: the 
> agent assumes the framework is going away (and acts accordingly) when 
> it is not.
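
The race described above can be illustrated with a toy model. This is only a 
sketch, not Mesos code: {{ToyAgent}}, its methods, and the {{reregister}} helper 
are hypothetical names, and the agent's "terminating" bookkeeping is reduced to 
a single set that is never cleared. It contrasts what happens to a newly 
launched task when the master sends a framework shutdown versus a per-task kill 
on agent reregistration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical, heavily simplified stand-in for a Mesos agent."""
    # framework id -> set of task ids the agent is running
    tasks: dict = field(default_factory=dict)
    # frameworks the agent believes are shutting down (simplification:
    # never cleared in this sketch)
    terminating: set = field(default_factory=set)

    def shutdown_framework(self, framework_id):
        # Models ShutdownFrameworkMessage: drop all tasks and mark the
        # whole framework as going away.
        self.tasks.pop(framework_id, None)
        self.terminating.add(framework_id)

    def kill_task(self, framework_id, task_id):
        # Models KillTaskMessage: only the named task is affected.
        self.tasks.get(framework_id, set()).discard(task_id)

    def run_task(self, framework_id, task_id):
        # Launch requests for a framework considered terminating are ignored.
        if framework_id in self.terminating:
            return False
        self.tasks.setdefault(framework_id, set()).add(task_id)
        return True


def reregister(agent, use_kill_task):
    """Agent reregisters with one old task, then a new task is launched."""
    agent.tasks = {"fw": {"old-task"}}
    if use_kill_task:
        agent.kill_task("fw", "old-task")
    else:
        agent.shutdown_framework("fw")
    # The still-registered framework accepts an offer and launches a task.
    return agent.run_task("fw", "new-task")


shutdown_result = reregister(ToyAgent(), use_kill_task=False)
kill_result = reregister(ToyAgent(), use_kill_task=True)
print(shutdown_result, kill_result)  # prints "False True"
```

In the shutdown case the launch is silently dropped, which corresponds to the 
task staying in STAGING from the master's point of view; in the kill-task case 
the framework's state on the agent is untouched and the new task is accepted.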



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)