[ https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483578#comment-15483578 ]
Neil Conway commented on MESOS-6136:
------------------------------------

Can you clarify which behavior you're referring to when you say that Mesos will "prevent a framework from rejoining given the inconsistent state of tasks"?

In my mind, a framework ID basically identifies a "framework session". A framework session is created when a framework registers for the first time (and doesn't provide an ID).

Current behavior:
* session continues until _either_ {{/teardown}} is used or the framework is disconnected for longer than {{failover_timeout}}
* to resume a session from a new connection, you just specify the framework ID when registering with the master.

Proposed change in behavior:
* session continues indefinitely until explicit {{/teardown}}
* either support for {{failover_timeout}} is deprecated/removed, or we just have an infinite {{failover_timeout}} by default, not sure.
* we look at enhancing the usability of {{/teardown}} or making it easier to identify/terminate tasks associated with orphan framework IDs, as needed.

> Duplicate framework id handling
> -------------------------------
>
>                 Key: MESOS-6136
>                 URL: https://issues.apache.org/jira/browse/MESOS-6136
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: DCOS 1.7 Cloud Formation scripts
>            Reporter: Christopher Hunt
>            Priority: Critical
>              Labels: framework, lifecyclemanagement, task
>
> We have observed a situation where Mesos will kill tasks belonging to a
> framework where that framework times out with the Mesos master for some
> reason, perhaps even because of a network partition.
> While we can provide a long timeout so that Mesos will not kill a framework's
> tasks for practical purposes, I'm wondering if there's an improvement where a
> framework shouldn't be permitted to re-register for a given id (as now), but
> Mesos doesn't also kill tasks? What I'm thinking is that Mesos could be
> "told" by an operator that this condition should be cleared.
> IMHO frameworks should be the only entity requesting that tasks be killed
> unless manually overridden by an operator.
> I'm flagging this as a critical improvement because a) the focus should be on
> keeping tasks running in a system, and it isn't; and b) Mesos is working as
> designed.
> In summary I feel that Mesos is taking on a responsibility in killing tasks
> where it shouldn't be.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)