[jira] [Commented] (MESOS-5067) Killing a framework does not kill framework tasks

2016-04-04 Thread Guillermo Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225691#comment-15225691
 ] 

Guillermo Rodriguez commented on MESOS-5067:


OK, so I modified Swarm to specify a framework ID. Mesos now reassigns the 
tasks to the restarted framework, and everybody is happy!

Thanks!
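
One way to make a restarted framework reuse its ID is to persist the ID that Mesos assigned on first registration and read it back on startup. The following is a minimal sketch of that idea, not how Swarm actually does it; the helper names and the file path passed in are hypothetical.

```python
import os
import tempfile


def load_framework_id(path):
    """Return the previously saved framework ID, or None on first start."""
    try:
        with open(path, encoding="utf-8") as f:
            value = f.read().strip()
    except FileNotFoundError:
        return None
    return value or None


def save_framework_id(path, framework_id):
    """Write the ID to a temp file, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(framework_id)
    # os.replace swaps the file in one step, so a crash mid-write
    # does not leave a half-written ID behind.
    os.replace(tmp_path, path)
```

On startup the framework would call `load_framework_id` and, if it returns a value, pass that value when re-registering with the master; after a first registration it would call `save_framework_id` with the ID it was given.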


> Killing a framework does not kill framework tasks
> -
>
> Key: MESOS-5067
> URL: https://issues.apache.org/jira/browse/MESOS-5067
> Project: Mesos
> Issue Type: Wish
> Reporter: Guillermo Rodriguez
>
> By default, when a framework is terminated, mesos-master terminates all child 
> tasks for that framework.
> There are some cases where I would like to stop a framework but not kill its 
> tasks.
> In my particular case, I have Docker Swarm running. Swarm allows me to send 
> number-crunching jobs to the cluster, and they can run for hours.
> The problem is that Swarm is also quite flaky and can crash at any time. If that 
> happens, all jobs are terminated and all the processing time is lost.
> So I would like to be able to set a flag on a framework that tells the 
> Mesos master that the jobs started by the framework should be considered 
> separate from the framework itself, so that the framework can be restarted and 
> the jobs keep running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5067) Killing a framework does not kill framework tasks

2016-03-30 Thread Guillermo Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219323#comment-15219323
 ] 

Guillermo Rodriguez commented on MESOS-5067:


OK, so I implemented the failover. The problem now is that the launcher 
is Marathon, and Marathon appends a random string to the name. In other 
words, I will never get the same ID again.

Still, I have set the failover timeout to 3 days, so tasks will continue running. 
Partial fix.



[jira] [Commented] (MESOS-5067) Killing a framework does not kill framework tasks

2016-03-30 Thread Guillermo Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219200#comment-15219200
 ] 

Guillermo Rodriguez commented on MESOS-5067:


Interesting. Maybe that gives another Swarm instance enough time to recover the 
jobs. Let me try.



[jira] [Commented] (MESOS-5067) Killing a framework does not kill framework tasks

2016-03-30 Thread Guillermo Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219197#comment-15219197
 ] 

Guillermo Rodriguez commented on MESOS-5067:


I can ask Swarm, but on another node.

The problem I have right now is that I run 4 Swarm services for HA, so Docker 
tasks can be started on any Swarm instance. MesosDNS resolves 
swarm.marathon.mesos to any of the Swarm services. All OK.

Any Swarm service can see any running task on any node, so each of them can 
answer with status, info, etc. Perfect.

My system gets the Docker task UID on startup and can ask any of the Swarm 
services about the task status. Awesome.

But then one service crashes, and every task that was registered or started on 
that particular framework is killed. So essentially, if one Swarm instance 
crashes I lose 25% of my running tasks. Given that Swarm can crash at any time 
(or Marathon restarts it because it's unresponsive for too long), I lose any 
progress in my number crunching.

So I was wondering whether I could launch tasks from a framework without 
specifying the framework ID, for example. Or something like that.



[jira] [Commented] (MESOS-5067) Killing a framework does not kill framework tasks

2016-03-30 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219182#comment-15219182
 ] 

Anand Mazumdar commented on MESOS-5067:
---

Are you running Docker Swarm as a Mesos framework? If so, why don't you set a 
higher failover timeout for it, e.g. a week? 
https://github.com/apache/mesos/blob/master/include/mesos/v1/mesos.proto#L223

That way, even if your Swarm framework instance crashes, it can reconnect with 
Mesos using the same {{FrameworkId}}, and all of its tasks will still be intact. 
If you want to kill all the tasks, you can explicitly tear down the framework.
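
As a rough illustration of the suggestion above, the sketch below builds the JSON body of a SUBSCRIBE call with a one-week failover timeout and, on reconnect, the previously assigned framework ID. The field layout follows the v1 scheduler HTTP API as best recalled and should be checked against the linked proto file; the function name and the example name, user, and ID values are placeholders, and nothing is sent to a master.

```python
import json

# failover_timeout is expressed in seconds; one week is 604800.
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


def build_subscribe_call(name, user, framework_id=None,
                         failover_timeout=ONE_WEEK_SECONDS):
    """Build a SUBSCRIBE payload; pass framework_id when reconnecting."""
    framework_info = {
        "user": user,
        "name": name,
        "failover_timeout": float(failover_timeout),
    }
    call = {
        "type": "SUBSCRIBE",
        "subscribe": {"framework_info": framework_info},
    }
    if framework_id is not None:
        # Reusing the same ID is what lets the master hand the
        # still-running tasks back to the restarted framework.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


if __name__ == "__main__":
    # "swarm", "root" and the ID are made-up example values.
    call = build_subscribe_call("swarm", "root", "example-framework-id")
    print(json.dumps(call, indent=2))
```

A first registration would omit `framework_id`; the master then assigns one, which the framework has to remember for later reconnects within the failover window.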



[jira] [Commented] (MESOS-5067) Killing a framework does not kill framework tasks

2016-03-30 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219160#comment-15219160
 ] 

Klaus Ma commented on MESOS-5067:
-

If so, how do you get tasks back?
