[jira] [Commented] (MESOS-7181) Stale frameworks seen on Mesos, but not known to scheduler

Joseph Wu (JIRA) Mon, 27 Feb 2017 11:40:23 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886380#comment-15886380
 ]


Joseph Wu commented on MESOS-7181:
----------------------------------

You'll need to fix this in your scheduler, by terminating the scheduler 
*process* after terminating the scheduler driver.

This case can only happen if the scheduler *process* outlives the scheduler 
*driver*.  From the master's perspective, a scheduler driver has approximately 
the same lifetime as the scheduler.  If you start 2+ scheduler drivers from a 
single scheduler process, the master will not notice the termination of any 
scheduler driver until the entire scheduler process is terminated.

If this does not work for you, then an alternative would be:
* Enable {{GLOG_v=2}} on your scheduler.
* Have an external program parse the logs for: {{Dropping event for process 
<PID of dead scheduler driver>}}.
* When you see that line, use the operator API to tear down the corresponding 
framework.
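
The workaround above could be sketched roughly as follows. This is only an 
illustration under assumptions, not a tested tool: the regular expression 
assumes the PID is the single token following the quoted log phrase, the 
{{pid_to_framework}} mapping is hypothetical (your scheduler would have to 
record which driver PID belongs to which framework id), and the request uses 
the v1 operator API {{TEARDOWN}} call on the master's {{/api/v1}} endpoint.

```python
import json
import re
import urllib.request

# Log phrase quoted in the comment above. Assumption: the token that follows
# it is the PID of the dead scheduler driver.
DROPPED = re.compile(r"Dropping event for process (\S+)")


def dead_driver_pids(lines):
    """Yield the PID token from every log line that reports a dropped event."""
    for line in lines:
        match = DROPPED.search(line)
        if match:
            yield match.group(1)


def build_teardown_request(master_url, framework_id):
    """Build (but do not send) a v1 operator API TEARDOWN request."""
    body = json.dumps({
        "type": "TEARDOWN",
        "teardown": {"framework_id": {"value": framework_id}},
    }).encode("utf-8")
    return urllib.request.Request(
        master_url.rstrip("/") + "/api/v1",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def _post(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


def teardown_stale(lines, pid_to_framework, master_url, send=_post):
    """Tear down the framework of every dead driver PID seen in the log.

    pid_to_framework is a hypothetical mapping maintained by the scheduler;
    entries are removed once handled so each framework is torn down once.
    """
    torn_down = []
    for pid in dead_driver_pids(lines):
        framework_id = pid_to_framework.pop(pid, None)
        if framework_id is None:
            continue  # unknown PID or already handled
        send(build_teardown_request(master_url, framework_id))
        torn_down.append(framework_id)
    return torn_down
```

In practice {{lines}} would be an open log file or a tail of it, and the 
request may also need credentials if your master has authentication enabled.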

> Stale frameworks seen on Mesos, but not known to scheduler
> ----------------------------------------------------------
>
>                 Key: MESOS-7181
>                 URL: https://issues.apache.org/jira/browse/MESOS-7181
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> With a scheduler that launches multiple frameworks through the scheduler 
> driver, we occasionally observe a framework on Mesos that is not known to 
> the scheduler. Since no entity acts on its offers, this framework ends up 
> hogging all the offers, leading to starvation in the cluster.
> This particular scenario is as follows:
> 1) Scheduler does a driver.start() which results in the 1st SUBSCRIBE sent to 
> master.
> 2) The scheduler driver resends the SUBSCRIBE (since the framework has not 
> yet registered) which is a result of the exponential backoff.
> 3) Framework is registered based on the 1st SUBSCRIBE, but the scheduler 
> issues a driver.stop() immediately which results in a TEARDOWN sent to the 
> master.
> 4) Master processes the TEARDOWN which removes the framework.
> 5) Master now processes the 2nd SUBSCRIBE (after authorization) and tries to 
> add this framework. This succeeds and a new framework id is generated (since 
> the original framework is no longer registered after the TEARDOWN), but the 
> scheduler driver has already terminated by then, because the scheduler 
> issued the driver.stop(). So, the master continues to send offers to this 
> 2nd framework, which holds on to them until the offer timeout.
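
The ordering in steps 1) to 5) can be reproduced with a toy model. This is 
only a sketch of the message ordering described in the issue, not Mesos code: 
{{ToyMaster}}, its fields, and the generated ids are all hypothetical, and 
authorization and backoff timing are left out.

```python
import itertools


class ToyMaster:
    """Toy stand-in for the master: registers and removes frameworks."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.frameworks = {}  # framework id -> driver name
        self.by_driver = {}   # driver name -> framework id

    def handle(self, message, driver):
        if message == "SUBSCRIBE":
            # An unknown driver gets a freshly generated framework id.
            if driver not in self.by_driver:
                framework_id = f"framework-{next(self._ids)}"
                self.by_driver[driver] = framework_id
                self.frameworks[framework_id] = driver
            return self.by_driver[driver]
        if message == "TEARDOWN":
            framework_id = self.by_driver.pop(driver, None)
            self.frameworks.pop(framework_id, None)
            return None
        raise ValueError(f"unknown message: {message}")


master = ToyMaster()
first = master.handle("SUBSCRIBE", "driver-A")   # steps 1 and 3: registered
master.handle("TEARDOWN", "driver-A")            # steps 3 and 4: driver.stop()
second = master.handle("SUBSCRIBE", "driver-A")  # step 5: the delayed retry

# The driver is gone, yet the master still tracks a framework for it.
print(first, second, master.frameworks)
```

After the last call the model still holds one framework whose id differs from 
the first one, which is the stale framework described in the issue.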



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
