[jira] [Commented] (MESOS-7181) Stale frameworks seen on Mesos, but not known to scheduler

Joseph Wu (JIRA) Mon, 20 Mar 2017 12:36:00 -0700

    [ https://issues.apache.org/jira/browse/MESOS-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933368#comment-15933368 ]


Joseph Wu commented on MESOS-7181:
----------------------------------

I think this case deserves a fix at the libprocess level, rather than in the 
master.

We basically have two OS processes running libprocess, each hosting one or 
more actors.  When we send a message to a remote UPID 
({{<actor-id>@IP:port}}) whose actor has exited, the recipient libprocess 
simply drops the message.  Instead, the recipient could try notifying the 
sender that the message was dropped, which would cause an {{ExitedEvent}} in 
the sender.
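
To make the idea concrete, here is a toy model in Python, not libprocess code. The {{Endpoint}} class, its {{notify_on_drop}} flag and the {{exited_seen}} list are hypothetical names made up for this sketch; real libprocess would have to carry the notification over the wire and dispatch an {{ExitedEvent}} to the linked actors.

```python
# Toy model only: Endpoint, notify_on_drop and exited_seen are hypothetical
# names for illustration, not libprocess APIs.


class Endpoint:
    """Stands in for one OS process running libprocess with several actors."""

    def __init__(self, address, notify_on_drop):
        self.address = address
        # Whether this endpoint, as a recipient, reports dropped messages.
        self.notify_on_drop = notify_on_drop
        self.actors = {}       # actor id -> list of delivered messages
        self.exited_seen = []  # remote UPIDs this endpoint learned have exited

    def spawn(self, actor_id):
        self.actors[actor_id] = []

    def terminate(self, actor_id):
        del self.actors[actor_id]

    def receive(self, actor_id, message):
        """Return True if delivered, False if the target actor is gone."""
        mailbox = self.actors.get(actor_id)
        if mailbox is None:
            return False  # current behavior: the message is silently dropped
        mailbox.append(message)
        return True

    def send(self, remote, actor_id, message):
        delivered = remote.receive(actor_id, message)
        if not delivered and remote.notify_on_drop:
            # Proposed behavior: the recipient reports the drop and the
            # sender turns that report into an "exited" notification.
            self.exited_seen.append(f"{actor_id}@{remote.address}")


master = Endpoint("10.0.0.1:5050", notify_on_drop=False)
scheduler = Endpoint("10.0.0.2:40000", notify_on_drop=True)

scheduler.spawn("scheduler-1")
scheduler.terminate("scheduler-1")               # the driver has stopped
master.send(scheduler, "scheduler-1", "offers")  # would otherwise vanish
print(master.exited_seen)
```

With {{notify_on_drop=False}} on the recipient, the same send leaves {{exited_seen}} empty, which corresponds to today's behavior where the master never learns that the scheduler is gone.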

> Stale frameworks seen on Mesos, but not known to scheduler
> ----------------------------------------------------------
>
>                 Key: MESOS-7181
>                 URL: https://issues.apache.org/jira/browse/MESOS-7181
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> When a scheduler launches multiple frameworks using the scheduler driver, 
> we occasionally observe a framework on Mesos that is not known to the 
> scheduler. Since no entity acts on the offers, this framework ends up 
> hogging all the offers, leading to starvation in the cluster.
> The scenario is as follows:
> 1) The scheduler calls driver.start(), which results in the 1st SUBSCRIBE 
> being sent to the master.
> 2) The scheduler driver resends the SUBSCRIBE (since the framework has not 
> yet registered) as a result of the exponential backoff.
> 3) The framework is registered based on the 1st SUBSCRIBE, but the scheduler 
> immediately calls driver.stop(), which results in a TEARDOWN being sent to 
> the master.
> 4) The master processes the TEARDOWN, which removes the framework.
> 5) The master now processes the 2nd SUBSCRIBE (after authorization) and tries 
> to add this framework. This succeeds and a new framework ID is generated 
> (since the original framework is no longer registered after the TEARDOWN), 
> but the scheduler driver has already terminated because the scheduler 
> called driver.stop(). So the master continues to send offers to this 2nd 
> framework, which holds on to them until the offer timeout.
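
The ordering in steps 1) to 5) can be replayed with a small Python simulation. This is only a sketch: the {{Master}} class below is hypothetical and models nothing beyond the registry of frameworks, under the assumption that a SUBSCRIBE from an already-registered UPID is treated as a retry while a SUBSCRIBE from an unknown UPID gets a fresh framework ID.

```python
import itertools

# Hypothetical model of the master's framework registry; not Mesos code.


class Master:
    def __init__(self):
        self._ids = itertools.count(1)
        self.frameworks = {}  # framework id -> scheduler UPID

    def subscribe(self, upid):
        # Assumption: a SUBSCRIBE from a UPID that is still registered is a
        # retry and returns the existing framework id.
        for framework_id, known_upid in self.frameworks.items():
            if known_upid == upid:
                return framework_id
        # Otherwise the master registers a new framework.
        framework_id = f"framework-{next(self._ids)}"
        self.frameworks[framework_id] = upid
        return framework_id

    def teardown(self, framework_id):
        self.frameworks.pop(framework_id, None)


master = Master()
upid = "scheduler-1@10.0.0.2:40000"

first = master.subscribe(upid)   # steps 1 and 3: 1st SUBSCRIBE registers
master.teardown(first)           # step 4: TEARDOWN removes the framework
second = master.subscribe(upid)  # step 5: delayed 2nd SUBSCRIBE arrives

# The driver is already gone, yet the master still holds a framework.
print(first, second, master.frameworks)
```

Had the 2nd SUBSCRIBE been processed before the TEARDOWN, it would have matched the registered UPID and no second framework would exist; the stale framework only appears when the retry is processed after the removal.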



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
