[ https://issues.apache.org/jira/browse/MESOS-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933368#comment-15933368 ]

Joseph Wu commented on MESOS-7181:
----------------------------------

I think this case deserves a fix at the libprocess level, rather than in the master.

We basically have a case where there are two libprocess OS processes, each holding 1+ actors. When we send a message to a remote UPID and the actor (in {{<actor-id>@IP:port}}) has exited, the recipient libprocess simply drops the message. Instead, we could try notifying the sender of this message drop, which would cause an {{ExitedEvent}} in the sender.

> Stale frameworks seen on Mesos, but not known to scheduler
> ----------------------------------------------------------
>
>                 Key: MESOS-7181
>                 URL: https://issues.apache.org/jira/browse/MESOS-7181
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> Using a scheduler which launches multiple frameworks using the scheduler
> driver, we occasionally observe that a framework exists on Mesos which is not
> known to the scheduler. Since there is no entity that acts on the offers, this
> framework ends up hogging all the offers, leading to starvation in the cluster.
> This particular scenario is as follows:
> 1) Scheduler does a driver.start(), which results in the 1st SUBSCRIBE being
> sent to the master.
> 2) The scheduler driver resends the SUBSCRIBE (since the framework has not
> yet registered) as a result of the exponential backoff.
> 3) Framework is registered based on the 1st SUBSCRIBE, but the scheduler
> issues a driver.stop() immediately, which results in a TEARDOWN being sent to
> the master.
> 4) Master processes the TEARDOWN, which removes the framework.
> 5) Master now processes the 2nd SUBSCRIBE (after authorization) and tries to
> add this framework. This succeeds and a new framework id is generated (since
> the original framework is no longer registered after the TEARDOWN), but the
> scheduler driver has already terminated by now, since the scheduler issued
> the driver.stop().
> So the master continues to send offers to this 2nd framework, which holds on
> to them until the offer timeout.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
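
To make the ordering in steps 1) to 5) and the proposed drop notification easier to follow, here is a rough toy model in Python. It is only a sketch, not Mesos or libprocess code: ToyMaster, run_scenario, the notify_on_drop flag and the UPID string are all made up for illustration, and the real master does considerably more (authorization, failover handling and so on).

```python
from itertools import count


class ToyMaster:
    """Toy stand-in for the master; every name here is hypothetical."""

    def __init__(self, live_actors, notify_on_drop=False):
        self.live_actors = live_actors        # UPIDs whose actor is still running
        self.notify_on_drop = notify_on_drop  # models the proposed libprocess change
        self.frameworks = {}                  # framework id -> scheduler UPID
        self._next_id = count(1)

    def send(self, upid, message):
        """Send a message to a remote UPID; return True if it was delivered."""
        if upid in self.live_actors:
            return True
        # The actor has exited, so the recipient side drops the message.
        # With the proposed change, the sender is told about the drop.
        if self.notify_on_drop:
            self.exited(upid)
        return False

    def exited(self, upid):
        """Stand-in for handling an ExitedEvent: forget the UPID's frameworks."""
        self.frameworks = {
            fid: owner for fid, owner in self.frameworks.items() if owner != upid
        }

    def subscribe(self, upid):
        """Register a framework under a fresh id and reply to the scheduler."""
        framework_id = f"framework-{next(self._next_id)}"
        self.frameworks[framework_id] = upid
        self.send(upid, ("REGISTERED", framework_id))
        return framework_id

    def teardown(self, framework_id):
        self.frameworks.pop(framework_id, None)


def run_scenario(notify_on_drop):
    """Replay steps 1) to 5) and return the framework ids left on the master."""
    upid = "scheduler-1@10.0.0.1:40000"  # made-up UPID
    live_actors = {upid}
    master = ToyMaster(live_actors, notify_on_drop)

    # Steps 1-3: the 1st SUBSCRIBE registers the framework; a 2nd SUBSCRIBE
    # from the backoff is already in flight.
    first = master.subscribe(upid)
    # Step 3: driver.stop() sends TEARDOWN and the driver terminates.
    live_actors.discard(upid)
    # Step 4: the master processes the TEARDOWN.
    master.teardown(first)
    # Step 5: the master processes the late 2nd SUBSCRIBE.
    master.subscribe(upid)
    return sorted(master.frameworks)
```

In this model, run_scenario(notify_on_drop=False) leaves ["framework-2"] registered with nobody to act on its offers, while run_scenario(notify_on_drop=True) leaves no frameworks, because the dropped reply is reported back to the sender and the stale entry is removed.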