[ https://issues.apache.org/jira/browse/MESOS-7181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939922#comment-15939922 ]
Yan Xu commented on MESOS-7181:
-------------------------------

Yeah, so I meant that on the receiving end the process manager doesn't know whether an actor is being {{link}}ed or not, so it has to send {{TargetPIDExited}} in all situations; this is different from the current local {{PID}} behavior.

Also, this message is sent not when the actor dies but when a message arrives, so I guess if a framework dies while it's suppressed and has no pending status updates, the master will not find out about it because it doesn't send the framework any messages?

Perhaps we can have a {{Link}} message sent to the linkee, based on which it can send a special {{Exited}} message to the sender when the actor terminates?

> Stale frameworks seen on Mesos, but not known to scheduler
> ----------------------------------------------------------
>
>                 Key: MESOS-7181
>                 URL: https://issues.apache.org/jira/browse/MESOS-7181
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>            Reporter: Anindya Sinha
>            Assignee: Anindya Sinha
>
> Using a scheduler which launches multiple frameworks using the scheduler
> driver, we occasionally observe a framework on Mesos which is not known
> to the scheduler. Since there is no entity that acts on the offers, this
> framework ends up hogging all the offers, leading to starvation in the cluster.
> The particular scenario is as follows:
> 1) Scheduler does a driver.start(), which results in the 1st SUBSCRIBE being
> sent to the master.
> 2) The scheduler driver resends the SUBSCRIBE (since the framework has not
> yet registered) as a result of the exponential backoff.
> 3) Framework is registered based on the 1st SUBSCRIBE, but the scheduler
> issues a driver.stop() immediately, which results in a TEARDOWN being sent to
> the master.
> 4) Master processes the TEARDOWN, which removes the framework.
> 5) Master now processes the 2nd SUBSCRIBE (after authorization) and tries to
> add this framework. This succeeds and a new framework id is generated (since
> the original framework is no longer registered after the TEARDOWN), but the
> scheduler driver has already terminated by then, because the scheduler issued
> the driver.stop(). So the master continues to send offers to this 2nd
> framework, which holds on to them until the offer timeout.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
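
For illustration only, the ordering in steps 1) to 5) of the issue description can be replayed with a toy model. This is a rough sketch, not Mesos code: the `Master` class, its `subscribe`/`teardown` methods, the `framework-N` id format, and matching schedulers by name are all hypothetical stand-ins chosen to show why the late SUBSCRIBE ends up creating a second framework.

```python
import itertools


class Master:
    """Toy stand-in for the master's framework registry (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.frameworks = {}  # framework id -> scheduler name

    def subscribe(self, scheduler):
        # A retried SUBSCRIBE from an already registered scheduler
        # returns the existing framework id instead of creating a new one.
        for fid, name in self.frameworks.items():
            if name == scheduler:
                return fid
        fid = f"framework-{next(self._ids)}"
        self.frameworks[fid] = scheduler
        return fid

    def teardown(self, fid):
        # Removing the framework also forgets that this scheduler
        # ever subscribed, which is what lets the retry through.
        self.frameworks.pop(fid, None)


def replay():
    """Replay the message order from the issue description."""
    master = Master()
    first = master.subscribe("scheduler-A")   # 1st SUBSCRIBE is processed
    master.teardown(first)                    # TEARDOWN from driver.stop()
    second = master.subscribe("scheduler-A")  # retried SUBSCRIBE arrives late
    return first, second, dict(master.frameworks)


if __name__ == "__main__":
    first, second, remaining = replay()
    print(first, second, remaining)
```

In this model the second id differs from the first, and the registry still holds one framework after the scheduler has stopped, which corresponds to the stale framework described in the issue.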