[ https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246371#comment-15246371 ]

Greg Mann edited comment on MESOS-5180 at 4/18/16 8:38 PM:
-----------------------------------------------------------

We're currently running into this in a long-running cluster with Mesos and 
Marathon. The master logs show the moment when Marathon disconnects:
{code}
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393314 21960 
master.cpp:1275] Framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) 
at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 disconnected
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393350 21960 
master.cpp:2658] Disconnecting framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393373 21960 
master.cpp:2682] Deactivating framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393434 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393440 21958 
hierarchical.cpp:375] Deactivated framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393635 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393723 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393815 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393875 21960 
master.cpp:1299] Giving framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 
(marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 
1weeks to failover
{code}

However, the Marathon logs from around the same time give no indication that 
the scheduler has disconnected. The scheduler continues to receive task status 
updates but, as expected for a deactivated framework, no longer receives offers.

It would be great if the master's log messages could, where possible, include 
more information about the cause of the disconnection when it occurs.


> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with 
> the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master 
> changing.  (Currently, the scheduler driver will only re-register if the 
> master changes).
> If both links break or if just link (1) breaks, the master views the 
> framework as {{inactive}} and {{disconnected}}.  This means the framework 
> will not receive any more events (such as offers) from the master until it 
> re-registers.  There is currently no way for the scheduler to detect a 
> one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The 
> scheduler usually uses the link to send messages to the master, but 
> libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should 
> implement a {{::exited}} event handler for the master's {{pid}} and trigger a 
> master (re-)detection upon a disconnection. This in turn should make the 
> driver (re-)register with the master. The scheduler library already does 
> this: 
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.
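> Below is a minimal sketch, assuming a libprocess-based scheduler process, of 
> the {{::exited}}-handler pattern described above. It is illustrative only: 
> the {{SchedulerProcess}} shown here is not the actual driver code (the real 
> implementation lives at the link above), and the {{registered()}} hook, 
> {{master}} member, and {{redetectMaster()}} helper are hypothetical 
> placeholders.
> {code}
> // Illustrative sketch only -- not the actual Mesos scheduler driver.
> // Shows how a libprocess process can link to the master's pid and react
> // to a broken connection by triggering master re-detection.
> #include <glog/logging.h>
>
> #include <process/pid.hpp>
> #include <process/process.hpp>
>
> using process::UPID;
>
> class SchedulerProcess : public process::Process<SchedulerProcess>
> {
> protected:
>   // Hypothetical hook, called once the framework has registered.
>   void registered(const UPID& masterPid)
>   {
>     master = masterPid;
>     link(master);  // libprocess will invoke exited() if this link breaks.
>   }
>
>   // libprocess callback: invoked when a linked pid exits or its
>   // persistent socket is closed.
>   void exited(const UPID& pid) override
>   {
>     if (pid == master) {
>       LOG(WARNING) << "Lost connection to master " << master
>                    << "; triggering master re-detection";
>       redetectMaster();  // Hypothetical: re-run detection, then re-register.
>     }
>   }
>
> private:
>   void redetectMaster();  // e.g. detector->detect(...), then re-register.
>
>   UPID master;
> };
> {code}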


