[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303581#comment-15303581 ]
Jay Guo commented on MESOS-5468:
--------------------------------

[~anandmazumdar] The socket is NOT successfully closed and is still left in ESTABLISHED (as can be observed with {{netstat}}). I also suspect it somehow happens before the master explicitly issues the close. Here's the log:

{code:title=master.log}
E0527 05:48:45.564194 13105 process.cpp:2033] Failed to shutdown socket with fd 33: Transport endpoint is not connected
I0527 05:48:45.573005 13101 master.cpp:1383] Framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)) disconnected
I0527 05:48:45.573212 13101 master.cpp:2792] Disconnecting framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.573431 13101 master.cpp:2816] Deactivating framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
W0527 05:48:45.574806 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.575145 13100 hierarchical.cpp:375] Deactivated framework 61100b89-f964-4aa2-b084-e1089d205b83-0000
W0527 05:48:45.580201 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)): connection closed
W0527 05:48:45.581838 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
W0527 05:48:45.582034 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)): connection closed
W0527 05:48:45.583015 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
W0527 05:48:45.583124 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)): connection closed
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)) 0ns to failover
I0527 05:48:45.585503 13102 master.cpp:5516] Framework failover timeout, removing framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.585793 13102 master.cpp:6246] Removing framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.588471 13102 master.cpp:6761] Updating the state of task 2 of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (latest state: TASK_FINISHED, status update state: TASK_KILLED)
I0527 05:48:45.589534 13102 master.cpp:6827] Removing task 2 with resources cpus(*):0.001; mem(*):1 of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
I0527 05:48:45.590454 13102 master.cpp:6856] Removing executor 'default' with resources cpus(*):0.1; mem(*):32 of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
I0527 05:48:45.592897 13100 hierarchical.cpp:326] Removed framework 61100b89-f964-4aa2-b084-e1089d205b83-0000
W0527 05:48:50.662726 13098 master.cpp:5199] Ignoring unknown exited executor 'default' of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
{code}

The build is not super fresh (within 1 week), so you may find that the line numbers are not consistent with the latest code.

> Add logic in long-lived-framework to handle network partitions.
> ---------------------------------------------------------------
>
> Key: MESOS-5468
> URL: https://issues.apache.org/jira/browse/MESOS-5468
> Project: Mesos
> Issue Type: Task
> Components: framework, master
> Reporter: Jay Guo
>
> Currently long-lived-framework does not handle network partitions, i.e., it does not explicitly try to {{reconnect}} with the master upon not receiving {{HEARTBEAT}} events for a prolonged amount of time. If the master disconnects a framework without the framework being aware of it (one-way partition), the framework should explicitly issue a {{reconnect}} request via the scheduler library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when tearing down a framework? Currently the TCP socket is left alive even after the framework has been deactivated. This results in the framework sending an invalid {{Call}} to the master and in re-detection.
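
To make the proposal in the description concrete, here is a minimal, language-neutral sketch of the heartbeat-timeout logic, written in Python for brevity. It is not the long-lived-framework code and does not use the Mesos scheduler library: the class name {{HeartbeatWatchdog}}, the {{max_missed}} tolerance, and the {{reconnect}} callback are all hypothetical stand-ins, and the 15-second interval in the usage lines is just an example value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HeartbeatWatchdog:
    """Tracks HEARTBEAT arrivals and decides when to request a reconnect."""

    # Interval between HEARTBEAT events, assumed to be known to the framework.
    heartbeat_interval_secs: float
    # Hypothetical stand-in for whatever call the scheduler library offers.
    reconnect: Callable[[], None]
    # Hypothetical tolerance: how many intervals of silence are acceptable.
    max_missed: int = 5
    # Timestamp (seconds) of the most recent HEARTBEAT.
    last_heartbeat: float = 0.0

    def on_heartbeat(self, now: float) -> None:
        """Record that a HEARTBEAT event arrived at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Call periodically; returns True if a reconnect was requested."""
        silence = now - self.last_heartbeat
        if silence <= self.heartbeat_interval_secs * self.max_missed:
            return False
        self.reconnect()
        # Restart the window so that one outage does not trigger a
        # reconnect request on every subsequent check.
        self.last_heartbeat = now
        return True


# Usage: 15s interval, so more than 75s of silence triggers a reconnect.
calls = []
watchdog = HeartbeatWatchdog(15.0, lambda: calls.append("reconnect"))
watchdog.on_heartbeat(now=100.0)
watchdog.check(now=130.0)  # 30s of silence: within tolerance
watchdog.check(now=176.0)  # 76s of silence: reconnect requested
```

Passing the current time in explicitly keeps the logic independent of any particular timer or event loop, which is a design choice of this sketch rather than something the issue prescribes.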