[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303581#comment-15303581
 ] 

Jay Guo commented on MESOS-5468:
--------------------------------

[~anandmazumdar]
The socket is NOT closed successfully and is still left in the ESTABLISHED 
state (as can be observed with {{netstat}}). I suspect this somehow happens 
before the master explicitly issues the close. Here's the log:
{code:title=master.log}
E0527 05:48:45.564194 13105 process.cpp:2033] Failed to shutdown socket with fd 33: Transport endpoint is not connected
I0527 05:48:45.573005 13101 master.cpp:1383] Framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)) disconnected
I0527 05:48:45.573212 13101 master.cpp:2792] Disconnecting framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.573431 13101 master.cpp:2816] Deactivating framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
W0527 05:48:45.574806 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.575145 13100 hierarchical.cpp:375] Deactivated framework 61100b89-f964-4aa2-b084-e1089d205b83-0000
W0527 05:48:45.580201 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)): connection closed
W0527 05:48:45.581838 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
W0527 05:48:45.582034 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)): connection closed
W0527 05:48:45.583015 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
W0527 05:48:45.583124 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)): connection closed
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++)) 0ns to failover
I0527 05:48:45.585503 13102 master.cpp:5516] Framework failover timeout, removing framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.585793 13102 master.cpp:6246] Removing framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (Long Lived Framework (C++))
I0527 05:48:45.588471 13102 master.cpp:6761] Updating the state of task 2 of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 (latest state: TASK_FINISHED, status update state: TASK_KILLED)
I0527 05:48:45.589534 13102 master.cpp:6827] Removing task 2 with resources cpus(*):0.001; mem(*):1 of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
I0527 05:48:45.590454 13102 master.cpp:6856] Removing executor 'default' with resources cpus(*):0.1; mem(*):32 of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
I0527 05:48:45.592897 13100 hierarchical.cpp:326] Removed framework 61100b89-f964-4aa2-b084-e1089d205b83-0000
W0527 05:48:50.662726 13098 master.cpp:5199] Ignoring unknown exited executor 'default' of framework 61100b89-f964-4aa2-b084-e1089d205b83-0000 on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
{code}
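
For reference, "Transport endpoint is not connected" is the usual Linux text for {{ENOTCONN}}, which {{shutdown(2)}} returns when the socket has no connected peer. The following is only a minimal sketch of that behavior and of closing the descriptor regardless of the shutdown result. {{ShutdownResult}} and {{shutdownAndClose}} are hypothetical names made up for illustration; this is not how libprocess or the master is actually implemented.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical result type, for illustration only.
struct ShutdownResult {
  int shutdownErrno;  // 0 if shutdown() succeeded, otherwise the errno it set.
  bool closed;        // Whether close() released the descriptor.
};

// Hypothetical helper: attempt an orderly shutdown, then close the
// descriptor even if the shutdown failed (e.g. with ENOTCONN).
ShutdownResult shutdownAndClose(int fd) {
  ShutdownResult result{0, false};

  if (::shutdown(fd, SHUT_RDWR) != 0) {
    result.shutdownErrno = errno;  // ENOTCONN if there is no connected peer.
  }

  // A failed shutdown() does not release the descriptor; close() does.
  result.closed = (::close(fd) == 0);
  return result;
}
```

Calling this on a freshly created, never-connected TCP socket should report {{ENOTCONN}} from the shutdown step while the close still succeeds. It does not by itself explain why the connection in this report stays in ESTABLISHED.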

The build is not super fresh (built within the last week), so the line numbers 
may not match the latest code.

> Add logic in long-lived-framework to handle network partitions.
> ---------------------------------------------------------------
>
>                 Key: MESOS-5468
>                 URL: https://issues.apache.org/jira/browse/MESOS-5468
>             Project: Mesos
>          Issue Type: Task
>          Components: framework, master
>            Reporter: Jay Guo
>
> Currently, long-lived-framework does not handle network partitions, i.e., it 
> does not explicitly try to {{reconnect}} with the master after receiving no 
> {{HEARTBEAT}} events for a prolonged period of time. If the master 
> disconnects a framework without the framework being aware of it (one-way 
> partition), the framework should explicitly issue a {{reconnect}} request via 
> the scheduler library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when 
> tearing down a framework? Currently the TCP socket is left alive even after 
> the framework has been deactivated. This results in the framework sending 
> invalid {{Call}}s to the master and in re-detection.
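
The reconnect behavior described in the issue could be sketched as a small watchdog that records when the last event arrived and reports when the master has been silent for too long. This is only an illustrative sketch: {{HeartbeatWatchdog}}, {{onEvent}}, {{shouldReconnect}}, and the "number of missed intervals" policy are assumptions made here, not part of the Mesos scheduler library. The caller is assumed to trigger the actual {{reconnect}} itself.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not part of the Mesos scheduler library.
class HeartbeatWatchdog {
public:
  using Clock = std::chrono::steady_clock;

  // `interval` is the expected heartbeat interval; `missedLimit` is how
  // many intervals may pass without any event before we give up
  // (an assumed policy, chosen for illustration).
  HeartbeatWatchdog(std::chrono::milliseconds interval,
                    int missedLimit,
                    Clock::time_point now)
    : timeout_(interval * missedLimit), lastSeen_(now) {
    assert(interval.count() > 0 && missedLimit > 0);
  }

  // Call whenever a HEARTBEAT (or any other event) arrives from the master.
  void onEvent(Clock::time_point now) { lastSeen_ = now; }

  // True once the master has been silent for longer than the timeout.
  // The caller would then request a reconnect.
  bool shouldReconnect(Clock::time_point now) const {
    return now - lastSeen_ > timeout_;
  }

private:
  std::chrono::milliseconds timeout_;
  Clock::time_point lastSeen_;
};
```

Passing the current time in explicitly keeps the check deterministic and easy to test; in a framework it would presumably be driven by a periodic timer that calls {{shouldReconnect(Clock::now())}}.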



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
