[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-31 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309244#comment-15309244
 ] 

Jay Guo commented on MESOS-5468:


[~anandmazumdar] Sorry for the delay.
Of the two connections between the framework and the master, one is closed 
successfully, but the other is left ESTABLISHED when the master attempts to 
remove the framework. After the network rejoins, the master repeatedly denies 
the framework's subscription calls. So the question is: is the EVENT connection 
left open intentionally or accidentally?

Here's the full log:
{code:title=master.log}
I0601 12:12:03.671700  2252 master.cpp:5195] Status update TASK_FINISHED (UUID: 
e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- from agent 
edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:12:03.671931  2252 master.cpp:5243] Forwarding status update 
TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:12:03.672360  2252 master.cpp:6853] Updating the state of task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: 
TASK_FINISHED, status update state: TASK_FINISHED)
I0601 12:14:43.677433  2247 master.cpp:5195] Status update TASK_FINISHED (UUID: 
e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- from agent 
edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:14:43.677781  2247 master.cpp:5243] Forwarding status update 
TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:14:43.678387  2247 master.cpp:6853] Updating the state of task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: 
TASK_FINISHED, status update state: TASK_FINISHED)
I0601 12:20:03.679064  2251 master.cpp:5195] Status update TASK_FINISHED (UUID: 
e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- from agent 
edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:20:03.679194  2251 master.cpp:5243] Forwarding status update 
TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:20:03.679565  2251 master.cpp:6853] Updating the state of task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: 
TASK_FINISHED, status update state: TASK_FINISHED)
E0601 12:25:02.891707  2254 process.cpp:2040] Failed to shutdown socket with fd 
13: Transport endpoint is not connected
I0601 12:25:02.895753  2248 master.cpp:1388] Framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++)) 
disconnected
I0601 12:25:02.896077  2248 master.cpp:2822] Disconnecting framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
I0601 12:25:02.896289  2248 master.cpp:2846] Deactivating framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
W0601 12:25:02.896682  2248 master.hpp:1903] Master attempted to send message 
to disconnected framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived 
Framework (C++))
W0601 12:25:02.897027  2248 master.hpp:1909] Unable to send event to framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++)): 
connection closed
I0601 12:25:02.897341  2248 master.cpp:1401] Giving framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++)) 0ns to 
failover
I0601 12:25:02.896751  2249 hierarchical.cpp:375] Deactivated framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:25:02.901005  2251 master.cpp:5608] Framework failover timeout, 
removing framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived 
Framework (C++))
I0601 12:25:02.901053  2251 master.cpp:6338] Removing framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
I0601 12:25:02.901409  2251 master.cpp:6853] Updating the state of task 3 of 
framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: 
TASK_FINISHED, status update state: TASK_KILLED)
I0601 12:25:02.901449  2251 master.cpp:6919] Removing task 3 with resources 
cpus(*):0.001; mem(*):1 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- 
on agent edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 
(ubuntu)
I0601 12:25:02.901721  2251 master.cpp:6948] Removing executor 'default' with 
resources cpus(*):0.1; mem(*):32 of framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f- on agent 
edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:25:02.902426  2251 hierarchical.cpp:326] Removed framework 
e8288e1d-2c05-4e05-9db7-713a366f7f5f-
W0601 12:25:08.007905  2253 master.cpp:5291] Ignoring unknown exited executor 
'default' 
{code}

[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-27 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304360#comment-15304360
 ] 

Jay Guo commented on MESOS-5468:


What is your iptables command? I can consistently reproduce the problem on the 
latest build.

* How long does it take for the master to disconnect the framework after the 
network partition (i.e. after the iptables command is issued)?

* Do the TCP sockets go into the FIN_WAIT_1 state?

I think the point is: how does the master notice a network partition? IIUC, it 
relies on the TCP socket timeout, which is typically 13-30 minutes on a Linux 
box (see the tcp manpage), and that matches the delay I experienced between the 
disconnect and the give-up. At that point, the TCP socket informs the user 
(mesos-master) of the broken link while remaining ESTABLISHED. It is then up to 
the application to handle the failure, and I suspect that libprocess does not 
properly close the socket here. I'll need to investigate further.
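
For reference, the 13-30 minute figure is how the tcp manpage describes the 
default {{tcp_retries2}} value of 15. Below is a rough sketch of the 
arithmetic. It assumes the retransmission timeout starts at 200 ms, doubles 
after each retransmission and is capped at 120 s. Those two constants and the 
function name are assumptions for illustration; the real starting timeout 
depends on the measured round-trip time.

```python
def retransmit_give_up_seconds(retries=15, rto_min=0.2, rto_max=120.0):
    """Approximate time before TCP gives up on an unacknowledged segment.

    Assumes the retransmission timeout (RTO) starts at rto_min, doubles
    after every retransmission, and is capped at rto_max. The kernel
    derives the starting RTO from the measured RTT, so real values differ.
    """
    total = 0.0
    rto = rto_min
    # One wait after the initial send plus one wait per retransmission.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


if __name__ == "__main__":
    seconds = retransmit_give_up_seconds()
    print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")
```

With these assumptions the result sits at the low end of the manpage range. A 
larger starting timeout pushes it towards the upper end.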

I see other users hitting the {{Transport endpoint is not connected}} error, 
and I have seen it many times myself, so I think we should definitely take a 
serious look at it.

Another question: why didn't we use a mature HTTP library from the very 
beginning, instead of writing our own implementation?

Cheers,
/J

> Add logic in long-lived-framework to handle network partitions.
> ---
>
> Key: MESOS-5468
> URL: https://issues.apache.org/jira/browse/MESOS-5468
> Project: Mesos
> Issue Type: Task
> Components: framework, master
> Reporter: Jay Guo
>
> Currently long-lived-framework does not handle network partitions, i.e. it 
> does not explicitly try to {{reconnect}} with the master after receiving no 
> {{HEARTBEAT}} events for a prolonged amount of time. If the master 
> disconnects a framework without the framework being aware of it (one-way 
> partition), the framework should explicitly issue a {{reconnect}} request via 
> the scheduler library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when 
> tearing down a framework? Currently the TCP socket is left alive even after 
> the framework has been deactivated. This results in the framework sending 
> invalid {{Call}}s to the master and in re-detection.
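
The heartbeat check described above could be sketched roughly as follows. This 
is a minimal illustration, not the scheduler library's implementation: the 
class name, the {{missed_allowed}} threshold and the injectable clock are all 
hypothetical, and the heartbeat interval is assumed to be whatever the master 
announced when the framework subscribed.

```python
import time


class HeartbeatWatchdog:
    """Tracks HEARTBEAT arrivals and flags when a reconnect looks necessary."""

    def __init__(self, heartbeat_interval_secs, missed_allowed=5,
                 clock=time.monotonic):
        # heartbeat_interval_secs: interval announced by the master (assumed).
        # missed_allowed: hypothetical number of silent intervals tolerated.
        self._interval = heartbeat_interval_secs
        self._missed_allowed = missed_allowed
        self._clock = clock
        self._last_heartbeat = clock()

    def on_heartbeat(self):
        """Call whenever a HEARTBEAT (or any other event) arrives."""
        self._last_heartbeat = self._clock()

    def should_reconnect(self):
        """True once the master has been silent for too many intervals."""
        silence = self._clock() - self._last_heartbeat
        return silence > self._interval * self._missed_allowed
```

The framework would call {{on_heartbeat()}} from its event handler and poll 
{{should_reconnect()}} on a timer, issuing the {{reconnect}} request when it 
returns true.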



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-27 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304189#comment-15304189
 ] 

Anand Mazumdar commented on MESOS-5468:
---

If, for some reason, a framework gets disconnected from the master, the master 
gives it {{failover_timeout}} to re-register before removing it completely. 
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L231

We currently don't specify a timeout value for the example long-lived 
framework, so it defaults to 0ns, i.e. the framework is removed as soon as it 
disconnects.

{noformat}
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)) 0ns to 
failover
{noformat}
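
To illustrate where the value goes, here is a minimal sketch of a SUBSCRIBE 
call body for the v1 scheduler HTTP API with a non-zero {{failover_timeout}} 
(in seconds). The helper name, the user and framework names, and the one-week 
value are assumptions for illustration only. The example framework itself is 
written in C++ and would set the same field on its {{FrameworkInfo}}.

```python
import json


def subscribe_call(user, name, failover_timeout_secs):
    """Build a SUBSCRIBE call body carrying an explicit failover_timeout.

    failover_timeout is expressed in seconds. When it is left unset, the
    master uses 0, i.e. the framework is removed as soon as it disconnects.
    """
    return {
        "type": "SUBSCRIBE",
        "subscribe": {
            "framework_info": {
                "user": user,
                "name": name,
                "failover_timeout": failover_timeout_secs,
            }
        },
    }


if __name__ == "__main__":
    one_week = 7 * 24 * 60 * 60.0  # example value, not a recommendation
    call = subscribe_call("root", "Long Lived Framework (C++)", one_week)
    print(json.dumps(call, indent=2))
```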

I wasn't able to reproduce the socket closure issue on my end, i.e. the socket 
is closed as soon as the master disconnects the long-lived-framework. 

Can you have a look at the reproduction steps on the JIRA and let me know if 
any steps are missing?

{noformat}
$  ~  netstat -tpn | grep -i 5050
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.1.1:5050          127.0.0.1:45226         ESTABLISHED 32402/lt-mesos-mast
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          ESTABLISHED 961/lt-long-lived-f
tcp        0      0 127.0.0.1:45226         127.0.1.1:5050          ESTABLISHED 961/lt-long-lived-f
tcp        0      0 127.0.1.1:5050          127.0.0.1:45224         ESTABLISHED 32402/lt-mesos-mast
{noformat}

After following the steps on the JIRA, i.e. once the long-running framework 
gets disconnected:

{noformat}
$ ~  netstat -tpn | grep -i 5050
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          TIME_WAIT   -
tcp        0      0 127.0.0.1:45226         127.0.1.1:5050          TIME_WAIT   -
{noformat}
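
To compare captures like the two above without reading them by eye, a small 
helper along these lines could tally the socket states per capture. It is only 
a sketch: the function name is hypothetical, and it assumes the usual 
{{netstat -tn}} column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign 
Address, State), with each connection on a single line.

```python
from collections import Counter


def count_tcp_states(netstat_output, port):
    """Count TCP states for connections using `port` on either end."""
    states = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, warnings and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or foreign.endswith(suffix):
            states[state] += 1
    return states
```

A capture taken after the disconnect that still reports ESTABLISHED entries 
for port 5050 would point to the socket not being closed.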


[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-27 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303589#comment-15303589
 ] 

Jay Guo commented on MESOS-5468:


Another question: how long does the master wait before timing out a framework? 
I don't see the option in the configuration. Or do we use some mechanism other 
than a timeout to invalidate a framework?

[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-27 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303587#comment-15303587
 ] 

Jay Guo commented on MESOS-5468:


See steps to reproduce in my first comment.

[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-27 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303581#comment-15303581
 ] 

Jay Guo commented on MESOS-5468:


[~anandmazumdar]
The socket is NOT closed successfully and is still left in ESTABLISHED (as can 
be observed with {{netstat}}). I suspect this somehow happens before the master 
explicitly issues the close. Here's the log:
{code:title=master.log}
E0527 05:48:45.564194 13105 process.cpp:2033] Failed to shutdown socket with fd 
33: Transport endpoint is not connected
I0527 05:48:45.573005 13101 master.cpp:1383] Framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)) 
disconnected
I0527 05:48:45.573212 13101 master.cpp:2792] Disconnecting framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
I0527 05:48:45.573431 13101 master.cpp:2816] Deactivating framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
W0527 05:48:45.574806 13101 master.hpp:1846] Master attempted to send message 
to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived 
Framework (C++))
I0527 05:48:45.575145 13100 hierarchical.cpp:375] Deactivated framework 
61100b89-f964-4aa2-b084-e1089d205b83-
W0527 05:48:45.580201 13101 master.hpp:1852] Unable to send event to framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)): 
connection closed
W0527 05:48:45.581838 13101 master.hpp:1846] Master attempted to send message 
to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived 
Framework (C++))
W0527 05:48:45.582034 13101 master.hpp:1852] Unable to send event to framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)): 
connection closed
W0527 05:48:45.583015 13101 master.hpp:1846] Master attempted to send message 
to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived 
Framework (C++))
W0527 05:48:45.583124 13101 master.hpp:1852] Unable to send event to framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)): 
connection closed
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)) 0ns to 
failover
I0527 05:48:45.585503 13102 master.cpp:5516] Framework failover timeout, 
removing framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived 
Framework (C++))
I0527 05:48:45.585793 13102 master.cpp:6246] Removing framework 
61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
I0527 05:48:45.588471 13102 master.cpp:6761] Updating the state of task 2 of 
framework 61100b89-f964-4aa2-b084-e1089d205b83- (latest state: 
TASK_FINISHED, status update state: TASK_KILLED)
I0527 05:48:45.589534 13102 master.cpp:6827] Removing task 2 with resources 
cpus(*):0.001; mem(*):1 of framework 61100b89-f964-4aa2-b084-e1089d205b83- 
on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 
(agent-3.novalocal)
I0527 05:48:45.590454 13102 master.cpp:6856] Removing executor 'default' with 
resources cpus(*):0.1; mem(*):32 of framework 
61100b89-f964-4aa2-b084-e1089d205b83- on agent 
af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 
(agent-3.novalocal)
I0527 05:48:45.592897 13100 hierarchical.cpp:326] Removed framework 
61100b89-f964-4aa2-b084-e1089d205b83-
W0527 05:48:50.662726 13098 master.cpp:5199] Ignoring unknown exited executor 
'default' of framework 61100b89-f964-4aa2-b084-e1089d205b83- on agent 
af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 
(agent-3.novalocal)
{code}

The build is not super fresh (less than a week old), so the line numbers may 
not match the latest code.

[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.

2016-05-26 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303537#comment-15303537
 ] 

Anand Mazumdar commented on MESOS-5468:
---

[~guoger] I edited the JIRA description a bit. Let me know if it does not align 
with your observations.

Also, we do close the socket on the master's side upon a framework 
disconnect/teardown. 
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L2795

If you are not seeing this behavior on your end, can you confirm that and share 
some steps to reproduce it?
