[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15309244#comment-15309244 ]

Jay Guo commented on MESOS-5468:

[~anandmazumdar] Sorry for the delay. One of the two connections between the framework and the master was closed successfully, but the other was left ESTABLISHED when the master attempted to remove the framework. After the network rejoined, the master repeatedly denied the framework's subscription calls. So the question is: is the EVENT connection left open intentionally or accidentally? Here's the full log:

{code:title=master.log}
I0601 12:12:03.671700 2252 master.cpp:5195] Status update TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- from agent edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:12:03.671931 2252 master.cpp:5243] Forwarding status update TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:12:03.672360 2252 master.cpp:6853] Updating the state of task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: TASK_FINISHED, status update state: TASK_FINISHED)
I0601 12:14:43.677433 2247 master.cpp:5195] Status update TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- from agent edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:14:43.677781 2247 master.cpp:5243] Forwarding status update TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:14:43.678387 2247 master.cpp:6853] Updating the state of task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: TASK_FINISHED, status update state: TASK_FINISHED)
I0601 12:20:03.679064 2251 master.cpp:5195] Status update TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- from agent edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:20:03.679194 2251 master.cpp:5243] Forwarding status update TASK_FINISHED (UUID: e370dac6-2915-4090-876f-c000d0fe71c7) for task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:20:03.679565 2251 master.cpp:6853] Updating the state of task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: TASK_FINISHED, status update state: TASK_FINISHED)
E0601 12:25:02.891707 2254 process.cpp:2040] Failed to shutdown socket with fd 13: Transport endpoint is not connected
I0601 12:25:02.895753 2248 master.cpp:1388] Framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++)) disconnected
I0601 12:25:02.896077 2248 master.cpp:2822] Disconnecting framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
I0601 12:25:02.896289 2248 master.cpp:2846] Deactivating framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
W0601 12:25:02.896682 2248 master.hpp:1903] Master attempted to send message to disconnected framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
W0601 12:25:02.897027 2248 master.hpp:1909] Unable to send event to framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++)): connection closed
I0601 12:25:02.897341 2248 master.cpp:1401] Giving framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++)) 0ns to failover
I0601 12:25:02.896751 2249 hierarchical.cpp:375] Deactivated framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
I0601 12:25:02.901005 2251 master.cpp:5608] Framework failover timeout, removing framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
I0601 12:25:02.901053 2251 master.cpp:6338] Removing framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (Long Lived Framework (C++))
I0601 12:25:02.901409 2251 master.cpp:6853] Updating the state of task 3 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- (latest state: TASK_FINISHED, status update state: TASK_KILLED)
I0601 12:25:02.901449 2251 master.cpp:6919] Removing task 3 with resources cpus(*):0.001; mem(*):1 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- on agent edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:25:02.901721 2251 master.cpp:6948] Removing executor 'default' with resources cpus(*):0.1; mem(*):32 of framework e8288e1d-2c05-4e05-9db7-713a366f7f5f- on agent edbc3730-e55b-4390-a1f2-5de5a66497f5-S0 at slave(1)@127.0.1.1:5051 (ubuntu)
I0601 12:25:02.902426 2251 hierarchical.cpp:326] Removed framework e8288e1d-2c05-4e05-9db7-713a366f7f5f-
W0601 12:25:08.007905 2253 master.cpp:5291] Ignoring unknown exited executor 'default'
[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304360#comment-15304360 ]

Jay Guo commented on MESOS-5468:

What is your iptables command? I can consistently reproduce the problem on the latest build.
* How long does it take for the master to disconnect the framework after the network partition ({{iptables command issued}})?
* Do the TCP sockets go into the FIN_WAIT_1 state?

I think the key question is how the master notices a network partition. IIUC, it relies on the TCP socket timeout, which is typically 13-30 minutes on a Linux box (per the tcp man page), and that is the delay I experienced between the disconnect and the give-up. At that point the TCP socket informs the user (mesos-master) of the broken link while remaining ESTABLISHED. It is then up to the application to handle the failure, and I suspect that libprocess does not properly close the socket here. I'll need to investigate further.

I see other users hitting the {{Transport endpoint is not connected}} error, and I have seen it many times myself, so I think we should definitely take a serious look at it.

Another question: why didn't we use a mature HTTP library from the very beginning, instead of having our own implementation?

Cheers,
/J

> Add logic in long-lived-framework to handle network partitions.
> ---
>
> Key: MESOS-5468
> URL: https://issues.apache.org/jira/browse/MESOS-5468
> Project: Mesos
> Issue Type: Task
> Components: framework, master
> Reporter: Jay Guo
>
> Currently long-lived-framework does not handle network partitions, i.e. it does not explicitly try to {{reconnect}} with the master upon not receiving {{HEARTBEAT}} events for a prolonged amount of time. If the master disconnects a framework without the framework being aware of it (one-way partition), the framework should explicitly issue a {{reconnect}} request via the scheduler library after a certain period of time.
> *On the other hand*, should we close the TCP socket on the master side when tearing down a framework? Currently the TCP socket is left alive even after the framework has been deactivated. This results in the framework sending an invalid {{Call}} to the master and in re-detection.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
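
A rough sanity check of the 13-30 minute figure mentioned in the comment above. This is only a sketch under assumed Linux defaults that are not stated in this thread: {{tcp_retries2}} = 15, a minimum retransmission timeout (RTO) of 200 ms, and a maximum RTO of 120 s, with the RTO doubling after each unacknowledged retransmission. The real RTO depends on the measured round-trip time, so the result is a lower bound rather than an exact duration. The function name is hypothetical.

```python
def retransmission_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Lower-bound time in seconds before TCP gives up on unacknowledged data.

    Assumes exponential backoff: the n-th wait is rto_min * 2**n, capped at
    rto_max. There is one wait after the original send plus one after each
    of the `retries` retransmissions, hence retries + 1 terms.
    """
    return sum(min(rto_min * 2 ** n, rto_max) for n in range(retries + 1))


if __name__ == "__main__":
    seconds = retransmission_timeout()
    print(f"{seconds:.1f} s (about {seconds / 60:.1f} min)")
```

Under these assumptions the sum comes to roughly 924.6 seconds, a little over 15 minutes, which falls within the 13-30 minute range.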
[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304189#comment-15304189 ]

Anand Mazumdar commented on MESOS-5468:
---

If, for some reason, a framework gets disconnected from the master, the master gives it {{failover_timeout}} to register before removing it completely.
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L231

We currently don't specify a timeout value for the example long-lived framework, so it defaults to 0ns, i.e. it is removed as soon as it disconnects.

{noformat}
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)) 0ns to failover
{noformat}

I wasn't able to reproduce the socket closure issue on my end, i.e. the socket is closed as soon as the master disconnects the long-lived-framework. Can you have a look at the reproduction steps on the JIRA and let me know if any steps are missing?

{noformat}
$ ~ netstat -tpn | grep -i 5050
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.1.1:5050          127.0.0.1:45226         ESTABLISHED 32402/lt-mesos-mast
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          ESTABLISHED 961/lt-long-lived-f
tcp        0      0 127.0.0.1:45226         127.0.1.1:5050          ESTABLISHED 961/lt-long-lived-f
tcp        0      0 127.0.1.1:5050          127.0.0.1:45224         ESTABLISHED 32402/lt-mesos-mast
{noformat}

After following the steps on the JIRA, i.e. once the long-running framework gets disconnected:

{noformat}
$ ~ netstat -tpn | grep -i 5050
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          TIME_WAIT   -
tcp        0      0 127.0.0.1:45226         127.0.1.1:5050          TIME_WAIT   -
{noformat}
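
The {{failover_timeout}} behaviour described in this comment can be modelled in a few lines. This is a simplified illustration, not the master's actual logic: the function and parameter names are hypothetical, and times are plain seconds.

```python
from typing import Optional


def framework_removed(disconnected_at: float,
                      now: float,
                      failover_timeout: float = 0.0,
                      reregistered_at: Optional[float] = None) -> bool:
    """Return True if a disconnected framework would have been removed by `now`.

    Simplified model: the framework survives only if it comes back no later
    than disconnected_at + failover_timeout. With the default timeout of 0
    it is removed at the moment it disconnects.
    """
    deadline = disconnected_at + failover_timeout
    if reregistered_at is not None and reregistered_at <= deadline:
        return False  # came back in time, so it is kept
    return now >= deadline


if __name__ == "__main__":
    # Default timeout (0): removed immediately, as in the log line above.
    print(framework_removed(disconnected_at=100.0, now=100.0))
    # A 60 s timeout leaves a window in which the framework can come back.
    print(framework_removed(disconnected_at=100.0, now=130.0,
                            failover_timeout=60.0))
```

A framework that wants to survive a partition would therefore need a non-zero {{failover_timeout}} that is longer than the time it takes to notice the partition and come back.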
[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303589#comment-15303589 ]

Jay Guo commented on MESOS-5468:

Another question: after how long do we time out a framework? I don't see such an option in the configuration. Or are we using some mechanism other than a timeout to invalidate a framework?
[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303587#comment-15303587 ]

Jay Guo commented on MESOS-5468:

See steps to reproduce in my first comment.
[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303581#comment-15303581 ]

Jay Guo commented on MESOS-5468:

[~anandmazumdar] The socket is NOT closed successfully and is still left in ESTABLISHED (as can be observed with {{netstat}}). I suspect this somehow happens before the master explicitly issues the close. Here's the log:

{code:title=master.log}
E0527 05:48:45.564194 13105 process.cpp:2033] Failed to shutdown socket with fd 33: Transport endpoint is not connected
I0527 05:48:45.573005 13101 master.cpp:1383] Framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)) disconnected
I0527 05:48:45.573212 13101 master.cpp:2792] Disconnecting framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
I0527 05:48:45.573431 13101 master.cpp:2816] Deactivating framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
W0527 05:48:45.574806 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
I0527 05:48:45.575145 13100 hierarchical.cpp:375] Deactivated framework 61100b89-f964-4aa2-b084-e1089d205b83-
W0527 05:48:45.580201 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)): connection closed
W0527 05:48:45.581838 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
W0527 05:48:45.582034 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)): connection closed
W0527 05:48:45.583015 13101 master.hpp:1846] Master attempted to send message to disconnected framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
W0527 05:48:45.583124 13101 master.hpp:1852] Unable to send event to framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)): connection closed
I0527 05:48:45.583395 13101 master.cpp:1396] Giving framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++)) 0ns to failover
I0527 05:48:45.585503 13102 master.cpp:5516] Framework failover timeout, removing framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
I0527 05:48:45.585793 13102 master.cpp:6246] Removing framework 61100b89-f964-4aa2-b084-e1089d205b83- (Long Lived Framework (C++))
I0527 05:48:45.588471 13102 master.cpp:6761] Updating the state of task 2 of framework 61100b89-f964-4aa2-b084-e1089d205b83- (latest state: TASK_FINISHED, status update state: TASK_KILLED)
I0527 05:48:45.589534 13102 master.cpp:6827] Removing task 2 with resources cpus(*):0.001; mem(*):1 of framework 61100b89-f964-4aa2-b084-e1089d205b83- on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
I0527 05:48:45.590454 13102 master.cpp:6856] Removing executor 'default' with resources cpus(*):0.1; mem(*):32 of framework 61100b89-f964-4aa2-b084-e1089d205b83- on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
I0527 05:48:45.592897 13100 hierarchical.cpp:326] Removed framework 61100b89-f964-4aa2-b084-e1089d205b83-
W0527 05:48:50.662726 13098 master.cpp:5199] Ignoring unknown exited executor 'default' of framework 61100b89-f964-4aa2-b084-e1089d205b83- on agent af46d7b0-4e75-443d-9e11-e89d5605f012-S2 at slave(1)@10.11.13.10:5051 (agent-3.novalocal)
{code}

The build is not super fresh (less than a week old), so the line numbers may not match the latest code.
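
To compare socket states before and after the partition, the {{netstat}} output can be tallied by state. The following is a minimal sketch that assumes the usual {{netstat -tn}} column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name and the sample lines are illustrative only.

```python
from collections import Counter


def states_for_port(netstat_output: str, port: int = 5050) -> Counter:
    """Count TCP connection states for sockets whose local or foreign
    address ends in the given port."""
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, warnings, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or foreign.endswith(suffix):
            counts[state] += 1
    return counts


# Illustrative input: one master-side socket and one framework-side socket.
SAMPLE = """\
tcp        0      0 127.0.1.1:5050          127.0.0.1:45226         ESTABLISHED 32402/lt-mesos-mast
tcp        0      0 127.0.0.1:45224         127.0.1.1:5050          TIME_WAIT   -
"""

if __name__ == "__main__":
    for state, count in sorted(states_for_port(SAMPLE).items()):
        print(state, count)
```

Any ESTABLISHED entry that remains for the master port after the framework has been removed would correspond to the lingering socket described in the comment above.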
[jira] [Commented] (MESOS-5468) Add logic in long-lived-framework to handle network partitions.
[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303537#comment-15303537 ]

Anand Mazumdar commented on MESOS-5468:
---

[~guoger] I edited the JIRA description a bit. Let me know if it does not align with your observations.

Also, we do close the socket on the master's side upon a framework disconnect/teardown.
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L2795

Can you confirm whether you are not seeing this behavior on your end, and share some steps to reproduce it?
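
The reconnect logic proposed in the issue description could be sketched as a small heartbeat watchdog. This is only an illustration under stated assumptions: the class and method names are hypothetical, the heartbeat interval and the number of tolerated missed heartbeats are parameters chosen by the framework, and the actual {{reconnect}} request would still go through the scheduler library.

```python
import time


class HeartbeatWatchdog:
    """Tracks HEARTBEAT events and reports when the framework should
    explicitly reconnect because too many have been missed."""

    def __init__(self, heartbeat_interval, missed_allowed=3,
                 clock=time.monotonic):
        # Time without any heartbeat after which we assume a partition.
        self._timeout = heartbeat_interval * missed_allowed
        self._clock = clock
        self._last_seen = clock()

    def on_heartbeat(self):
        """Call whenever a HEARTBEAT event arrives from the master."""
        self._last_seen = self._clock()

    def should_reconnect(self):
        """True once no heartbeat has been seen for longer than the timeout."""
        return self._clock() - self._last_seen > self._timeout


if __name__ == "__main__":
    fake_now = [0.0]  # injected clock so the example runs instantly
    watchdog = HeartbeatWatchdog(15.0, missed_allowed=2,
                                 clock=lambda: fake_now[0])
    fake_now[0] = 31.0  # more than two intervals with no heartbeat
    if watchdog.should_reconnect():
        print("no heartbeat for too long; would issue reconnect here")
```

Injecting the clock keeps the timing decision separate from the networking code, so the threshold can be tested without waiting for real heartbeats.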