[ https://issues.apache.org/jira/browse/MESOS-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303509#comment-15303509 ]
Jay Guo commented on MESOS-5468:
--------------------------------

To reproduce:
* Start the master and an agent
* Run long-lived-framework
* Issue {{# iptables -A OUTPUT -p tcp -d <master-ip> --dport 5050 -j DROP}} on the framework machine to emulate a network partition
* Wait until the master deactivates the framework
* Remove the iptables rule added above to emulate a network rejoin
* Check the logs of both long-lived-framework and the master. {{netstat -tpn}} also shows an enormous number of {{TIME_WAIT}} sockets, which is the result of re-detection

> Add logic to long-lived-framework to handle HEARTBEAT timeout
> -------------------------------------------------------------
>
>          Key: MESOS-5468
>          URL: https://issues.apache.org/jira/browse/MESOS-5468
>      Project: Mesos
>   Issue Type: Bug
>   Components: framework, master
>     Reporter: Jay Guo
>
> Currently long-lived-framework does not handle HEARTBEAT timeout. If the
> master tears down the framework without the framework being aware of it
> (network partition), the framework keeps waiting for an {{Event}} until it
> is reconnected.
> *On the other hand*, should we close the TCP socket on the master side when
> tearing down a framework? Currently the TCP socket is left alive even after
> the framework has been deactivated. This results in the framework sending
> invalid {{Call}}s to the master and in re-detection.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
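
For illustration only, the kind of HEARTBEAT timeout handling the issue asks for might look roughly like the sketch below. It is not the long-lived-framework code (which is C++) and not a Mesos API: the class name {{HeartbeatWatchdog}}, the {{missed_allowed}} factor, and the injectable clock are all hypothetical. It assumes the framework knows the heartbeat interval the master advertised at subscription time, and that it calls {{on_event()}} for every {{Event}} it receives, HEARTBEAT included.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Hypothetical helper: reports a timeout when no event has been
    seen for `interval_seconds * missed_allowed` seconds."""

    def __init__(
        self,
        interval_seconds: float,
        missed_allowed: int = 5,  # assumed tolerance, not a Mesos default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = interval_seconds * missed_allowed
        self._clock = clock
        self._last_event = clock()

    def on_event(self) -> None:
        # Call for every received event, HEARTBEAT included.
        self._last_event = self._clock()

    def timed_out(self) -> bool:
        # True once the stream has been silent longer than the timeout.
        return self._clock() - self._last_event > self._timeout
```

A framework could poll {{timed_out()}} from a timer and, when it returns true, close the stale connection and resubscribe instead of waiting indefinitely for the next {{Event}}.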