[ https://issues.apache.org/jira/browse/MESOS-8111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cosmin Lehene updated MESOS-8111:
---------------------------------
    Description:
After scaling down a cluster, the master is reporting a task as running although the slave has been long gone.
At the same time it reports it can't kill it because the agent is offline

{noformat}
I1018 16:55:22.000000 6976 master.cpp:4913] Processing KILL call for task 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.000000 6976 master.cpp:5000] Cannot kill task spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}

Clearly, if the agent is offline the task is also not running. Also not sure waiting indefinitely for an agent to recover is a good strategy.

  was:
After scaling down a cluster, the master is reporting a task as running although the slave has been long gone.
At the same time it reports it can't kill it because the agent is offline

{noformat}
I1018 16:55:22.000000 6976 master.cpp:4913] Processing KILL call for task 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
W1018 16:55:22.000000 6976 master.cpp:5000] Cannot kill task spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051 (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
{noformat}


> Mesos sees task as running, but cannot kill it because the agent is offline
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-8111
>                 URL: https://issues.apache.org/jira/browse/MESOS-8111
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.2.3
>         Environment: DC/OS 1.9.4
>            Reporter: Cosmin Lehene
>
> After scaling down a cluster, the master is reporting a task as running
> although the slave has been long gone.
> At the same time it reports it can't kill it because the agent is offline
> {noformat}
> I1018 16:55:22.000000 6976 master.cpp:4913] Processing KILL call for task
> 'spark.7b59a77b-b353-11e7-addd-b29ecbf071e1' of framework
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101
> W1018 16:55:22.000000 6976 master.cpp:5000] Cannot kill task
> spark.7b59a77b-b353-11e7-addd-b29ecbf071e1 of framework
> 4d2a982a-0e62-4471-88e8-8df9cc0ae437-0001 (marathon) at
> scheduler-45eafb76-4510-482e-9bcc-06e3ad97c276@172.16.0.7:15101 because the
> agent 4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129 at slave(1)@10.0.0.81:5051
> (10.0.0.81) is disconnected. Kill will be retried if the agent re-registers
> {noformat}
> Clearly, if the agent is offline the task is also not running. Also not sure
> waiting indefinitely for an agent to recover is a good strategy.
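
For anyone trying to confirm the same symptom on their own cluster, here is a rough sketch (not part of the original report) that scans a parsed snapshot of the master's state JSON for tasks still marked TASK_RUNNING on agents that are missing or not active. The function name is made up for illustration, and the field names it reads (`slaves`, `active`, `frameworks`, `tasks`, `slave_id`, `state`) are assumptions about the layout of the master's state output; check them against your Mesos version before relying on it. The sample data is hand-built from the IDs in the log above, not real output.

```python
def find_orphaned_running_tasks(state):
    """Return (task_id, agent_id) pairs for tasks reported as running
    on an agent that is absent from, or inactive in, the snapshot.

    `state` is a dict parsed from the master's state JSON. The keys used
    here are assumed, not taken from the report.
    """
    # Collect the IDs of agents the master currently lists as active.
    active_agents = {
        agent["id"]
        for agent in state.get("slaves", [])
        if agent.get("active", False)
    }

    orphaned = []
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            running = task.get("state") == "TASK_RUNNING"
            if running and task.get("slave_id") not in active_agents:
                orphaned.append((task["id"], task.get("slave_id")))
    return orphaned


if __name__ == "__main__":
    # Hand-built sample: one agent is gone, but its task is still "running".
    sample = {
        "slaves": [
            {"id": "4d2a982a-0e62-4471-88e8-8df9cc0ae437-S1", "active": True},
        ],
        "frameworks": [
            {
                "name": "marathon",
                "tasks": [
                    {
                        "id": "spark.7b59a77b-b353-11e7-addd-b29ecbf071e1",
                        "slave_id": "4d2a982a-0e62-4471-88e8-8df9cc0ae437-S129",
                        "state": "TASK_RUNNING",
                    },
                ],
            },
        ],
    }
    for task_id, agent_id in find_orphaned_running_tasks(sample):
        print(f"{task_id} is reported running on unavailable agent {agent_id}")
```

This only reports candidates; it does not change master state or kill anything.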