Meng Zhu created MESOS-8459:
-------------------------------

    Summary: Executor could linger without ever receiving any tasks
    Key: MESOS-8459
    URL: https://issues.apache.org/jira/browse/MESOS-8459
    Project: Mesos
    Issue Type: Bug
    Components: executor
    Reporter: Meng Zhu

An executor's initial tasks may be killed even after the executor has registered. In that case, the executor could linger forever without ever receiving a task.

In MESOS-8411, we added a short-term fix that checks an executor's completed and terminated task queues to see whether it has ever received any tasks. If the check is false and there are no queued or launched tasks, the agent will shut down the executor.

However, this check is not bulletproof. The completedTasks queue is a circular_buffer (current size 200), which means that earlier completed tasks that were possibly updated by the executor may be ejected and thus missed by this check. This would lead to false-positive shutdowns.

Per discussion with [~vinodkone] and [~bmahler], there are two long-term solutions. The first is to checkpoint additional executor state that indicates whether the executor has ever received any tasks (no more inference from task queue status). The alternative is to add timeouts in the executor driver to shut down lingering executors automatically.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
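
The false-positive shutdown described in this issue can be illustrated with a small, self-contained C++ sketch. It is not the Mesos agent code: `BoundedBuffer`, `TaskRecord`, `ExecutorState`, `simulate`, and the helper functions are hypothetical names invented for illustration, `BoundedBuffer` is a `std::deque`-based stand-in for the real `circular_buffer`, and the `updatedByExecutor` flag is an assumed way of modelling "a completed task that was updated by the executor". The `everReceivedTask` flag models the first proposed long-term fix (an explicitly recorded state) under the same assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for a fixed-capacity circular buffer: once full,
// pushing a new element silently ejects the oldest one.
template <typename T>
class BoundedBuffer {
public:
  explicit BoundedBuffer(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  void push_back(const T& value) {
    if (items_.size() == capacity_) {
      items_.pop_front();  // The oldest record is lost here.
    }
    items_.push_back(value);
  }

  const std::deque<T>& items() const { return items_; }

private:
  std::size_t capacity_;
  std::deque<T> items_;
};

// Assumption: a task counts as "received" if the executor ever updated it.
struct TaskRecord {
  std::string id;
  bool updatedByExecutor;
};

struct ExecutorState {
  explicit ExecutorState(std::size_t capacity) : completedTasks(capacity) {}

  std::size_t queuedTasks = 0;
  std::size_t launchedTasks = 0;
  std::vector<TaskRecord> terminatedTasks;
  BoundedBuffer<TaskRecord> completedTasks;
  bool everReceivedTask = false;  // Models explicitly recorded state.
};

// Short-term style check: infer the answer from the task queues alone.
bool inferredEverReceivedTask(const ExecutorState& e) {
  auto seen = [](const TaskRecord& t) { return t.updatedByExecutor; };
  const std::deque<TaskRecord>& completed = e.completedTasks.items();
  return std::any_of(e.terminatedTasks.begin(), e.terminatedTasks.end(), seen) ||
         std::any_of(completed.begin(), completed.end(), seen);
}

bool idle(const ExecutorState& e) {
  return e.queuedTasks == 0 && e.launchedTasks == 0;
}

bool shouldShutdownByInference(const ExecutorState& e) {
  return idle(e) && !inferredEverReceivedTask(e);
}

bool shouldShutdownByFlag(const ExecutorState& e) {
  return idle(e) && !e.everReceivedTask;
}

// One task is handled by the executor, then `laterTasks` tasks complete
// without the executor ever updating them.
ExecutorState simulate(std::size_t capacity, std::size_t laterTasks) {
  ExecutorState e(capacity);
  e.completedTasks.push_back(TaskRecord{"task-0", true});
  e.everReceivedTask = true;
  for (std::size_t i = 1; i <= laterTasks; ++i) {
    e.completedTasks.push_back(
        TaskRecord{"killed-" + std::to_string(i), false});
  }
  return e;
}
```

In this model, `simulate(200, 199)` leaves `task-0` in the buffer, so the inferred check still sees it. With `simulate(200, 200)` the 201st push ejects `task-0`, and `shouldShutdownByInference` then reports that the executor should be shut down even though it did receive a task, while `shouldShutdownByFlag` does not.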