[jira] [Commented] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

Yan Xu (JIRA) Tue, 09 Jan 2018 23:29:22 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319818#comment-16319818
 ]


Yan Xu commented on MESOS-8125:
-------------------------------

We used to not need to handle recovering executors after a reboot because the 
agent would have been considered lost, so not only did we not need to recover 
the executors, we also didn't need to resume unacknowledged status updates, etc.

In the new scenario we need to handle these, so we cannot simply remove the 
{{latest}} executor run symlink. I guess we should short-circuit only the 
executor reconnect/reregister logic, based on the {{rebooted}} field in the 
top-level {{State}}, and keep the rest of the recovery logic.
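
To make the idea concrete, here is a rough sketch in Python rather than the agent's actual C++. The names {{ExecutorRun}}, {{RecoveryPlan}} and {{plan_recovery}} are hypothetical and are not Mesos APIs; only the {{rebooted}} flag comes from the discussion above. It assumes that after a reboot every checkpointed executor can be treated as gone, while unacknowledged status updates are still resumed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutorRun:
    # Hypothetical stand-in for one checkpointed executor run.
    executor_id: str
    forked_pid: int
    pending_updates: List[str] = field(default_factory=list)


@dataclass
class RecoveryPlan:
    # What the (hypothetical) recovery step decided to do.
    reconnect: List[str] = field(default_factory=list)
    mark_terminated: List[str] = field(default_factory=list)
    resume_updates: List[str] = field(default_factory=list)


def plan_recovery(rebooted: bool, runs: List[ExecutorRun]) -> RecoveryPlan:
    plan = RecoveryPlan()
    for run in runs:
        if rebooted:
            # After a reboot the checkpointed pid may belong to an
            # unrelated process, so skip reconnect/reregister entirely.
            plan.mark_terminated.append(run.executor_id)
        else:
            plan.reconnect.append(run.executor_id)
        # The rest of recovery is kept in both cases: unacknowledged
        # status updates are still resumed.
        plan.resume_updates.extend(run.pending_updates)
    return plan
```

With {{rebooted=True}} no executor ends up in {{reconnect}}, but its pending updates are still carried over, which is the part that removing the {{latest}} symlink would lose.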

> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
>                 Key: MESOS-8125
>                 URL: https://issues.apache.org/jira/browse/MESOS-8125
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Gastón Kleiman
>            Assignee: Megha Sharma
>            Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under 
> {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host
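
For anyone scripting step 3 of the reproduction above, a small helper along these lines might do. This is only a sketch: {{overwrite_forked_pid}} is a hypothetical name, and it assumes that {{forked.pid}} holds the pid as plain decimal text, which should be checked against the agent version in use.

```python
from pathlib import Path


def overwrite_forked_pid(pid_file: Path, new_pid: int) -> int:
    """Replace the checkpointed pid and return the previous value.

    Assumes the file contains the pid as plain decimal text.
    """
    old_pid = int(pid_file.read_text().strip())
    pid_file.write_text(str(new_pid))
    return old_pid
```

It should only be run while the agent is stopped, passing the path to the executor's {{forked.pid}} file and the pid to fake (2 in the report above).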



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
