[
https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806707#comment-16806707
]
Qian Zhang edited comment on MESOS-9501 at 4/1/19 12:55 PM:
------------------------------------------------------------
This issue can actually happen even without an agent reboot (highly unlikely
but technically possible):
# Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with
checkpoint enabled.
# Stop the agent process.
# After the task finishes, wait for a new process to reuse its pid. We can
simulate this by manually changing the task's checkpointed pid in the meta dir
to an existing process's pid (see the sketch after this list).
# Start the agent process.
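For step 3, a minimal sketch of the simulation is below. It assumes the default meta dir layout where the forked executor pid is checkpointed as a plain decimal string under `.../runs/<container_id>/pids/forked.pid`; the actual path has to be located manually, and the program itself is just an illustration, not part of Mesos.
{code:cpp}
// Sketch: overwrite a checkpointed executor pid with the pid of some
// long-running existing process to simulate pid reuse. Assumption: the
// forked.pid file contains the pid as a plain decimal string.
#include <fstream>
#include <iostream>

int main(int argc, char** argv)
{
  if (argc != 3) {
    std::cerr << "Usage: " << argv[0]
              << " <path to forked.pid> <existing pid>" << std::endl;
    return 1;
  }

  std::ofstream pidFile(argv[1], std::ios::trunc);
  if (!pidFile) {
    std::cerr << "Failed to open " << argv[1] << std::endl;
    return 1;
  }

  pidFile << argv[2];  // Replace the checkpointed pid.
  return 0;
}
{code}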
Then we will see `mesos-execute` receive a TASK_RUNNING status update (see
below) for the command task, which has actually finished already. This is
obviously a bug, since a finished task is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
message: 'Unreachable agent re-reregistered'
source: SOURCE_MASTER
reason: REASON_AGENT_REREGISTERED{code}
And TASK_RUNNING is the last status update for that task; no other task status
updates will be generated for it unless the process which reuses the pid
terminates. The root cause of this issue is that when the agent is started in
step 4, it will try to destroy the executor container since the executor cannot
reregister within `--executor_reregistration_timeout` (2 seconds by default),
but the destroy operation cannot complete because it hangs
[here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/mesos/containerizer.cpp#L2684:L2685]
waiting for the container's exit status, since the Mesos containerizer is
reaping an irrelevant process.
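To make the hang concrete, here is a standalone sketch (not the actual libprocess code) of the polling pattern the reaper is limited to for a pid that is not its child: it can only check whether the pid still exists, so a reused pid looks like a container that never exits. The helper name `waitForNonChildExit`, the use of pid 1, and the polling interval are all illustrative assumptions.
{code:cpp}
// Sketch of waiting on a non-child pid by polling for its existence, which is
// conceptually why the container destroy cannot complete while an unrelated
// process occupies the checkpointed pid.
#include <cerrno>
#include <chrono>
#include <csignal>
#include <iostream>
#include <thread>

#include <sys/types.h>

// Hypothetical helper: block until `pid` no longer exists.
void waitForNonChildExit(pid_t pid)
{
  while (true) {
    // kill() with signal 0 delivers no signal; it only checks for existence.
    if (::kill(pid, 0) == -1 && errno == ESRCH) {
      return;  // No such process: treat it as exited.
    }

    // If an unrelated process has reused `pid`, this check keeps succeeding,
    // so we never return; this is the same reason the destroy above hangs.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}

int main()
{
  // pid 1 never goes away, so this call never returns, analogous to waiting
  // on a checkpointed executor pid that has been reused.
  std::cout << "Waiting for pid 1 to exit (this will hang)..." << std::endl;
  waitForNonChildExit(1);
  return 0;
}
{code}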
> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -------------------------------------------------------------------------
>
> Key: MESOS-9501
> URL: https://issues.apache.org/jira/browse/MESOS-9501
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.5.1, 1.6.1, 1.7.0
> Reporter: Meng Zhu
> Assignee: Qian Zhang
> Priority: Critical
> Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid.
> If, after the agent host reboot, a new process with the same pid gets spawned,
> then the agent will wait for the wrong child process. This could get stuck
> until the wrongly waited-for process somehow exits; see `ReaperProcess::wait()`:
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This will block the executor termination as well as future task status updates
> (e.g., the master might still think the task is running).
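For illustration, a simplified sketch of the two cases the reaper has to handle, as described above (this is not the actual `ReaperProcess::wait()` code in 3rdparty/libprocess/src/reap.cpp, and the helper name `checkOnce` is hypothetical): `waitpid` can only reap the process's own children, so for a checkpointed pid that is not a child of the restarted agent the only available signal is whether the pid still exists, which is exactly what a reused pid defeats.
{code:cpp}
// Simplified illustration of the two cases described above.
#include <cerrno>
#include <csignal>
#include <iostream>
#include <optional>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// One polling step for a pid of interest. Returns the raw exit status if the
// pid was our child and has exited, 0 if a non-child pid has disappeared, and
// std::nullopt if the pid still appears to be running.
std::optional<int> checkOnce(pid_t pid)
{
  int status = 0;
  pid_t result = ::waitpid(pid, &status, WNOHANG);

  if (result == pid) {
    // Case 1: the pid is our child and has terminated; we just reaped it.
    return status;
  }

  if (result == -1 && errno == ECHILD) {
    // Case 2: the pid is not our child (e.g. a checkpointed executor pid
    // after the host reboot). The best we can do is check whether the pid
    // still exists. If an unrelated process has reused it, this keeps
    // reporting "alive" and the wait never completes.
    if (::kill(pid, 0) == -1 && errno == ESRCH) {
      return 0;  // The pid is gone; treat the process as exited.
    }
  }

  return std::nullopt;  // Still (apparently) running; poll again later.
}

int main()
{
  pid_t child = ::fork();
  if (child == 0) {
    _exit(0);  // Child exits immediately.
  }

  // Poll until the child is reaped (case 1 above).
  std::optional<int> status;
  while (!(status = checkOnce(child))) {}
  std::cout << "child " << child << " exited, status " << *status << std::endl;
  return 0;
}
{code}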