[ https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806707#comment-16806707 ]

Qian Zhang edited comment on MESOS-9501 at 4/1/19 12:55 PM:
------------------------------------------------------------

This issue can actually happen even without an agent reboot (highly unlikely 
but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with 
checkpointing enabled.
 # Stop the agent process.
 # After the task finishes, wait for a new process to reuse its pid. We can 
simulate this by manually changing the task's checkpointed pid in the meta 
directory to the pid of an existing process (a minimal sketch follows this list).
 # Start the agent process.
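
For step 3, here is a minimal C++ sketch of that simulation: it just overwrites a checkpointed pid file with the pid of a live, unrelated process. The `forked.pid` file name and the meta-directory layout in the comment are assumptions about the agent's default checkpoint layout, not something spelled out in this issue.
{code:cpp}
// Minimal sketch: simulate pid reuse by overwriting the task's checkpointed
// executor pid with the pid of an existing, unrelated process.
//
// Assumption: the checkpointed pid lives in a file such as
//   <work_dir>/meta/slaves/<agent>/frameworks/<fw>/executors/<exec>/runs/<run>/pids/forked.pid
// Pass the actual path and the replacement pid on the command line.
#include <fstream>
#include <iostream>

int main(int argc, char** argv)
{
  if (argc != 3) {
    std::cerr << "Usage: " << argv[0]
              << " <path to checkpointed pid file> <existing pid>" << std::endl;
    return 1;
  }

  std::ofstream out(argv[1], std::ios::trunc);
  if (!out) {
    std::cerr << "Failed to open " << argv[1] << std::endl;
    return 1;
  }

  // The agent reads this value back during recovery and will wait on it.
  out << argv[2];
  return out.good() ? 0 : 1;
}
{code}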

Then we will see that `mesos-execute` receives a TASK_RUNNING status update (see 
below) for the command task, which has actually already finished. This is 
obviously a bug, since a task that has finished is considered running by Mesos.
{code:java}
Received status update TASK_RUNNING for task 'test'
  message: 'Unreachable agent re-reregistered'
  source: SOURCE_MASTER
  reason: REASON_AGENT_REREGISTERED{code}
And TASK_RUNNING is the last status update for that task; no other status 
updates will be generated for it unless the process which reuses the pid 
terminates. The root cause of this issue is that when the agent is started in 
step 4, it tries to destroy the executor container because the executor cannot 
reregister within `--executor_reregistration_timeout` (2 seconds by default), 
but the destroy operation never completes: it hangs 
[here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/mesos/containerizer.cpp#L2684:L2685]
 because the Mesos containerizer ends up reaping an irrelevant process.
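
To make the hang concrete, below is a minimal, self-contained C++ sketch of this kind of polling wait. It is not the actual libprocess `ReaperProcess` code; it only illustrates that a waiter which can merely check whether a pid exists cannot distinguish the original (already finished) executor from an unrelated process that reused the pid, so it blocks until that unrelated process exits.
{code:cpp}
// Minimal sketch (not the actual libprocess ReaperProcess): poll until a pid
// disappears. If the checkpointed executor pid has been reused by an unrelated
// long-running process, this loop only returns when *that* process exits,
// which is what blocks container destruction here.
#include <cerrno>
#include <chrono>
#include <iostream>
#include <thread>

#include <signal.h>
#include <sys/types.h>

// Wait until no process with `pid` exists. A non-parent cannot waitpid() on
// `pid`, so an existence check (signal 0) is all it has to go on.
void waitForExit(pid_t pid)
{
  while (::kill(pid, 0) == 0 || errno == EPERM) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}

int main()
{
  pid_t checkpointedPid = 12345;  // hypothetical checkpointed executor pid
  waitForExit(checkpointedPid);   // hangs as long as any process owns this pid
  std::cout << "pid " << checkpointedPid << " is gone" << std::endl;
  return 0;
}
{code}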


> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-9501
>                 URL: https://issues.apache.org/jira/browse/MESOS-9501
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.5.1, 1.6.1, 1.7.0
>            Reporter: Meng Zhu
>            Assignee: Qian Zhang
>            Priority: Critical
>             Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will 
> still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and 
> assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to 
> wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid. If, 
> after the agent host reboot, a new process with the same pid gets spawned, then 
> the parent will wait for the wrong child process. This could get stuck until the 
> wrongly waited-for process somehow exits; see `ReaperProcess::wait()`: 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This will block executor termination as well as future task status updates 
> (e.g., the master might still think the task is running).


