[ 
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750531#comment-16750531
 ] 

Joseph Wu commented on MESOS-9507:
----------------------------------

One possible fix is to add a conditional between these two blocks:
https://github.com/apache/mesos/blob/0f8ee9555f89f0a5f139bc12c666a60164c7b09b/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L277-L287

{code}
  if (read.isNone()) {
    // This could happen if the agent died after opening the file for writing
    // but before it checkpointed anything.
    LOG(WARNING) << "Some descriptive warning";

    // <Insert some handling of this case>
  }
{code}

> Agent could not recover due to empty docker volume checkpointed files.
> ----------------------------------------------------------------------
>
>                 Key: MESOS-9507
>                 URL: https://issues.apache.org/jira/browse/MESOS-9507
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Priority: Critical
>              Labels: containerizer
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This is caused by agent recovery after the volume state file is created but 
> before checkpointing finishes. Basically the docker volume is not mounted 
> yet, so the docker volume isolator should skip recovering this volume.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to