[
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750531#comment-16750531
]
Joseph Wu commented on MESOS-9507:
----------------------------------
One possible fix is to add a conditional between these two blocks:
https://github.com/apache/mesos/blob/0f8ee9555f89f0a5f139bc12c666a60164c7b09b/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L277-L287
{code}
if (read.isNone()) {
  // This could happen if the agent died after opening the file for writing
  // but before it checkpointed anything.
  LOG(WARNING) << "Some descriptive warning";
  // <Insert some handling of this case>
}
{code}
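As a rough illustration of the conditional described above, here is a minimal standalone sketch. It uses only the C++ standard library rather than the Mesos/stout helpers, and every name in it (CheckpointState, readCheckpoint, shouldRecoverVolumes) is hypothetical and not part of the isolator. It assumes that an empty file means the agent died before checkpointing anything, and that skipping the container is acceptable handling; the real fix may need to do more.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical outcome of reading one container's volume checkpoint file.
enum class CheckpointState { Missing, Empty, Present };

struct CheckpointRead {
  CheckpointState state;
  std::string contents;  // Only meaningful when state == Present.
};

// Reads the whole file at `path` and classifies the result.
// An existing but zero-length file is reported as Empty instead of being
// handed to a JSON parser, which would fail on empty input.
CheckpointRead readCheckpoint(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  if (!file.is_open()) {
    return {CheckpointState::Missing, ""};
  }

  std::ostringstream buffer;
  buffer << file.rdbuf();
  std::string contents = buffer.str();

  if (contents.empty()) {
    return {CheckpointState::Empty, ""};
  }
  return {CheckpointState::Present, contents};
}

// Returns true if recovery should go on to parse the checkpoint, and false
// if the container should be skipped (assumed handling, see lead-in).
bool shouldRecoverVolumes(const std::string& path) {
  const CheckpointRead read = readCheckpoint(path);

  switch (read.state) {
    case CheckpointState::Missing:
      // Nothing was ever checkpointed for this container.
      return false;
    case CheckpointState::Empty:
      // The agent may have died after opening the file for writing but
      // before it checkpointed anything, so no volume was mounted yet.
      std::cerr << "WARNING: empty volume checkpoint at '" << path
                << "'; skipping recovery for this container" << std::endl;
      return false;
    case CheckpointState::Present:
      return true;
  }
  return false;  // Unreachable; keeps compilers quiet.
}
```

With this shape, the recovery loop would call shouldRecoverVolumes(path) before parsing and simply continue to the next container when it returns false, instead of failing the whole agent recovery.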
> Agent could not recover due to empty docker volume checkpointed files.
> ----------------------------------------------------------------------
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Gilbert Song
> Priority: Critical
> Labels: containerizer
>
> The agent could not recover because the checkpointed docker volume files
> were empty. See the logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect
> failed: Collect failed: Failed to recover docker volumes for orphan container
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows:
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This happens when the agent recovers after the volume state file has been
> created but before checkpointing finishes, which leaves the file empty. At
> that point the docker volume is not mounted yet, so the docker volume
> isolator should skip recovering this volume.