[ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550349#comment-15550349 ]
Jie Yu commented on MESOS-6302:
-------------------------------

commit 8bab70c691a3efeda301f72956de4f80b258464e
Author: Gilbert Song <songzihao1...@gmail.com>
Date:   Mon Oct 3 15:28:39 2016 -0700

    Fixed provisioner recovery when nested containers exist.

    Previously, during provisioner recovery, we first got all container ids
    from the provisioner directory, and then found all rootfses under each
    container's 'backends' directory. We assumed that if a 'container_id'
    directory exists in the provisioner directory, it must contain a
    'backends' directory underneath with at least one rootfs for that
    container. This is no longer true now that we support nested containers,
    because a nested container may be specified with a container image while
    its parent has no image specified. In that case, when the provisioner
    recovers, it still finds the parent container's id in the provisioner
    directory, but no 'backends' directory exists there, since all of the
    nested containers' backend information lives under the parent container's
    directory. As a result, we should skip recovering the 'Info' struct in
    the provisioner for a parent container that never provisioned an image.
Review: https://reviews.apache.org/r/52480/

> Agent recovery can fail after nested containers are launched
> ------------------------------------------------------------
>
>                 Key: MESOS-6302
>                 URL: https://issues.apache.org/jira/browse/MESOS-6302
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Greg Mann
>            Assignee: Gilbert Song
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.1.0
>
>         Attachments: read_write_app.json
>
>
> After launching a nested container which used a Docker image, I restarted
> the agent which ran that task group and saw the following in the agent logs
> during recovery:
> {code}
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813596  4640 status_update_manager.cpp:203] Recovering status update manager
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813622  4640 status_update_manager.cpp:211] Recovering executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of framework 118ca38d-daee-4b2d-b584-b5581738a3dd-0000
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.814249  4639 docker.cpp:745] Recovering Docker containers
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.815294  4642 containerizer.cpp:581] Recovering containerizer
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the container directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends': No such file or directory
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: To remedy this do as follows:
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: This ensures agent doesn't recover old live executors.
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 2: Restart the agent.
> {code}
> and the agent continues to restart in this fashion. Attached is the Marathon
> app definition that I used to launch the task group.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)