[ https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Park updated MESOS-7471:
--------------------------------
    Fix Version/s:     (was: 1.3.1)
                       1.3.0

> Provisioner recover should not always assume 'rootfses' dir exists.
> -------------------------------------------------------------------
>
>                 Key: MESOS-7471
>                 URL: https://issues.apache.org/jira/browse/MESOS-7471
>             Project: Mesos
>          Issue Type: Bug
>          Components: provisioner
>            Reporter: Gilbert Song
>            Assignee: Gilbert Song
>              Labels: provisioner
>             Fix For: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>
> The Mesos agent may restart for many reasons (e.g., disk full). Always assuming that the provisioner 'rootfses' dir exists can block the agent from recovering.
> {noformat}
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses': No such file or directory
> {noformat}
> This issue may occur due to a race between removing the provisioner container dir and an agent restart:
> {noformat}
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's forked pid 11577 to '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's forked pid 11579 to '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running command: /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info:
{"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE", > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI > 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz' > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the > sandbox directory > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI > 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz' > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from > 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz' > to > '/var/lib/mesos/slave/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0011/executors/hello__91922a16-889e-4e94-9dab-9f6754f091de/ > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > Failed to fetch > 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz': > Error downloading resource: Failed writing received data to disk/application > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > End fetcher log for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > E0505 02:14:32.213114 11440 fetcher.cpp:558] Failed to run mesos-fetcher: > Failed to fetch all URIs for container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' > with exit status: 256 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > E0505 02:14:32.213351 11444 slave.cpp:4642] Container > '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' for executor > 'hello__91922a16-889e-4e94-9dab-9f6754f091de' of framework > 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0011 failed to start: Failed to fetch > all URIs for container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' with exit > status: 256 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.213614 11443 containerizer.cpp:2071] Destroying container > 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac in FETCHING state > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.213977 11443 linux_launcher.cpp:505] Asked to destroy > container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.214757 11443 linux_launcher.cpp:548] Using freezer to destroy > cgroup mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.216047 11444 cgroups.cpp:2692] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.218407 11443 
cgroups.cpp:1405] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac after > 2.326016ms > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.220391 11445 cgroups.cpp:2710] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.222124 11445 cgroups.cpp:1434] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac after > 1.693952ms > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > E0505 02:14:32.239018 11441 fetcher.cpp:558] Failed to run mesos-fetcher: > Failed to create 'stdout' file: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > E0505 02:14:32.239162 11442 slave.cpp:4642] Container > 'a30b74d5-53ac-4fbf-b8f3-5cfba58ea847' for executor > 'node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05' of framework > 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008 failed to start: Failed to create > 'stdout' file: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.239284 11445 containerizer.cpp:2071] Destroying container > a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 in FETCHING state > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.239390 11444 linux_launcher.cpp:505] Asked to destroy > container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.240103 11444 linux_launcher.cpp:548] Using freezer to destroy > cgroup mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.241353 11440 cgroups.cpp:2692] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.243120 11444 cgroups.cpp:1405] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 after > 1.726976ms > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.245045 11440 cgroups.cpp:2710] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.246800 11440 cgroups.cpp:1434] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 after > 1.715968ms > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on 
device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.285477 11438 slave.cpp:1625] Got assigned task > 'dse-1-agent__720d6f09-9d60-4667-b224-abcd495e0e58' for framework > 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0009 > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > F0505 02:14:32.296481 11438 slave.cpp:6381] > CHECK_SOME(state::checkpoint(path, info)): Failed to create temporary file: > No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > *** Check failure stack trace: *** > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > @ 0x7f5856be857d google::LogMessage::Fail() > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > @ 0x7f5856bea3ad google::LogMessage::SendToLog() > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > @ 0x7f5856be816c google::LogMessage::Flush() > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > @ 0x7f5856beaca9 google::LogMessageFatal::~LogMessageFatal() > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > @ 0x7f5855e4b5e9 _CheckFatal::~_CheckFatal() > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.314082 11445 containerizer.cpp:2434] Container > 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac has exited > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.314826 11440 containerizer.cpp:2434] Container > a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 has exited > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: > Failed to write: No space left on device > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.316660 11439 container_assigner.cpp:101] Unregistering > container_id[value: "6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac"]. > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.316761 11474 container_assigner_strategy.cpp:202] Closing > ephemeral-port reader for container[value: > "6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac"] at endpoint[198.51.100.1:34273]. > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.316804 11474 container_reader_impl.cpp:38] Triggering > ContainerReader shutdown > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.316833 11474 sync_util.hpp:39] Dispatching and waiting <=5s > for ticket 7: ~ContainerReaderImpl:shutdown > May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: > I0505 02:14:32.316769 11439 container_assigner.cpp:101] Unregistering > container_id[value: "a30b74d5-53ac-4fbf-b8f3-5cfba58ea847"]. 
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.316864 11474 container_reader
> {noformat}
> In provisioner recovery, when listing the container rootfses, it is possible that the 'rootfses' dir does not exist, because of a possible race between the provisioner destroy and an agent restart: for instance, the agent restarts while the provisioner is destroying the container dir. Since os::rmdir() removes entries recursively by traversing the FTS tree, it is possible that the 'rootfses' dir has already been removed while the others (e.g., the scratch dir) have not. Currently, we return an error if the 'rootfses' dir does not exist, which blocks the agent from recovering. We should instead skip the container's rootfses listing if 'rootfses' does not exist.
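> A minimal sketch of the proposed guard, written as a stand-alone helper modeled loosely on the provisioner paths code (the name listRootfses, its signature, and the error text are illustrative assumptions, not the committed patch):
> {code}
> #include <list>
> #include <string>
>
> #include <stout/error.hpp>
> #include <stout/foreach.hpp>
> #include <stout/hashset.hpp>
> #include <stout/os.hpp>
> #include <stout/path.hpp>
> #include <stout/try.hpp>
>
> // Hypothetical helper: list the rootfs IDs under one backend dir during
> // provisioner recovery, tolerating a missing 'rootfses' dir.
> Try<hashset<std::string>> listRootfses(const std::string& backendDir)
> {
>   const std::string rootfsesDir = path::join(backendDir, "rootfses");
>
>   // A partially completed os::rmdir() may have removed 'rootfses' (but
>   // not the rest of the container dir) before the agent restarted, so a
>   // missing dir must not fail recovery; treat it as "no rootfses".
>   if (!os::exists(rootfsesDir)) {
>     return hashset<std::string>();
>   }
>
>   Try<std::list<std::string>> rootfsIds = os::ls(rootfsesDir);
>   if (rootfsIds.isError()) {
>     return Error("Unable to list the backend directory: " + rootfsIds.error());
>   }
>
>   hashset<std::string> result;
>   foreach (const std::string& rootfsId, rootfsIds.get()) {
>     result.insert(rootfsId);
>   }
>
>   return result;
> }
> {code}
> The key point is the os::exists() check ahead of os::ls(): recovery then treats a half-removed container dir as having no rootfses rather than aborting agent recovery outright.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)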