[ https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Park updated MESOS-7471:
--------------------------------
    Fix Version/s:     (was: 1.3.1)
                   1.3.0

> Provisioner recover should not always assume 'rootfses' dir exists.
> -------------------------------------------------------------------
>
>                 Key: MESOS-7471
>                 URL: https://issues.apache.org/jira/browse/MESOS-7471
>             Project: Mesos
>          Issue Type: Bug
>          Components: provisioner
>            Reporter: Gilbert Song
>            Assignee: Gilbert Song
>              Labels: provisioner
>             Fix For: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>
>
> The mesos agent may restart for many reasons (e.g., disk full). Always 
> assuming that the provisioner 'rootfses' dir exists would block the agent 
> from recovering.
> {noformat}
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged 
> to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend 
> directory: Failed to opendir 
> '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses':
>  No such file or directory
> {noformat}
> This issue may occur due to a race between removing the provisioner 
> container dir and the agent restarting:
> {noformat}
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container 
> a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11577 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 
> 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11579 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in 
> sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running 
> command: 
> /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE",
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the 
> sandbox directory
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
>  to 
> '/var/lib/mesos/slave/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0011/executors/hello__91922a16-889e-4e94-9dab-9f6754f091de/
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> Failed to fetch 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz':
>  Error downloading resource: Failed writing received data to disk/application
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> End fetcher log for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> E0505 02:14:32.213114 11440 fetcher.cpp:558] Failed to run mesos-fetcher: 
> Failed to fetch all URIs for container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' 
> with exit status: 256
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> E0505 02:14:32.213351 11444 slave.cpp:4642] Container 
> '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' for executor 
> 'hello__91922a16-889e-4e94-9dab-9f6754f091de' of framework 
> 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0011 failed to start: Failed to fetch 
> all URIs for container '6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac' with exit 
> status: 256
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.213614 11443 containerizer.cpp:2071] Destroying container 
> 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac in FETCHING state
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.213977 11443 linux_launcher.cpp:505] Asked to destroy 
> container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.214757 11443 linux_launcher.cpp:548] Using freezer to destroy 
> cgroup mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.216047 11444 cgroups.cpp:2692] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.218407 11443 cgroups.cpp:1405] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac after 
> 2.326016ms
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.220391 11445 cgroups.cpp:2710] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.222124 11445 cgroups.cpp:1434] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac after 
> 1.693952ms
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> E0505 02:14:32.239018 11441 fetcher.cpp:558] Failed to run mesos-fetcher: 
> Failed to create 'stdout' file: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> E0505 02:14:32.239162 11442 slave.cpp:4642] Container 
> 'a30b74d5-53ac-4fbf-b8f3-5cfba58ea847' for executor 
> 'node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05' of framework 
> 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008 failed to start: Failed to create 
> 'stdout' file: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.239284 11445 containerizer.cpp:2071] Destroying container 
> a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 in FETCHING state
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.239390 11444 linux_launcher.cpp:505] Asked to destroy 
> container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.240103 11444 linux_launcher.cpp:548] Using freezer to destroy 
> cgroup mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.241353 11440 cgroups.cpp:2692] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.243120 11444 cgroups.cpp:1405] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 after 
> 1.726976ms
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.245045 11440 cgroups.cpp:2710] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.246800 11440 cgroups.cpp:1434] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 after 
> 1.715968ms
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.285477 11438 slave.cpp:1625] Got assigned task 
> 'dse-1-agent__720d6f09-9d60-4667-b224-abcd495e0e58' for framework 
> 6dd898d6-7f3a-406c-8ead-24b4d55ed262-0009
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> F0505 02:14:32.296481 11438 slave.cpp:6381] 
> CHECK_SOME(state::checkpoint(path, info)): Failed to create temporary file: 
> No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> *** Check failure stack trace: ***
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> @     0x7f5856be857d  google::LogMessage::Fail()
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> @     0x7f5856bea3ad  google::LogMessage::SendToLog()
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> @     0x7f5856be816c  google::LogMessage::Flush()
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> @     0x7f5856beaca9  google::LogMessageFatal::~LogMessageFatal()
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> @     0x7f5855e4b5e9  _CheckFatal::~_CheckFatal()
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.314082 11445 containerizer.cpp:2434] Container 
> 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac has exited
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.314826 11440 containerizer.cpp:2434] Container 
> a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 has exited
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.316660 11439 container_assigner.cpp:101] Unregistering 
> container_id[value: "6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac"].
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.316761 11474 container_assigner_strategy.cpp:202] Closing 
> ephemeral-port reader for container[value: 
> "6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac"] at endpoint[198.51.100.1:34273].
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.316804 11474 container_reader_impl.cpp:38] Triggering 
> ContainerReader shutdown
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.316833 11474 sync_util.hpp:39] Dispatching and waiting <=5s 
> for ticket 7: ~ContainerReaderImpl:shutdown
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.316769 11439 container_assigner.cpp:101] Unregistering 
> container_id[value: "a30b74d5-53ac-4fbf-b8f3-5cfba58ea847"].
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.316864 11474 container_reader
> {noformat}
> In provisioner recovery, when listing the container rootfses, it is possible 
> that the 'rootfses' dir does not exist, because of a possible race between 
> the provisioner destroy and an agent restart. For instance, the agent may 
> restart while the provisioner is destroying the container dir. Since 
> os::rmdir() removes the directory recursively by traversing the FTS tree, it 
> is possible that the 'rootfses' dir has already been removed while the 
> others (e.g., the scratch dir) have not. Currently, we return an error if 
> the 'rootfses' dir does not exist, which blocks the agent from recovering. 
> We should instead skip the container if 'rootfses' does not exist; see the 
> sketch below.
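> Below is a minimal sketch of the intended behavior. The helper name, 
> signature, and directory layout are illustrative assumptions based on the 
> error message above and the stout utilities (os::exists, os::ls); this is 
> not the exact upstream change.
> {noformat}
> // Sketch only: illustrative names, not the actual Mesos provisioner code.
> #include <list>
> #include <string>
>
> #include <mesos/mesos.hpp>
>
> #include <stout/error.hpp>
> #include <stout/foreach.hpp>
> #include <stout/hashset.hpp>
> #include <stout/os.hpp>
> #include <stout/path.hpp>
> #include <stout/try.hpp>
>
> using std::list;
> using std::string;
>
> Try<hashset<string>> listContainerRootfses(
>     const string& provisionerDir,
>     const mesos::ContainerID& containerId,
>     const string& backend)
> {
>   hashset<string> rootfsIds;
>
>   const string rootfsesDir = path::join(
>       provisionerDir,
>       "containers",
>       containerId.value(),
>       "backends",
>       backend,
>       "rootfses");
>
>   // The container dir may have been partially removed by a provisioner
>   // destroy that raced with an agent restart. Treat a missing 'rootfses'
>   // dir as "nothing to recover" instead of an error, so that agent
>   // recovery can proceed.
>   if (!os::exists(rootfsesDir)) {
>     return rootfsIds;
>   }
>
>   Try<list<string>> rootfses = os::ls(rootfsesDir);
>   if (rootfses.isError()) {
>     return Error(
>         "Unable to list the backend directory: " + rootfses.error());
>   }
>
>   foreach (const string& rootfsId, rootfses.get()) {
>     rootfsIds.insert(rootfsId);
>   }
>
>   return rootfsIds;
> }
> {noformat}
> With a check like this, a partially removed container dir no longer fails 
> the recovery collect; the leftover directories can be cleaned up afterwards.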



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
