[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.
[ https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-7471:
------------------------------
    Priority: Critical  (was: Major)

> Provisioner recover should not always assume 'rootfses' dir exists.
> -------------------------------------------------------------------
>
>                 Key: MESOS-7471
>                 URL: https://issues.apache.org/jira/browse/MESOS-7471
>             Project: Mesos
>          Issue Type: Bug
>          Components: provisioner
>            Reporter: Gilbert Song
>            Assignee: Gilbert Song
>            Priority: Critical
>              Labels: provisioner
>             Fix For: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>
> The Mesos agent may restart for many reasons (e.g., a full disk). Always
> assuming that the provisioner 'rootfses' dir exists can block the agent
> from recovering:
> {noformat}
> Failed to perform recovery: Collect failed: Unable to list rootfses belonging
> to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend
> directory: Failed to opendir
> '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses':
> No such file or directory
> {noformat}
> This issue may occur due to a race between removing the provisioner
> container dir and the agent restarting:
> {noformat}
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's forked pid 11577 to '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's forked pid 11579 to '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running command: /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE",
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the sandbox directory
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz' to
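The recovery failure above comes from unconditionally listing the 'rootfses' subdirectory. A minimal, self-contained C++17 sketch of the tolerant behavior is shown below; it uses std::filesystem rather than the actual Mesos provisioner code (which is built on the stout library), and the function name `listRootfses` and the paths are illustrative assumptions, not names from the Mesos source.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Sketch: list the rootfses recovered for one container backend dir.
// If the 'rootfses' directory is absent (e.g., the agent restarted in
// the middle of removing the provisioner container dir), treat that as
// "no rootfs provisioned" and return an empty list instead of failing
// the whole agent recovery.
std::vector<std::string> listRootfses(const fs::path& backendDir)
{
    const fs::path rootfsesDir = backendDir / "rootfses";

    std::vector<std::string> rootfses;

    // Tolerate a missing directory rather than erroring out.
    if (!fs::exists(rootfsesDir)) {
        return rootfses;
    }

    // Each subdirectory entry corresponds to one provisioned rootfs.
    for (const auto& entry : fs::directory_iterator(rootfsesDir)) {
        rootfses.push_back(entry.path().filename().string());
    }

    return rootfses;
}
```

Under this sketch, recovery would simply skip containers whose 'rootfses' dir was already removed, matching the intent stated in the issue title.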
[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.
[ https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Park updated MESOS-7471:
--------------------------------
    Fix Version/s:     (was: 1.3.1)
                   1.3.0
[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.
[ https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Park updated MESOS-7471:
--------------------------------
    Target Version/s: 1.1.2, 1.2.1, 1.3.0, 1.4.0  (was: 1.1.2, 1.2.1, 1.3.1, 1.4.0)