[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.

2017-08-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7471:
--
Priority: Critical  (was: Major)

> Provisioner recover should not always assume 'rootfses' dir exists.
> ---
>
> Key: MESOS-7471
> URL: https://issues.apache.org/jira/browse/MESOS-7471
> Project: Mesos
> Issue Type: Bug
> Components: provisioner
> Reporter: Gilbert Song
> Assignee: Gilbert Song
> Priority: Critical
> Labels: provisioner
> Fix For: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>
>
> The Mesos agent may restart for many reasons (e.g., a full disk). Always 
> assuming that the provisioner 'rootfses' dir exists can block agent recovery.
> {noformat}
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged 
> to container a30b74d5-53ac-4fbf-b8f3-5cfba58ea847: Unable to list the backend 
> directory: Failed to opendir 
> '/var/lib/mesos/slave/provisioner/containers/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/backends/overlay/rootfses':
>  No such file or directory
> {noformat}
> This issue may occur due to a race between the removal of the provisioner 
> container dir and an agent restart:
> {noformat}
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.058349 11441 linux_launcher.cpp:429] Launching container 
> a30b74d5-53ac-4fbf-b8f3-5cfba58ea847 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.072191 11441 systemd.cpp:96] Assigned child process '11577' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.075932 11439 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11577 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/6dd898d6-7f3a-406c-8ead-24b4d55ed262-0008/executors/node__fc5e0825-f10e-465c-a2e2-938b9dc3fe05/runs/a30b74d5-53ac-4fbf-b8f3-5cfba58ea847/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.081516 11438 linux_launcher.cpp:429] Launching container 
> 03a57a37-eede-46ec-8420-dda3cc54e2e0 and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.083516 11438 systemd.cpp:96] Assigned child process '11579' to 
> 'mesos_executors.slice'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.087345 11444 containerizer.cpp:1592] Checkpointing container's 
> forked pid 11579 to 
> '/var/lib/mesos/slave/meta/slaves/36a25adb-4ea2-49d3-a195-448cff1dc146-S34/frameworks/36a25adb-4ea2-49d3-a195-448cff1dc146-0002/executors/66897/runs/03a57a37-eede-46ec-8420-dda3cc54e2e0/pids/forked.pid'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[17142]: 
> Failed to write: No space left on device
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> W0505 02:14:32.213049 11440 fetcher.cpp:896] Begin fetcher log (stderr in 
> sandbox) for container 6aebb9e0-fd2c-4a42-b8f4-bd6ba11e9eac from running 
> command: 
> /opt/mesosphere/packages/mesos--aaedd03eee0d57f5c0d49c74ff1e5721862cad98/libexec/mesos/mesos-fetcher
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.006201 11561 fetcher.cpp:531] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/36a25adb-4ea2-49d3-a195-448cff1dc146-S34\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/downloads.mesosphere.com\/libmesos-bundle\/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz"}},{"action":"BYPASS_CACHE",
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009678 11561 fetcher.cpp:442] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009693 11561 fetcher.cpp:283] Fetching directly into the 
> sandbox directory
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009711 11561 fetcher.cpp:220] Fetching URI 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
> May 05 02:14:32 ip-172-31-7-83.us-west-2.compute.internal mesos-agent[11432]: 
> I0505 02:14:32.009723 11561 fetcher.cpp:163] Downloading resource from 
> 'https://downloads.mesosphere.com/libmesos-bundle/libmesos-bundle-1.9.0-rc2-1.2.0-rc2-1.tar.gz'
>  to 
> 
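
The failure described above comes from listing a 'rootfses' directory that may already be gone. The following is a minimal sketch of a listing step that tolerates that case, not the actual Mesos patch. It assumes C++17 std::filesystem instead of the Mesos/stout helpers, and the function name listRootfses is hypothetical. It opens the directory directly and treats "No such file or directory" as an empty result, which also avoids a separate existence check that could race with directory removal.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper (not part of Mesos): returns the names of the
// entries under `rootfsesDir`. A missing directory is reported as
// "no rootfses" instead of an error, so recovery can proceed.
std::vector<std::string> listRootfses(const fs::path& rootfsesDir)
{
  std::vector<std::string> ids;

  // Open the directory without throwing; inspect the error code instead.
  std::error_code ec;
  fs::directory_iterator it(rootfsesDir, ec);

  if (ec == std::errc::no_such_file_or_directory) {
    // The directory was never created or was already removed.
    return ids;
  }

  if (ec) {
    // Any other failure (e.g. permission denied) is still an error.
    throw fs::filesystem_error(
        "Unable to list the backend directory", rootfsesDir, ec);
  }

  for (const fs::directory_entry& entry : it) {
    ids.push_back(entry.path().filename().string());
  }

  return ids;
}
```

With this shape, a caller in a recovery path would get an empty list for a container whose 'rootfses' directory is missing and could continue with the remaining containers. Other errors still surface as exceptions.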

[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.

2017-05-12 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7471:

Fix Version/s: 1.3.0  (was: 1.3.1)


[jira] [Updated] (MESOS-7471) Provisioner recover should not always assume 'rootfses' dir exists.

2017-05-12 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7471:

Target Version/s: 1.1.2, 1.2.1, 1.3.0, 1.4.0  (was: 1.1.2, 1.2.1, 1.3.1, 
1.4.0)
