[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrei Budnik reassigned MESOS-9116:
------------------------------------

    Assignee: Andrei Budnik

> Flaky processes within POD cause failures of launch nested container due to a
> bug in `mnt` namespace detection.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9116
>                 URL: https://issues.apache.org/jira/browse/MESOS-9116
>             Project: Mesos
>          Issue Type: Task
>          Components: agent, containerization
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: mesosphere
>
> A launch nested container call might fail with the following error:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such
> file or directory
> {code}
> This happens when the containerizer launcher [tries to
> enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892]
> the `mnt` namespace using the pid of a process that has already terminated. The pid [was
> detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958]
> by the agent before it spawned the containerizer launcher process, while the
> target process was still running.
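The failure described above is a time-of-check/time-of-use race: the agent checks `/proc/<pid>/ns/mnt` while the process is alive, but the launcher only opens it later, after the process may have exited. A minimal sketch of that window, using a temporary file as a stand-in for the `/proc/<pid>/ns/mnt` entry (the helper names are illustrative, not Mesos APIs):

{code:python}
import os
import tempfile

def detect_namespace_target(path):
    # Stands in for the agent's detection step: the entry exists *now*.
    return path if os.path.exists(path) else None

def enter_namespace(path):
    # Stands in for the launcher's later open()/setns(): by this point
    # the process, and hence its /proc entry, may already be gone.
    try:
        with open(path):
            return True
    except FileNotFoundError:
        return False

# Simulate the race: detection succeeds, then the "process" terminates
# in the window before the launcher runs.
fd, ns_path = tempfile.mkstemp()
os.close(fd)

target = detect_namespace_target(ns_path)  # agent: pid looks alive
os.unlink(ns_path)                         # "process" exits in the window
ok = enter_namespace(target)               # launcher: ENOENT, launch fails
{code}

Here `ok` ends up `False`, mirroring the "No such file or directory" error in the log above.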
> The issue can be reproduced using the following test (pseudocode):
> {code:java}
> launchTask("sleep 1000")
> parentContainerId = containerizer.containers().begin()
> outputs = []
> for i in range(10):
>     containerId = ContainerId()
>     containerId.parent = parentContainerId
>     containerId.id = UUID.random()
>     LAUNCH_NESTED_CONTAINER_SESSION(containerId, "echo echo")
>     response = ATTACH_CONTAINER_OUTPUT(containerId)
>     outputs.append(response.reader)
> for output in outputs:
>     stdout, stderr = getProcessIOData(output)
>     assert("echo" == stdout + stderr)
> {code}
> The given test stably reproduces the issue after adding `::usleep(500 *
> 1000);` before returning the result in
> [`getMountNamespaceTarget()`|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/utils.cpp#L89].
> The test reproduces the issue because `getMountNamespaceTarget()` detects
> not-yet-terminated processes left over from previous `LAUNCH_NESTED_CONTAINER`
> calls while attempting to launch a new nested container session.
> Another possible cause might be related to processes stuck in D-state within
> a POD.
> A flaky process within the POD can end up in D-state (unkillable). This
> process can be a child of a nested container process, so the agent doesn't
> know about it. Killing the nested container process doesn't mean that all of its
> child processes that are stuck in D-state are killed as well.
> For example, the nested container can be a custom executor which spawns its
> own tasks, which might end up in D-state. Another example is when the
> containerizer launcher
> [spawns|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L755-L768]
> shell scripts like `mount` during its initialization.
> In that case, the agent might detect a process which should already be
> terminated, but which still exists because it's in D-state.
> By the time the containerizer launcher is spawned, the stuck process might
> have disappeared, so we get the error.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
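One mitigation direction implied by the report is to re-validate each candidate pid immediately before entering its namespace, skipping pids whose `/proc` entries have already vanished. A hedged sketch of that re-validation loop (the helper names and the injectable `path_exists` hook are hypothetical, not the actual Mesos fix; note that a re-check only narrows the race window, it cannot eliminate it):

{code:python}
import os

def pid_ns_path(pid):
    # The mnt-namespace entry the launcher would try to open for `pid`.
    return "/proc/%d/ns/mnt" % pid

def pick_live_target(candidate_pids, path_exists=os.path.exists):
    # Walk the candidate pids (e.g. processes found inside the container)
    # and return the first one whose namespace entry still exists.
    # `path_exists` is injectable so the selection logic is testable
    # without a live Linux /proc.
    for pid in candidate_pids:
        path = pid_ns_path(pid)
        if path_exists(path):
            return pid, path
    return None, None

# Example against a fake view of /proc: pid 17 already exited, pid 42 is alive.
live = {"/proc/42/ns/mnt"}
pid, path = pick_live_target([17, 42], path_exists=lambda p: p in live)
{code}

Injecting `path_exists` keeps the Linux-only `/proc` dependency out of the selection logic, which is why the example can run (and be tested) on any platform.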