[ https://issues.apache.org/jira/browse/MESOS-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563794#comment-16563794 ]
Andrei Budnik commented on MESOS-9116:
--------------------------------------

(Note to myself)
The following test fails if [`getMountNamespaceTarget()`|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/utils.cpp#L59] is modified so that it always returns a parent pid:
{code:java}
NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
{code}

> Flaky processes within POD cause failures of launch nested container due to a bug in `mnt` namespace detection.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9116
>                 URL: https://issues.apache.org/jira/browse/MESOS-9116
>             Project: Mesos
>          Issue Type: Task
>          Components: agent, containerization
>            Reporter: Andrei Budnik
>            Priority: Major
>              Labels: mesosphere
>
> A launch nested container call might fail with the following error:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such file or directory
> {code}
> This happens when the containerizer launcher [tries to enter|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L879-L892] the `mnt` namespace using the pid of a terminated process. The pid [was detected|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/containerizer.cpp#L1930-L1958] by the agent before it spawned the containerizer launcher process, while the process was still running.
> The flaky process within the POD can end up in D-state (unkillable). This process can be a child of a nested container process, so the agent doesn't know about it. Killing the nested container process doesn't guarantee that its child processes stuck in D-state are killed as well.
> For example, the nested container can be a custom executor which spawns its own tasks, which might end up in D-state. Another example is when the containerizer launcher [spawns|https://github.com/apache/mesos/blob/077f122d52671412a2ab5d992d535712cc154002/src/slave/containerizer/mesos/launch.cpp#L755-L768] shell scripts like `mount` during its initialization.
> In that case, the agent might detect a process which should have already terminated, but which still exists because it's in D-state. By the time the containerizer launcher is spawned, the stuck process might have disappeared, so we get the error.
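
The failure described above can be reproduced in isolation. The following is a minimal Linux-only sketch, not Mesos code: `mntNamespacePath()`, `openMntNamespace()` and `spawnAndReapChild()` are hypothetical helpers introduced only for illustration. It assumes `/proc` is mounted and that the pid is not reused between reaping the child and opening the path. It shows that opening `/proc/<pid>/ns/mnt` for a pid that was valid a moment ago fails with `ENOENT` once the process is gone, which is the condition behind "No such file or directory".

```cpp
#include <cerrno>
#include <string>

#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of Mesos): builds the path that has to be
// opened in order to enter the `mnt` namespace of `pid`.
std::string mntNamespacePath(pid_t pid)
{
  return "/proc/" + std::to_string(pid) + "/ns/mnt";
}

// Hypothetical helper: returns an open file descriptor on success,
// or the negated errno value on failure.
int openMntNamespace(pid_t pid)
{
  int fd = ::open(mntNamespacePath(pid).c_str(), O_RDONLY | O_CLOEXEC);
  return fd >= 0 ? fd : -errno;
}

// Forks a child that exits immediately and reaps it, so that the returned
// pid refers to a process which no longer exists (assuming no pid reuse).
// Returns -1 if fork() fails.
pid_t spawnAndReapChild()
{
  pid_t pid = ::fork();
  if (pid == 0) {
    ::_exit(0); // Child: terminate right away.
  }
  if (pid > 0) {
    ::waitpid(pid, nullptr, 0); // Parent: reap, removing /proc/<pid>.
  }
  return pid;
}
```

Calling `openMntNamespace(spawnAndReapChild())` models the window between the agent detecting the pid and the launcher using it: the pid was valid when it was obtained, but the open fails afterwards.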