[ https://issues.apache.org/jira/browse/MESOS-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282757#comment-16282757 ]
James Peach commented on MESOS-8313:
------------------------------------

{quote}
The other drawback is that we created another nanny process in addition to the one that'll perform pid 1 reaping.
{quote}

Right. Currently, the supervisor is optional and runs inside the container. In this proposal, there would always be a supervisor outside the container, though I think that the one inside the container would remain optional.

> Provide a host namespace container supervisor.
> ----------------------------------------------
>
>                 Key: MESOS-8313
>                 URL: https://issues.apache.org/jira/browse/MESOS-8313
>             Project: Mesos
>          Issue Type: Improvement
>          Components: containerization
>            Reporter: James Peach
>            Assignee: James Peach
>         Attachments: IMG_2629.JPG
>
>
> After more investigation on user namespaces, the current implementation of
> creating the container namespaces needs some adjustment before we can
> implement user namespaces in a usable fashion.
> The problems we need to address are:
> 1. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the PID namespace
> to mount {{procfs}}. Currently, this prevents containers from joining the
> host PID namespace. The workaround is to always create a new container PID
> namespace (as a child of the user namespace) with the {{namespaces/pid}}
> isolator.
> 2. The containerizer needs to hold {{CAP_SYS_ADMIN}} over the network
> namespace to mount {{sysfs}}. There's no general workaround for this since we
> can't generally require containers not to join the host network namespace.
> 3. The containerizer can't enter a user namespace after entering the
> {{chroot}}. This restriction means the existing order of containerizer
> operations cannot be kept when we want the executor to be in a new user
> namespace that has no children (i.e. to protect the container from a
> privileged task).
> After some discussion with [~jieyu], we believe that we can solve most or
> all of these issues by creating a new containerizer supervisor that runs
> fully outside the container and is responsible for constructing the rootfs
> mount namespace, launching the containerizer to enter the rest of the
> container, and waiting on the entered process.
> Since this new supervisor process is not running in the user namespace, it
> will be able to construct the container rootfs in a new mount namespace
> without user namespace restrictions. We can then clone a child to fully
> create and enter container namespaces along with the prefabricated rootfs
> mount namespace.
> The only drawback to this approach is that the container's mount namespace
> will be owned by the root user namespace rather than the container user
> namespace. We are OK with this for now.
> The plan here is to retain the existing {{mesos-containerizer launch}}
> subcommand and add a new {{mesos-containerizer supervise}} subcommand, which
> will be its parent process. This new subcommand will be used for the default
> executor and custom executor code paths.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
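
The parent/child relationship proposed in the issue (a supervisor that stays outside the container, starts the launch step as its child, and waits on it) might look roughly like the Python sketch below. This is not Mesos code: the function name {{supervise}} and the exit-code mapping (127 for a failed exec, 128 plus the signal number for a signalled child) are assumptions borrowed from shell conventions, and the privileged namespace and rootfs setup is deliberately left out and only marked by comments.

```python
import os


def supervise(argv):
    """Fork a child, exec argv in it, wait for it, and return an exit code.

    Hypothetical stand-in for the proposed `mesos-containerizer supervise`
    parent process; the real subcommand may behave differently.
    """
    # Assumption: a real supervisor would build the container rootfs in a
    # new mount namespace at this point, while still outside any user
    # namespace. That step needs privileges and is omitted here.
    pid = os.fork()
    if pid == 0:
        # Child: stands in for `mesos-containerizer launch`, which would
        # enter the remaining container namespaces before exec'ing.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Reached only if exec failed; skip the parent's cleanup handlers.
            os._exit(127)

    # Parent: remains outside the container and waits on the entered process.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return 128 + os.WTERMSIG(status)
    return 1
```

Calling {{supervise(["sleep", "1"])}} blocks until the child exits and returns its exit status. The relevant design point is that the waiting process never enters the container, so whatever it set up before the fork is not subject to the container's user namespace restrictions.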