> The new cgroup namespace currently only allows for superficial
> interaction with the user namespace (it checks against the namespace
> it was created in whether or not a user has the right capabilities
> before allowing mounting, and things like that). However, there is one
> glaring feature that appears to be missing from the new cgroup
> namespace implementation: unprivileged user namespaces can't modify
> their sub-hierarchy. This is particularly frustrating for the
> containerisation community, where we are working on adding support for
> "rootless containers" in runC (the execution driver of Docker)[1]. It
> essentially means that we can't use cgroup resource limiting to limit
> *the resources of our own processes*. It also makes things like the
> freezer cgroup unusable.
>
> Here follows how I think we can solve this issue: the most obvious way
> of dealing with this would be (in the cgroupv1 view) to create a new
> subtree in every controller when you CLONE_NEWCGROUP. This new subtree
> is the root of the process's cgroup hierarchy. This doesn't affect any
> resource control, but it will result in the process only being able to
> affect its *own* resources. However, for cgroupv2 we have the "No
> Internal Process Constraint". So, maybe we could also move all of the
> other processes into a sibling subtree (with the *exact same* access
> permissions as the parent). Thus, the operation would look like this:
>
>   - C0 - P00
>        \ P01
>        \ P02 (about to setns)
>
> becomes
>
>   - C0 - C00 - P00
>        \     \ P01
>        \ C01 - P02
>
> But then we have C00 which is just a waste of cycles (it doesn't have
> any resource settings). So maybe there's some optimisation we can do
> there, but that's as far as I've gotten into thinking about how to
> deal with the constraints of cgroupv2. After that's been solved we can
> reuse how we store the user namespace the cgroup was created in
> (cgroup_namespace.user_ns), and just check that whatever user is
> trying to modify the cgroup has CAP_SYS_ADMIN in that user namespace.
>
> Do you think this would work? Are there any recommendations on whether
> we can make this work better? Also, can you clarify whether or not
> CLONE_NEWCGROUP only works for cgroupv2 or does it also work on
> cgroupv1 (we haven't yet transitioned to cgroupv2 in runC).
>
> Thanks.
>
> [1]: https://github.com/opencontainers/runc/pull/774
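
To make the proposed split concrete, here is a toy user-space model of
it in Python. It is only a sketch of the idea quoted above, not kernel
code: the Cgroup class, split_for_new_cgroupns(), may_modify() and the
"userns-of-P02" label are all made up for illustration, and owner_userns
merely stands in for cgroup_namespace.user_ns.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    procs: list = field(default_factory=list)
    children: list = field(default_factory=list)
    owner_userns: str = "init"  # hypothetical stand-in for cgroup_namespace.user_ns


def split_for_new_cgroupns(parent, proc, new_userns):
    """Move `proc` into a fresh child and all other processes into a sibling."""
    others = [p for p in parent.procs if p != proc]
    base = len(parent.children)
    # Sibling subtree: same owner (and so the same access) as the parent.
    sibling = Cgroup(f"{parent.name}{base}", procs=others,
                     owner_userns=parent.owner_userns)
    # New root of the hierarchy for the process entering the namespace.
    root = Cgroup(f"{parent.name}{base + 1}", procs=[proc],
                  owner_userns=new_userns)
    parent.procs = []  # cgroupv2: no processes left in an internal node
    parent.children += [sibling, root]
    return root


def may_modify(cgroup, caps_by_userns):
    """True if the caller holds CAP_SYS_ADMIN in the cgroup's owning userns."""
    return "CAP_SYS_ADMIN" in caps_by_userns.get(cgroup.owner_userns, set())


# The example from the diagram: C0 holds P00, P01 and P02.
c0 = Cgroup("C0", procs=["P00", "P01", "P02"])
c01 = split_for_new_cgroupns(c0, "P02", new_userns="userns-of-P02")

# An unprivileged user that is root only inside its own user namespace.
caller_caps = {"userns-of-P02": {"CAP_SYS_ADMIN"}}
```

In this model the caller would be allowed to modify C01 but not C00 or
C0, which is the behaviour I am after. C00 is still the "waste of
cycles" node mentioned above; the sketch does nothing to optimise it
away.
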

Does anyone have an opinion on this proposal?

--
Aleksa Sarai (cyphar)
www.cyphar.com