On Thu, May 07, 2015 at 12:40:41PM +0300, Cyrill Gorcunov wrote:
> On Thu, May 07, 2015 at 12:12:37PM +0300, Cyrill Gorcunov wrote:
> > > >
> > > > At the moment we don't, but looks like we need to add some check if
> > > > cgroup being modified is not a top one when write happens from
> > > > inside of container maybe?
> > >
> > > I guess so.
> > >
> > > Besides, I think we should not bind mount all cgroups inside any
> > > container, because allowing a container to create an arbitrary number of
> > > cgroups can affect the overall performance badly. IMO this should be
> > > configured in the config file of a container.
> >
> > I see, thanks. Let me think of it.
>
> We're creating cgroups for container on ve0 but bindmount them
> from inside of container, thus on userspace level (via config file)
> we can setup which cgroups are allowed for use. Still we're not
> limiting anyhow creating new sub-cgroups (via mkdir) inside
> container, and this one should be performance penalty mainly
> (new cgroup allocation is done via direct kzalloc without
> any memory limits as far as I understand).

Actually, it is accounted to memcg, just like any kmalloc, but the
problem isn't missing accounting. The problem is that the more features
we allow to be used from inside a container, the more types of kernel
objects a container can create, and the more potential security issues
we have. E.g. on reclaim the kernel walks over all memory cgroups, so a
container user can try to DoS the node by creating thousands of cgroups.

> Thus while we can limit the cgroup set itself, I don't see an easy way to limit
> nested cgroups/dirs without additional kernel modification. Ideas?

Let me clarify. Currently, we agreed on the following scheme:

 - There is a parameter in the config of a CT that says which
   controllers to bind mount inside the CT. By default, if there is no
   such parameter, the userspace mounts all cgroups except our
   home-brewed ones (ve, beancounter). Note that this concerns the
   userspace only; the kernel knows nothing about it.

 - If a cgroup is bind mounted, the user of the container can play with
   cgroups without any limitations. It is all about trust, in fact. If
   you cannot trust a container, just disable bind mounting altogether
   in the config.

 - There is one exception to the previous rule though. Even if we trust
   the container, we obviously don't want it to tweak its own
   parameters that are set via cgroups (e.g. its memory and swap
   limits), i.e. we should not allow it to write to files in its
   bind-mounted root. This should be enforced unconditionally by the
   kernel: just disallow processes inside ve != ve0 to write to files
   of any top-level cgroup.

Hope this clears things up.

One question still remains: what to do with the /proc/cgroups file. We
should hide cgroups that are not bind mounted inside the CT there. This
may be done by bind mounting this file itself. Again, up to the
userspace.
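
To make the scheme above a bit more concrete, here is a rough userspace
model of it in Python. It is only a sketch, not actual OpenVZ or kernel
code: all function names are hypothetical, the real write check would
live in the kernel, and treating an empty controller list as "bind
mounting disabled" is my assumption. The only real format used is the
tab-separated /proc/cgroups layout with its '#' header line.

```python
from typing import Iterable, List, Optional

# Home-brewed controllers that are not bind mounted by default.
HOME_BREWED = frozenset({"ve", "beancounter"})


def controllers_to_bind_mount(available: Iterable[str],
                              configured: Optional[Iterable[str]] = None) -> List[str]:
    """Userspace policy: which controllers to bind mount inside a CT.

    configured=None models "no such parameter in the config": mount
    everything except the home-brewed controllers. An empty list models
    an untrusted CT with bind mounting disabled (assumption).
    """
    avail = set(available)
    if configured is None:
        return sorted(avail - HOME_BREWED)
    return sorted(avail & set(configured))


def may_write(in_ve0: bool, rel_path: str) -> bool:
    """Model of the kernel rule: outside ve0, files of the top-level
    cgroup (the CT's bind-mounted root) are read-only.

    rel_path is the cgroup directory relative to that root (hypothetical
    representation); "" or "/" means the top-level cgroup itself.
    """
    is_top_level = rel_path.strip("/") == ""
    return in_ve0 or not is_top_level


def filter_proc_cgroups(text: str, visible: Iterable[str]) -> str:
    """Build the /proc/cgroups content to bind mount over the real file,
    keeping the header and only the controllers visible in the CT."""
    keep = set(visible)
    lines = []
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("#") or (fields and fields[0] in keep):
            lines.append(line)
    return "\n".join(lines) + "\n"
```

Note that nothing here limits mkdir inside a bind-mounted cgroup. That
is intentional: per the second point of the scheme, it is a matter of
trust, not of enforcement.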