[Devel] Re: Containers and /proc/sys/vm/drop_caches
On 01/06/2011 03:43 PM, Matt Helsley wrote:
> On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
>> On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hal...@canonical.com> wrote:
>>> Quoting Daniel Lezcano (daniel.lezc...@free.fr):
>>>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>>>>> [Copy/pasted from a previous message to lkml, where it was
>>>>> suggested to try containers@]
>>>>>
>>>>> Hi, I noticed that from within an lxc container, writing 3 to
>>>>> /proc/sys/vm/drop_caches would flush the host page cache. That
>>>>> sounds a little dangerous for VPS offerings that would be based
>>>>> on lxc, as the root user in one VPS instance could impact the
>>>>> overall performance of the host. I don't know about other
>>>>> containers, but I've been told openvz isn't subject to this
>>>>> problem. I only tested the current Debian Squeeze kernel, which
>>>>> is based on 2.6.32.27.
>>>>
>>>> There is definitely a lot of work to do on /proc. Some files
>>>> should not be accessible (/proc/sys/vm/drop_caches,
>>>> /proc/sys/kernel/sysrq, ...) and some others should be virtualized
>>>> (/proc/meminfo, /proc/cpuinfo, ...). Serge suggested creating
>>>> something similar to the cgroup device whitelist but for /proc;
>>>> maybe that is a good approach for denying access to specific proc
>>>> files.
>>>
>>> Long-term, user namespaces should fix this - /proc will be owned by
>>> the user namespace which mounted it, but we can tell proc to always
>>> have some files (like drop_caches) be owned by init_user_ns.

Changing ownership so a script can't open a file that it otherwise could may cause scripts to fail when run in a container. Makes the containers less transparent.

>>> I'm hoping to push my final targeted capabilities prototype in the
>>> next few weeks, and after that I start seriously attacking VFS
>>> interaction. In the meantime, though, you can use SELinux/Smack, or
>>> a custom cgroup file does sound useful. Can cgroups be modules
>>> nowadays? (I can't keep up.) If so, an out-of-tree proc-cgroup
>>> module seems like a good interim solution.
>>
>> Ideally drop_caches should drop the page cache in that container,
>> but given that containers have a lot of shared page cache, what is
>> suggested might be a good way to work around the problem.
>
> One gross hack that comes to mind: instead of a hard permission model,
> limit the frequency with which the container could actually drop
> caches. Then the container's ability to interfere with host
> performance is more limited (but still non-zero). Or limit frequency
> on a per-user basis (more like Serge's design), because running more
> containers from a compromised user account shouldn't allow more
> frequent cache dropping.

Disk access causes at best multi-millisecond latency spikes, which can cause a heavily loaded server to go into thrashing meltdown. So a container could screw up another container with this pretty badly.

The easy short-term fix is to make containers silently ignore writes to drop_caches.

> That said, the more important question is why should we provide
> drop_caches inside a container? My understanding is it's largely a
> workload-debugging tool and not something meant to truly solve
> problems.

A heavily loaded system that goes deep into swap without triggering the OOM killer can become pretty useless. My home laptop with 2 gigs of ram gets so sluggish whenever I compile something that you can't use the touchpad anymore, because hitting the boundary of a widget with the mouse pointer causes a 5 second freeze while it bounces off three or four processes to handle the message, evicting yet more pages to fault in the pages to handle the X events. By the time the pointer moves again it's way overshot.

(Ok, having firefox, chrome, and kmail open with several dozen tabs open in each may have something to do with this.)

When it does this, ctrl-alt-f1 and echo 1 > /proc/sys/vm/drop_caches is just about the only thing that will snap it out of it short of killing processes. The system has ~600 megs of ram tied up in disk cache while being so short of anonymous pages the mouse is useless.

That doesn't necessarily apply to containers, but that's one use case: using it as a stick to hit the darn overburdened machine when it's making stupid memory allocation decisions. (Playing with swappiness puts the OOM killer on a hair trigger, depending on kernel version du jour.)

However, it's not guaranteed to do anything (the cached data could be dirty, mmaped by some process, or immediately faulted back in by some other process), so ignoring writes to drop_caches from a container is probably legal behavior anyway.

Rob
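For concreteness, the "silently ignore" fix is a two-line change to the sysctl handler. This is a hypothetical, untested sketch against the 2.6.32-era fs/drop_caches.c, using "not in the initial pid namespace" as a stand-in test for "inside a container" (the thread's long-term answer is user namespaces, not this check):

	/* Untested sketch of a modified fs/drop_caches.c handler, ~2.6.32.
	 * The namespace test is a placeholder for "inside a container". */
	#include <linux/pid_namespace.h>

	int drop_caches_sysctl_handler(struct ctl_table *table, int write,
				       void __user *buffer, size_t *length,
				       loff_t *ppos)
	{
		proc_dointvec_minmax(table, write, buffer, length, ppos);
		if (write) {
			/* Silently ignore the request from a container:
			 * report success to the writer but free nothing. */
			if (task_active_pid_ns(current) != &init_pid_ns)
				return 0;
			if (sysctl_drop_caches & 1)
				drop_pagecache();
			if (sysctl_drop_caches & 2)
				drop_slab();
		}
		return 0;
	}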
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On 01/07/2011 09:12 AM, Serge Hallyn wrote:
>> Changing ownership so a script can't open a file that it otherwise
>> could may cause scripts to fail when run in a container. Makes the
>> containers less transparent.
>
> While my goal next week is to make containers more transparent, the
> official stance from kernel summit a few years ago was: transparent
> containers are not a valid goal (as seen from the kernel).

Do you have a reference for that? I'm still coming up to speed on all this. Trying to collect documentation...

>> A heavily loaded system that goes deep into swap without triggering
>> the OOM killer can become pretty useless. My home laptop with 2 gigs
>
> Isn't a cgroup that controls both memory and swap access the right
> answer to this?

There are other ways to work around it, sure. (It's yet to be proven that they actually work better in resource-constrained desktop environments under real-world load, but they seem very promising.) I was just pointing out that this has seen some use as a recovery mechanism, slightly less drastic than the OOM killer. (Didn't say it was a _good_ use. Also, error avoidance and error recovery are different issues, and virtual memory is an inherently overcommitted resource domain.)

> (And do we have that now, btw?)

I think it's coming, rather than actually here. (I thought the beancounters stuff was OpenVZ, controlled by syscalls that the kernel developers rejected. Have resource constraints on anything other than the scheduler made it into vanilla yet? If so, what's the UI to control them?)

By the way, from a UI perspective, most of the containers stuff I've seen so far is apparently aimed at big iron deployments (or attempts to make PC clusters look like mainframes, i.e. this cloud stuff). I'm glad to see more diverse uses of it, but one of the downsides of cobbling together a mechanism from a dozen different unrelated pieces of infrastructure (clone flags, the cgroup filesystem, extra mount flags on proc and such so they behave differently) is that we need a lot of documentation/example code/libraries to make it easy to use. "You can do X" and "it's easy to reliably do X" have a gap that may take a while to close...

Rob
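For the record, the memory controller had landed in vanilla by this point, and the cgroup filesystem is the UI. A minimal sketch of capping both memory and memory+swap for a group, assuming the memory cgroup is mounted at /sys/fs/cgroup/memory and that swap accounting (the memory.memsw.* files) is compiled in and enabled; the group name "box", the limits, and the pid are made-up examples:

	/* Minimal sketch: cap a cgroup at 512 MB of memory and 1 GB of
	 * memory+swap via the cgroup filesystem.  Mount point, group
	 * name, limits, and pid are illustrative assumptions. */
	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/types.h>

	static int write_file(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");
		if (!f) {
			perror(path);
			return -1;
		}
		int rc = (fputs(val, f) == EOF) ? -1 : 0;
		if (fclose(f) != 0)
			rc = -1;
		return rc;
	}

	int main(void)
	{
		const char *grp = "/sys/fs/cgroup/memory/box";
		char path[256];

		mkdir(grp, 0755);	/* making a directory makes the group */

		snprintf(path, sizeof(path), "%s/memory.limit_in_bytes", grp);
		write_file(path, "536870912");		/* 512 MB of memory */

		snprintf(path, sizeof(path), "%s/memory.memsw.limit_in_bytes", grp);
		write_file(path, "1073741824");		/* 1 GB memory+swap */

		snprintf(path, sizeof(path), "%s/tasks", grp);
		write_file(path, "1234");	/* move a pid into the group */

		return 0;
	}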
[Devel] Re: Containers and /proc/sys/vm/drop_caches
Quoting Rob Landley (rland...@parallels.com):
> On 01/07/2011 09:12 AM, Serge Hallyn wrote:
>>> Changing ownership so a script can't open a file that it otherwise
>>> could may cause scripts to fail when run in a container. Makes the
>>> containers less transparent.
>>
>> While my goal next week is to make containers more transparent, the
>> official stance from kernel summit a few years ago was: transparent
>> containers are not a valid goal (as seen from the kernel).
>
> Do you have a reference for that? I'm still coming up to speed on all
> this. Trying to collect documentation...

Sorry, I don't offhand, and a quick google search wasn't helpful. I think it was from the very first containers discussion at ksummit, but I'm not sure. There is http://lwn.net/Articles/191923/. Toward the bottom it claims that no one thought it would be a problem to tweak distros to run in containers without /sys and /proc. But this was 2006, when pid namespaces were still a new idea and no one was actually using containers.

It certainly is possible that sentiment has changed, which is why I do feel it's worth it for someone to try some native containerization inside fs/proc/*.c. While user namespaces should make it possible to make FUSE proc filtering less wishy-washy, they won't make it any less ugly :)

-serge
[Devel] Re: Containers and /proc/sys/vm/drop_caches
Quoting Rob Landley (rland...@parallels.com):
> On 01/06/2011 03:43 PM, Matt Helsley wrote:
>> On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
>>> On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hal...@canonical.com> wrote:
>>>> Quoting Daniel Lezcano (daniel.lezc...@free.fr):
>>>>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>>>>>> [Copy/pasted from a previous message to lkml, where it was
>>>>>> suggested to try contain...@]
>>>>>>
>>>>>> Hi, I noticed that from within an lxc container, writing 3 to
>>>>>> /proc/sys/vm/drop_caches would flush the host page cache. That
>>>>>> sounds a little dangerous for VPS offerings that would be based
>>>>>> on lxc, as the root user in one VPS instance could impact the
>>>>>> overall performance of the host. I don't know about other
>>>>>> containers, but I've been told openvz isn't subject to this
>>>>>> problem. I only tested the current Debian Squeeze kernel, which
>>>>>> is based on 2.6.32.27.
>>>>>
>>>>> There is definitely a lot of work to do on /proc. Some files
>>>>> should not be accessible (/proc/sys/vm/drop_caches,
>>>>> /proc/sys/kernel/sysrq, ...) and some others should be
>>>>> virtualized (/proc/meminfo, /proc/cpuinfo, ...). Serge suggested
>>>>> creating something similar to the cgroup device whitelist but for
>>>>> /proc; maybe that is a good approach for denying access to
>>>>> specific proc files.
>>>>
>>>> Long-term, user namespaces should fix this - /proc will be owned
>>>> by the user namespace which mounted it, but we can tell proc to
>>>> always have some files (like drop_caches) be owned by
>>>> init_user_ns.
>
> Changing ownership so a script can't open a file that it otherwise
> could may cause scripts to fail when run in a container. Makes the
> containers less transparent.

While my goal next week is to make containers more transparent, the official stance from kernel summit a few years ago was: transparent containers are not a valid goal (as seen from the kernel).

Not saying that what you're saying above is wrong, but I *do* argue that 'silently ignoring the write' is more wrong than refusing the write :) Fooling userspace is a lose, imo.

Also, we can use a FUSE fs over proc to hide the files. Doing that now is insufficient because root in the container can just remount proc over the filter. But after user namespaces, root in the container has the choice of leaving the filter in place for the sake of his own userspace, or removing it and getting a bunch of files he can't use.

...

> A heavily loaded system that goes deep into swap without triggering
> the OOM killer can become pretty useless. My home laptop with 2 gigs

Isn't a cgroup that controls both memory and swap access the right answer to this? (And do we have that now, btw?)

(I'm doing too many things at once so probably not thinking this through enough)

-serge
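To make the FUSE-over-proc idea concrete, here is a minimal, hypothetical sketch against the FUSE 2.x API: a read-only filter that hides a hard-coded list of files rather than virtualizing them. The hidden-path list is an assumption, and write support, symlinks, and permissions are all omitted; a real version would also grab a handle on the real proc before mounting over it, to avoid looping back into itself:

	/* procfilter.c -- sketch of a read-only FUSE filter over /proc.
	 * Build: gcc procfilter.c -o procfilter $(pkg-config --cflags --libs fuse)
	 * Mount: ./procfilter /mnt/filtered-proc
	 * (As noted above, container root can simply remount proc over
	 * this until user namespaces land.) */
	#define FUSE_USE_VERSION 26
	#include <fuse.h>
	#include <dirent.h>
	#include <errno.h>
	#include <fcntl.h>
	#include <limits.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <unistd.h>

	static const char *hidden[] = {
		"/sys/vm/drop_caches",
		"/sys/kernel/sysrq",
		NULL,
	};

	static int is_hidden(const char *path)
	{
		int i;
		for (i = 0; hidden[i]; i++)
			if (strcmp(path, hidden[i]) == 0)
				return 1;
		return 0;
	}

	/* Translate a path in our mount into the real /proc path. */
	static void real_path(char *buf, size_t len, const char *path)
	{
		snprintf(buf, len, "/proc%s", path);
	}

	static int filter_getattr(const char *path, struct stat *st)
	{
		char rp[PATH_MAX];

		if (is_hidden(path))
			return -ENOENT;	/* pretend the file doesn't exist */
		real_path(rp, sizeof(rp), path);
		return lstat(rp, st) == -1 ? -errno : 0;
	}

	static int filter_open(const char *path, struct fuse_file_info *fi)
	{
		char rp[PATH_MAX];
		int fd;

		if (is_hidden(path))
			return -ENOENT;
		real_path(rp, sizeof(rp), path);
		fd = open(rp, O_RDONLY);	/* read-only filter */
		if (fd == -1)
			return -errno;
		fi->fh = fd;
		return 0;
	}

	static int filter_read(const char *path, char *buf, size_t size,
			       off_t off, struct fuse_file_info *fi)
	{
		ssize_t n = pread(fi->fh, buf, size, off);
		return n == -1 ? -errno : n;
	}

	static int filter_release(const char *path, struct fuse_file_info *fi)
	{
		close(fi->fh);
		return 0;
	}

	static int filter_readdir(const char *path, void *buf,
				  fuse_fill_dir_t fill, off_t off,
				  struct fuse_file_info *fi)
	{
		char rp[PATH_MAX], child[PATH_MAX];
		struct dirent *de;
		DIR *d;

		real_path(rp, sizeof(rp), path);
		d = opendir(rp);
		if (!d)
			return -errno;
		while ((de = readdir(d)) != NULL) {
			snprintf(child, sizeof(child), "%s/%s",
				 strcmp(path, "/") ? path : "", de->d_name);
			if (is_hidden(child))
				continue;	/* drop filtered names too */
			fill(buf, de->d_name, NULL, 0);
		}
		closedir(d);
		return 0;
	}

	static struct fuse_operations filter_ops = {
		.getattr = filter_getattr,
		.open    = filter_open,
		.read    = filter_read,
		.release = filter_release,
		.readdir = filter_readdir,
	};

	int main(int argc, char *argv[])
	{
		return fuse_main(argc, argv, &filter_ops, NULL);
	}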
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
> On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hal...@canonical.com> wrote:
>> Quoting Daniel Lezcano (daniel.lezc...@free.fr):
>>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>>>> [Copy/pasted from a previous message to lkml, where it was
>>>> suggested to try contain...@]
>>>>
>>>> Hi, I noticed that from within an lxc container, writing 3 to
>>>> /proc/sys/vm/drop_caches would flush the host page cache. That
>>>> sounds a little dangerous for VPS offerings that would be based on
>>>> lxc, as the root user in one VPS instance could impact the overall
>>>> performance of the host. I don't know about other containers, but
>>>> I've been told openvz isn't subject to this problem. I only tested
>>>> the current Debian Squeeze kernel, which is based on 2.6.32.27.
>>>
>>> There is definitely a lot of work to do on /proc. Some files should
>>> not be accessible (/proc/sys/vm/drop_caches,
>>> /proc/sys/kernel/sysrq, ...) and some others should be virtualized
>>> (/proc/meminfo, /proc/cpuinfo, ...). Serge suggested creating
>>> something similar to the cgroup device whitelist but for /proc;
>>> maybe that is a good approach for denying access to specific proc
>>> files.
>>
>> Long-term, user namespaces should fix this - /proc will be owned by
>> the user namespace which mounted it, but we can tell proc to always
>> have some files (like drop_caches) be owned by init_user_ns.
>>
>> I'm hoping to push my final targeted capabilities prototype in the
>> next few weeks, and after that I start seriously attacking VFS
>> interaction. In the meantime, though, you can use SELinux/Smack, or
>> a custom cgroup file does sound useful. Can cgroups be modules
>> nowadays? (I can't keep up.) If so, an out-of-tree proc-cgroup
>> module seems like a good interim solution.
>
> Ideally drop_caches should drop the page cache in that container, but
> given that containers have a lot of shared page cache, what is
> suggested might be a good way to work around the problem.

One gross hack that comes to mind: instead of a hard permission model, limit the frequency with which the container could actually drop caches. Then the container's ability to interfere with host performance is more limited (but still non-zero). Or limit frequency on a per-user basis (more like Serge's design), because running more containers from a compromised user account shouldn't allow more frequent cache dropping.

That said, the more important question is why should we provide drop_caches inside a container? My understanding is it's largely a workload-debugging tool and not something meant to truly solve problems. If that's the case then we shouldn't provide it at all, or it should actually interfere with the host cache.

Cheers,
-Matt Helsley
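The frequency-limiting hack could be prototyped with the kernel's existing ratelimit helper. Another hypothetical, untested sketch; a real version would hang one ratelimit_state off each container (or each user), rather than the single global state shown here, and the 60-second interval is an arbitrary example:

	/* Hypothetical sketch: throttle how often drop_caches actually
	 * does work.  A single global state is shown for brevity. */
	#include <linux/ratelimit.h>

	/* Allow at most one effective drop per 60 seconds. */
	static DEFINE_RATELIMIT_STATE(drop_caches_rs, 60 * HZ, 1);

	int drop_caches_sysctl_handler(struct ctl_table *table, int write,
				       void __user *buffer, size_t *length,
				       loff_t *ppos)
	{
		proc_dointvec_minmax(table, write, buffer, length, ppos);
		if (write) {
			/* __ratelimit() returns nonzero when we may
			 * proceed; over-frequent writes still "succeed"
			 * but free nothing. */
			if (!__ratelimit(&drop_caches_rs))
				return 0;
			if (sysctl_drop_caches & 1)
				drop_pagecache();
			if (sysctl_drop_caches & 2)
				drop_slab();
		}
		return 0;
	}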
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On Thu, 2011-01-06 at 13:43 -0800, Matt Helsley wrote:
> That said, the more important question is why should we provide
> drop_caches inside a container? My understanding is it's largely a
> workload-debugging tool and not something meant to truly solve
> problems. If that's the case then we shouldn't provide it at all, or
> it should actually interfere with the host cache.

Yeah, what's the problem that you're solving with drop_caches? The odds are, there's a better way.

That said, it _might_ be worth doing things like dropping (inode or dentry) caches per-sb. That's a much better fit than using big, ugly, loosely-defined, system-wide knobs like drop_caches.

Also, unless we start giving containers real ownership of devices or partitions, it's going to be pretty darn hard to let things clear caches in a meaningful way. What if one container wants an object cleared while another doesn't?

-- Dave
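For reference, the system-wide implementation already works superblock by superblock internally, so a per-sb variant is largely a matter of exposing the loop body. A hypothetical sketch in terms of the 2.6.32-era helpers; note drop_pagecache_sb() is static to fs/drop_caches.c today and would need exporting for this to build:

	/* Sketch: drop caches for a single super block instead of
	 * walking the global super_blocks list the way "echo 3" does. */
	#include <linux/fs.h>
	#include <linux/dcache.h>

	void drop_caches_sb(struct super_block *sb)
	{
		/* Throw out unused dentries (and the inodes they pin)... */
		shrink_dcache_sb(sb);
		/* ...then clean, unmapped page cache pages, this sb only. */
		drop_pagecache_sb(sb);
	}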
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On Thu, Jan 06, 2011 at 01:50:05PM -0800, Dave Hansen wrote:
> On Thu, 2011-01-06 at 13:43 -0800, Matt Helsley wrote:
>> That said, the more important question is why should we provide
>> drop_caches inside a container? My understanding is it's largely a
>> workload-debugging tool and not something meant to truly solve
>> problems. If that's the case then we shouldn't provide it at all, or
>> it should actually interfere with the host cache.
>
> Yeah, what's the problem that you're solving with drop_caches? The
> odds are, there's a better way.
>
> That said, it _might_ be worth doing things like dropping (inode or
> dentry) caches per-sb. That's a much better fit than using big, ugly,
> loosely-defined, system-wide knobs like drop_caches.

Yup. Since many containers will have their own mount namespaces with separate sbs, it's a more reasonable approximation of per-container dropping of caches.

> Also, unless we start giving containers real ownership of devices or
> partitions, it's going to be pretty darn hard to let things clear
> caches in a meaningful way. What if one container wants an object
> cleared while another doesn't?

Good point. First reaction: we'd want to keep it cached if any of the containers want it. But even that's a bad policy under certain circumstances that containers (aka VPS) might be used for.

Is drop_caches well-defined? IOW, would it be permissible to not actually drop all or any of the cache entries, or to do nothing and still report success instead of, say, EPERM, to a container?

Cheers,
-Matt Helsley
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On Thu, 2011-01-06 at 14:08 -0800, Matt Helsley wrote:
> Is drop_caches well-defined? IOW, would it be permissible to not
> actually drop all or any of the cache entries, or to do nothing and
> still report success instead of, say, EPERM, to a container?

It's really just a hint or a request. It's possible that an echo 3 > /proc/sys/vm/drop_caches returns '2' (for the two bytes written), indicating success, and yet not a single object was freed. There's currently no way to tell how much work it did, or to figure out why it did a certain amount of work.

Frankly, in a container, it probably just shouldn't even show up in /proc.

-- Dave
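That point is easy to demonstrate from userspace: the return value of the write only reflects the bytes consumed by the sysctl, never the amount of cache freed. A small standalone test (run as root):

	/* Shows that a drop_caches write reports only bytes written
	 * ("3\n" is 2 bytes), regardless of how much cache was freed. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
		if (fd == -1) {
			perror("open");
			return 1;
		}
		ssize_t n = write(fd, "3\n", 2);
		if (n == -1)
			perror("write");
		else
			printf("write returned %zd -- no hint of work done\n", n);
		close(fd);
		return 0;
	}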
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On 01/05/2011 10:40 AM, Mike Hommey wrote:
> [Copy/pasted from a previous message to lkml, where it was suggested
> to try contain...@]
>
> Hi, I noticed that from within an lxc container, writing 3 to
> /proc/sys/vm/drop_caches would flush the host page cache. That sounds
> a little dangerous for VPS offerings that would be based on lxc, as
> the root user in one VPS instance could impact the overall
> performance of the host. I don't know about other containers, but
> I've been told openvz isn't subject to this problem. I only tested
> the current Debian Squeeze kernel, which is based on 2.6.32.27.

There is definitely a lot of work to do on /proc. Some files should not be accessible (/proc/sys/vm/drop_caches, /proc/sys/kernel/sysrq, ...) and some others should be virtualized (/proc/meminfo, /proc/cpuinfo, ...). Serge suggested creating something similar to the cgroup device whitelist but for /proc; maybe that is a good approach for denying access to specific proc files.
[Devel] Re: Containers and /proc/sys/vm/drop_caches
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>> [Copy/pasted from a previous message to lkml, where it was suggested
>> to try contain...@]
>>
>> Hi, I noticed that from within an lxc container, writing 3 to
>> /proc/sys/vm/drop_caches would flush the host page cache. That
>> sounds a little dangerous for VPS offerings that would be based on
>> lxc, as the root user in one VPS instance could impact the overall
>> performance of the host. I don't know about other containers, but
>> I've been told openvz isn't subject to this problem. I only tested
>> the current Debian Squeeze kernel, which is based on 2.6.32.27.
>
> There is definitely a lot of work to do on /proc. Some files should
> not be accessible (/proc/sys/vm/drop_caches, /proc/sys/kernel/sysrq,
> ...) and some others should be virtualized (/proc/meminfo,
> /proc/cpuinfo, ...). Serge suggested creating something similar to
> the cgroup device whitelist but for /proc; maybe that is a good
> approach for denying access to specific proc files.

Long-term, user namespaces should fix this - /proc will be owned by the user namespace which mounted it, but we can tell proc to always have some files (like drop_caches) be owned by init_user_ns.

I'm hoping to push my final targeted capabilities prototype in the next few weeks, and after that I start seriously attacking VFS interaction. In the meantime, though, you can use SELinux/Smack, or a custom cgroup file does sound useful. Can cgroups be modules nowadays? (I can't keep up.) If so, an out-of-tree proc-cgroup module seems like a good interim solution.

-serge
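For context, the device whitelist Daniel refers to is driven through the devices cgroup's control files (devices.deny/devices.allow), and that interface is the model a proc whitelist would mimic. A minimal sketch of the existing interface; the mount point and group name are assumptions:

	/* Sketch of the existing devices-cgroup whitelist interface.
	 * Assumes the devices controller is mounted at
	 * /sys/fs/cgroup/devices; the group name is arbitrary.
	 * "a" in devices.deny means "all devices"; "c 1:3 rw" is
	 * read/write access to /dev/null (char major 1, minor 3). */
	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/types.h>

	static int write_file(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");
		if (!f) {
			perror(path);
			return -1;
		}
		int rc = (fputs(val, f) == EOF) ? -1 : 0;
		if (fclose(f) != 0)
			rc = -1;
		return rc;
	}

	int main(void)
	{
		mkdir("/sys/fs/cgroup/devices/container0", 0755);
		/* Start from "deny everything"... */
		write_file("/sys/fs/cgroup/devices/container0/devices.deny",
			   "a");
		/* ...then whitelist individual nodes, here /dev/null. */
		write_file("/sys/fs/cgroup/devices/container0/devices.allow",
			   "c 1:3 rw");
		return 0;
	}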
[Devel] Re: Containers and /proc/sys/vm/drop_caches
On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hal...@canonical.com> wrote:
> Quoting Daniel Lezcano (daniel.lezc...@free.fr):
>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>>> [Copy/pasted from a previous message to lkml, where it was
>>> suggested to try contain...@]
>>>
>>> Hi, I noticed that from within an lxc container, writing 3 to
>>> /proc/sys/vm/drop_caches would flush the host page cache. That
>>> sounds a little dangerous for VPS offerings that would be based on
>>> lxc, as the root user in one VPS instance could impact the overall
>>> performance of the host. I don't know about other containers, but
>>> I've been told openvz isn't subject to this problem. I only tested
>>> the current Debian Squeeze kernel, which is based on 2.6.32.27.
>>
>> There is definitely a lot of work to do on /proc. Some files should
>> not be accessible (/proc/sys/vm/drop_caches, /proc/sys/kernel/sysrq,
>> ...) and some others should be virtualized (/proc/meminfo,
>> /proc/cpuinfo, ...). Serge suggested creating something similar to
>> the cgroup device whitelist but for /proc; maybe that is a good
>> approach for denying access to specific proc files.
>
> Long-term, user namespaces should fix this - /proc will be owned by
> the user namespace which mounted it, but we can tell proc to always
> have some files (like drop_caches) be owned by init_user_ns.
>
> I'm hoping to push my final targeted capabilities prototype in the
> next few weeks, and after that I start seriously attacking VFS
> interaction. In the meantime, though, you can use SELinux/Smack, or
> a custom cgroup file does sound useful. Can cgroups be modules
> nowadays? (I can't keep up.) If so, an out-of-tree proc-cgroup module
> seems like a good interim solution.

Ideally drop_caches should drop the page cache in that container, but given that containers have a lot of shared page cache, what is suggested might be a good way to work around the problem.

Balbir