[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-02-24 Thread Rob Landley
On 01/06/2011 03:43 PM, Matt Helsley wrote:
 On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
 On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn serge.hal...@canonical.com 
 wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):
 On 01/05/2011 10:40 AM, Mike Hommey wrote:
 [Copy/pasted from a previous message to lkml, where it was suggested to
  try containers@]

 Hi,

 I noticed that from within a lxc container, writing 3 to
 /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
 little dangerous for VPS offerings that would be based on lxc, as the root
 user in one VPS instance could impact the overall performance of the host.
 I don't know about other containers but I've been told openvz isn't
 subject to this problem.
 I only tested the current Debian Squeeze kernel, which is based on
 2.6.32.27.

 There is definitely a lot of work to do with /proc.

 Some files should not be accessible (/proc/sys/vm/drop_caches,
 /proc/sys/kernel/sysrq, ...) and some others should be virtualized
 (/proc/meminfo, /proc/cpuinfo, ...).

 Serge suggested creating something similar to the cgroup device
 whitelist but for /proc; maybe that is a good approach for denying
 access to specific proc files.

 Long-term, user namespaces should fix this - /proc will be owned
 by the user namespace which mounted it, but we can tell proc to
 always have some files (like drop_caches) be owned by init_user_ns.

Changing ownership so a script can't open a file that it otherwise
could may cause scripts to fail when run in a container.  Makes the
containers less transparent.

 I'm hoping to push my final targeted capabilities prototype in the
 next few weeks, and after that I start seriously attacking VFS
 interaction.

 In the meantime, though, you can use SELinux/Smack; a custom cgroup
 file does sound useful too.  Can cgroups be modules nowadays?
 (I can't keep up)  If so, an out of tree proc-cgroup module seems
 like a good interim solution.


 Ideally, drop_caches should drop only the page cache in that container, but
 given that containers share a lot of page cache, what is suggested
 might be a good way to work around the problem.
 
 One gross hack that comes to mind: Instead of a hard permission model,
 limit the frequency with which the container could actually drop caches.
 Then the container's ability to interfere with host performance is more
 limited (but still non-zero). Or limit frequency on a per-user basis
 (more like Serge's design) because running more containers by a
 compromised user account shouldn't allow more frequent cache dropping.

Disk access causes at best multi-millisecond latency spikes, which can cause
a heavily loaded server to go into thrashing meltdown.  So a container
could screw up another container with this pretty badly.

The easy short-term fix is to make containers silently ignore writes to
drop_caches.
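
Roughly, that's a small check in the sysctl handler.  Here's an untested
sketch against a 2.6.32-era fs/drop_caches.c (drop_pagecache() and drop_slab()
are, I believe, the existing static helpers in that file; it needs
linux/pid_namespace.h, and using the pid namespace as a stand-in for "is this
a container" is itself an approximation):

int drop_caches_sysctl_handler(ctl_table *table, int write,
        void __user *buffer, size_t *length, loff_t *ppos)
{
        proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (write) {
                /* Accept the write but do nothing outside the initial
                 * pid namespace, so container scripts keep working. */
                if (task_active_pid_ns(current) != &init_pid_ns)
                        return 0;
                if (sysctl_drop_caches & 1)
                        drop_pagecache();
                if (sysctl_drop_caches & 2)
                        drop_slab();
        }
        return 0;
}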

 That said, the more important question is why should we provide
 drop_caches inside a container? My understanding is it's largely a
 workload-debugging tool and not something meant to truly solve
 problems.

A heavily loaded system that goes deep into swap without triggering
the OOM killer can become pretty useless.  My home laptop with 2 gigs
of ram gets so sluggish whenever I compile something that you can't
use the touchpad anymore because hitting the boundary of a widget
with the mouse pointer causes a 5-second freeze while it bounces
off three or four processes to handle the message, evicting yet more
pages to fault in the pages to handle the X events.  By the time
the pointer moves again it's way overshot.  (Ok, having firefox,
chrome, and kmail open with several dozen tabs open in each may have
something to do with this.)

When it does this, ctrl-alt-f1 and "echo 1 > /proc/sys/vm/drop_caches"
is just about the only thing that will snap it out of it short of
killing processes.  The system has ~600 megs of ram tied up in
disk cache while being so short of anonymous pages the mouse is
useless.

That doesn't necessarily apply to containers but that's one use case
of using it as a stick to hit the darn overburdened machine when it's
making stupid memory allocation decisions.  (Playing with swappiness
puts the OOM killer on a hair trigger, depending on kernel version
du jour.)

However, it's not guaranteed to do anything (the cached data could
be dirty, mmaped by some process, immediately faulted back in
by some other process), so ignoring writes to drop_caches from a
container is probably legal behavior anyway.
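
For what it's worth, the effect, or the lack of one, is easy to see from
userspace: read the Cached: line of /proc/meminfo before and after the write.
A minimal sketch, needs root, and the numbers are only indicative since other
activity moves them around too:

#include <stdio.h>

static long cached_kb(void)
{
        char line[256];
        long kb = -1;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "Cached: %ld kB", &kb) == 1)
                        break;
        fclose(f);
        return kb;
}

int main(void)
{
        long before = cached_kb(), after;
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

        if (!f) {
                perror("drop_caches");
                return 1;
        }
        fputs("1", f);          /* 1 = drop page cache only */
        fclose(f);
        after = cached_kb();
        printf("Cached: %ld kB -> %ld kB\n", before, after);
        return 0;
}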

Rob


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-02-24 Thread Rob Landley
On 01/07/2011 09:12 AM, Serge Hallyn wrote:
 Changing ownership so a script can't open a file that it otherwise
  could may cause scripts to fail when run in a container.  Makes
 the containers less transparent.
 
 While my goal next week is to make containers more transparent, the 
 official stance from kernel summit a few years ago was:  transparent
  containers are not a valid goal (as seen from kernel).

Do you have a reference for that?  I'm still coming up to speed on all this.  
Trying to collect documentation...

 A heavily loaded system that goes deep into swap without triggering
 the OOM killer can become pretty useless.  My home laptop with 2
 gigs
 
 Isn't a cgroup that controls both memory and swap access the right 
 answer to this?

There are other ways to work around it, sure.  (It's yet to be proven that they 
do actually work better in resource constrained desktop environments under 
real-world load, but they seem very promising.)

I was just pointing out that this has seen some use as a recovery mechanism, 
slightly less drastic than the OOM killer.  (Didn't say it was a _good_ use.  
Also, error avoidance and error recovery are different issues, and virtual 
memory is an inherently overcommitted resource domain.)

 (And do we have that now, btw?)

I think it's coming, rather than actually here.  (I thought the beancounters 
stuff was OpenVZ, controlled by syscalls that the kernel developers rejected.  
Have resource constraints on anything other than scheduler made it into vanilla 
yet?  If so, what's the UI to control them?)

By the way, from a UI perspective, most of the containers stuff I've seen so 
far is apparently aimed at big iron deployments (or attempts to make PC 
clusters look like mainframes, I.E. this cloud stuff).  I'm glad to see more 
diverse uses of it, but one of the downsides of cobbling together a mechanism 
from a dozen different unrelated pieces of infrastructure (clone flags, cgroup 
filesystem, extra mount flags on proc and such so they behave differently) is 
that we need a lot of documentation/example code/libraries to make it easy to 
use.  "You can do X" and "it's easy to reliably do X" have a gap that may
take a while to close...

Rob


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-11 Thread Serge Hallyn
Quoting Rob Landley (rland...@parallels.com):
 On 01/07/2011 09:12 AM, Serge Hallyn wrote:
  Changing ownership so a script can't open a file that it otherwise
   could may cause scripts to fail when run in a container.  Makes
  the containers less transparent.
  
  While my goal next week is to make containers more transparent, the 
  official stance from kernel summit a few years ago was:  transparent
   containers are not a valid goal (as seen from kernel).
 
 Do you have a reference for that?  I'm still coming up to speed on all this.  
 Trying to collect documentation...

Sorry, I don't offhand, and a quick google search wasn't helpful.  I think
it was from the very first containers discussion at ksummit, but not sure.
There is http://lwn.net/Articles/191923/.  Toward the bottom it claims that
no one thought it would be a problem to tweak distros to run in containers
without /sys and /proc.

But this was 2006, when pid namespaces were still a new idea, and no one
was actually using containers.  It certainly is possible that sentiment
has changed, which is why I do feel that it's worth it for someone to
try some native containerization inside fs/proc/*.c.  While user namespaces
should make it possible to make fuse proc filtering less wishy-washy, they
won't make it any less ugly :)

-serge


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-07 Thread Serge Hallyn
Quoting Rob Landley (rland...@parallels.com):
 On 01/06/2011 03:43 PM, Matt Helsley wrote:
  On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
  On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn serge.hal...@canonical.com 
  wrote:
  Quoting Daniel Lezcano (daniel.lezc...@free.fr):
  On 01/05/2011 10:40 AM, Mike Hommey wrote:
  [Copy/pasted from a previous message to lkml, where it was suggested to
   try contain...@]
 
  Hi,
 
  I noticed that from within a lxc container, writing 3 to
  /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
  little dangerous for VPS offerings that would be based on lxc, as the root
  user in one VPS instance could impact the overall performance of the host.
  I don't know about other containers but I've been told openvz isn't
  subject to this problem.
  I only tested the current Debian Squeeze kernel, which is based on
  2.6.32.27.
 
  There is definitely a lot of work to do with /proc.
 
  Some files should not be accessible (/proc/sys/vm/drop_caches,
  /proc/sys/kernel/sysrq, ...) and some others should be virtualized
  (/proc/meminfo, /proc/cpuinfo, ...).
 
  Serge suggested creating something similar to the cgroup device
  whitelist but for /proc; maybe that is a good approach for denying
  access to specific proc files.
 
  Long-term, user namespaces should fix this - /proc will be owned
  by the user namespace which mounted it, but we can tell proc to
  always have some files (like drop_caches) be owned by init_user_ns.
 
 Changing ownership so a script can't open a file that it otherwise
 could may cause scripts to fail when run in a container.  Makes the
 containers less transparent.

While my goal next week is to make containers more transparent, the
official stance from kernel summit a few years ago was:  transparent
containers are not a valid goal (as seen from kernel).

Not saying that what you're saying above is wrong, but I *do* argue
that 'silently ignoring the write' is more wrong than refusing the
write :)  Fooling userspace is a lose, imo.

Also, we can use a FUSE fs over proc to hide the files.  Doing that
now is insufficient because root in the container can just remount
proc over the filter.  But after user namespaces, root in the container
has the choice of leaving the filter in place for the sake of his own
userspace, or removing it and getting a bunch of files he can't use.

...

 A heavily loaded system that goes deep into swap without triggering
 the OOM killer can become pretty useless.  My home laptop with 2 gigs

Isn't a cgroup that controls both memory and swap access the right
answer to this?  (And do we have that now, btw?)

(I'm doing too many things at once so probably not thinking this
through enough)

-serge


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-06 Thread Matt Helsley
On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
 On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn serge.hal...@canonical.com 
 wrote:
  Quoting Daniel Lezcano (daniel.lezc...@free.fr):
  On 01/05/2011 10:40 AM, Mike Hommey wrote:
  [Copy/pasted from a previous message to lkml, where it was suggested to
    try contain...@]
  
  Hi,
  
  I noticed that from within a lxc container, writing 3 to
  /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
  little dangerous for VPS offerings that would be based on lxc, as the root
  user in one VPS instance could impact the overall performance of the host.
  I don't know about other containers but I've been told openvz isn't
  subject to this problem.
  I only tested the current Debian Squeeze kernel, which is based on
  2.6.32.27.
 
  There is definitely a lot of work to do with /proc.
 
  Some files should not be accessible (/proc/sys/vm/drop_caches,
  /proc/sys/kernel/sysrq, ...) and some others should be virtualized
  (/proc/meminfo, /proc/cpuinfo, ...).
 
  Serge suggested creating something similar to the cgroup device
  whitelist but for /proc; maybe that is a good approach for denying
  access to specific proc files.
 
  Long-term, user namespaces should fix this - /proc will be owned
  by the user namespace which mounted it, but we can tell proc to
  always have some files (like drop_caches) be owned by init_user_ns.
 
  I'm hoping to push my final targeted capabilities prototype in the
  next few weeks, and after that I start seriously attacking VFS
  interaction.
 
  In the meantime, though, you can use SELinux/Smack; a custom cgroup
  file does sound useful too.  Can cgroups be modules nowadays?
  (I can't keep up)  If so, an out of tree proc-cgroup module seems
  like a good interim solution.
 
 
 Ideally, drop_caches should drop only the page cache in that container, but
 given that containers share a lot of page cache, what is suggested
 might be a good way to work around the problem.

One gross hack that comes to mind: Instead of a hard permission model,
limit the frequency with which the container could actually drop caches.
Then the container's ability to interfere with host performance is more
limited (but still non-zero). Or limit frequency on a per-user basis
(more like Serge's design) because running more containers by a
compromised user account shouldn't allow more frequent cache dropping.
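
If anyone wanted to prototype the rate limit, the kernel already has a helper
for it.  A rough, untested sketch for a 2.6.32-era fs/drop_caches.c (this is
the single global version; the per-container or per-user variant I'm
describing would need a ratelimit_state per namespace or per user instead of
one static one, and it needs linux/ratelimit.h):

/* file scope: at most one effective drop per 60 seconds, burst of 1 */
static DEFINE_RATELIMIT_STATE(drop_caches_rs, 60 * HZ, 1);

        /* then in the write branch of drop_caches_sysctl_handler(),
         * before any work is done: */
        if (!__ratelimit(&drop_caches_rs))
                return 0;       /* over the limit: accept the write, skip the work */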

That said, the more important question is why should we provide
drop_caches inside a container? My understanding is it's largely a
workload-debugging tool and not something meant to truly solve
problems. If that's the case then we shouldn't provide it at all or it
should actually interfere with the host cache.

Cheers,
-Matt Helsley


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-06 Thread Dave Hansen
On Thu, 2011-01-06 at 13:43 -0800, Matt Helsley wrote:
 That said, the more important question is why should we provide
 drop_caches inside a container? My understanding is it's largely a
 workload-debugging tool and not something meant to truly solve
 problems. If that's the case then we shouldn't provide it at all or it
 should actually interfere with the host cache. 

Yeah, what's the problem that you're solving with drop_caches?  The odds
are, there's a better way.

That said, it _might_ be worth doing things like dropping (inode or
dentry) caches per-sb.  That's a much better fit than using big, ugly,
loosely-defined, system-wide knobs like drop_caches.

Also, unless we start giving containers real ownership of devices or
partitions, it's going to be pretty darn hard to let things clear caches
in a meaningful way.  What if one container wants an object cleared
while another doesn't?
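
Per-sb dropping would be new kernel work, but it's worth noting that the
nearest existing knob is per-file and already available from userspace:
posix_fadvise(POSIX_FADV_DONTNEED) asks the kernel to drop the clean,
unmapped page cache for one file.  A small standalone example, just to show
the finer-grained end of the spectrum:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int fd, err;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* DONTNEED only drops clean, unmapped pages; fsync() the file
         * first if it may have dirty data. */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        if (err)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        close(fd);
        return err ? 1 : 0;
}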

-- Dave



[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-06 Thread Matt Helsley
On Thu, Jan 06, 2011 at 01:50:05PM -0800, Dave Hansen wrote:
 On Thu, 2011-01-06 at 13:43 -0800, Matt Helsley wrote:
  That said, the more important question is why should we provide
  drop_caches inside a container? My understanding is it's largely a
  workload-debugging tool and not something meant to truly solve
  problems. If that's the case then we shouldn't provide it at all or it
  should actually interfere with the host cache. 
 
 Yeah, what's the problem that you're solving with drop_caches?  The odds
 are, there's a better way.
 
 That said, it _might_ be worth doing things like dropping (inode or
 dentry) caches per-sb.  That's a much better fit than using big, ugly,
 loosely-defined, system-wide knobs like drop_caches.

Yup. Since many containers will have their own mount namespaces with
separate sbs it's a more reasonable approximation of per-container
dropping of caches.

 
 Also, unless we start giving containers real ownership of devices or
 partitions, it's going to be pretty darn hard to let things clear caches
 in a meaningful way.  What if one container wants an object cleared
 while another doesn't?

Good point. First reaction: we'd want to keep it cached if any of the
containers want it. But even that's a bad policy under certain
circumstances that containers (aka VPS) might be used for.

Is drop_caches well-defined? IOW would it be permissible to
not actually drop all or any of the cache entries or to do nothing and
still report success instead of, say, EPERM, to a container?

Cheers,
-Matt Helsley


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-06 Thread Dave Hansen
On Thu, 2011-01-06 at 14:08 -0800, Matt Helsley wrote:
 Is drop_caches well-defined? IOW would it be permissible to
 not actually drop all or any of the cache entries or to do nothing and
 still report success instead of, say, EPERM, to a container?

It's really just a hint or a request.  It's possible that an

echo 3 > /proc/sys/vm/drop_caches

returns '2' (for the two bytes written), indicating success, and yet not
a single object was freed.  There's currently no way to tell how much
work it did, or to figure out why it did a certain amount of work.
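
Seen from userspace, the byte count really is the only feedback.  A trivial
demonstration (needs root):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        ssize_t n;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        n = write(fd, "3\n", 2);
        /* prints 2 on success; says nothing about how many objects,
         * if any, were actually freed */
        printf("write returned %zd\n", n);
        close(fd);
        return 0;
}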

Frankly, in a container, it probably just shouldn't even show up
in /proc.

-- Dave




[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-05 Thread Daniel Lezcano
On 01/05/2011 10:40 AM, Mike Hommey wrote:
 [Copy/pasted from a previous message to lkml, where it was suggested to
   try contain...@]

 Hi,

 I noticed that from within a lxc container, writing 3 to
 /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
 little dangerous for VPS offerings that would be based on lxc, as the root
 user in one VPS instance could impact the overall performance of the host.
 I don't know about other containers but I've been told openvz isn't
 subject to this problem.
 I only tested the current Debian Squeeze kernel, which is based on
 2.6.32.27.

There is definitely a lot of work to do with /proc.

Some files should not be accessible (/proc/sys/vm/drop_caches,
/proc/sys/kernel/sysrq, ...) and some others should be virtualized
(/proc/meminfo, /proc/cpuinfo, ...).

Serge suggested creating something similar to the cgroup device
whitelist but for /proc; maybe that is a good approach for denying access
to specific proc files.


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-05 Thread Serge Hallyn
Quoting Daniel Lezcano (daniel.lezc...@free.fr):
 On 01/05/2011 10:40 AM, Mike Hommey wrote:
 [Copy/pasted from a previous message to lkml, where it was suggested to
   try contain...@]
 
 Hi,
 
 I noticed that from within a lxc container, writing 3 to
 /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
 little dangerous for VPS offerings that would be based on lxc, as the root
 user in one VPS instance could impact the overall performance of the host.
 I don't know about other containers but I've been told openvz isn't
 subject to this problem.
 I only tested the current Debian Squeeze kernel, which is based on
 2.6.32.27.
 
 There is definitely a lot of work to do with /proc.
 
 Some files should not be accessible (/proc/sys/vm/drop_caches,
 /proc/sys/kernel/sysrq, ...) and some others should be virtualized
 (/proc/meminfo, /proc/cpuinfo, ...).
 
 Serge suggested creating something similar to the cgroup device
 whitelist but for /proc; maybe that is a good approach for denying
 access to specific proc files.

Long-term, user namespaces should fix this - /proc will be owned
by the user namespace which mounted it, but we can tell proc to
always have some files (like drop_caches) be owned by init_user_ns.

I'm hoping to push my final targeted capabilities prototype in the
next few weeks, and after that I start seriously attacking VFS
interaction.

In the meantime, though, you can use SELinux/Smack; a custom cgroup
file does sound useful too.  Can cgroups be modules nowadays?
(I can't keep up)  If so, an out of tree proc-cgroup module seems
like a good interim solution.

-serge


[Devel] Re: Containers and /proc/sys/vm/drop_caches

2011-01-05 Thread Balbir Singh
On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn serge.hal...@canonical.com wrote:
 Quoting Daniel Lezcano (daniel.lezc...@free.fr):
 On 01/05/2011 10:40 AM, Mike Hommey wrote:
 [Copy/pasted from a previous message to lkml, where it was suggested to
   try contain...@]
 
 Hi,
 
 I noticed that from within a lxc container, writing 3 to
 /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
 little dangerous for VPS offerings that would be based on lxc, as the root
 user in one VPS instance could impact the overall performance of the host.
 I don't know about other containers but I've been told openvz isn't
 subject to this problem.
 I only tested the current Debian Squeeze kernel, which is based on
 2.6.32.27.

 There is definitely a lot of work to do with /proc.
 
 Some files should not be accessible (/proc/sys/vm/drop_caches,
 /proc/sys/kernel/sysrq, ...) and some others should be virtualized
 (/proc/meminfo, /proc/cpuinfo, ...).
 
 Serge suggested creating something similar to the cgroup device
 whitelist but for /proc; maybe that is a good approach for denying
 access to specific proc files.

 Long-term, user namespaces should fix this - /proc will be owned
 by the user namespace which mounted it, but we can tell proc to
 always have some files (like drop_caches) be owned by init_user_ns.

 I'm hoping to push my final targeted capabilities prototype in the
 next few weeks, and after that I start seriously attacking VFS
 interaction.

 In the meantime, though, you can use SELinux/Smack; a custom cgroup
 file does sound useful too.  Can cgroups be modules nowadays?
 (I can't keep up)  If so, an out of tree proc-cgroup module seems
 like a good interim solution.


Ideally, drop_caches should drop only the page cache in that container, but
given that containers share a lot of page cache, what is suggested
might be a good way to work around the problem.

Balbir