Re: [RFCv1 0/6] Page Detective

2024-11-19 Thread Roman Gushchin
On Tue, Nov 19, 2024 at 11:35:47AM -0800, Yosry Ahmed wrote:
> On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin
>  wrote:
> >
> > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin  
> > wrote:
> > >
> > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH  
> > > > wrote:
> > > > >
> > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > > > > Additionally, using crash/drgn is not feasible for us at this time;
> > > > > > it requires keeping external tools on our hosts, and it also requires
> > > > > > approval and a security review for each script before deployment in
> > > > > > our fleet.
> > > > >
> > > > > So it's ok to add a totally insecure kernel feature to your fleet
> > > > > instead?  You might want to reconsider that policy decision :)
> > > >
> > > > Hi Greg,
> > > >
> > > > While some risk is inherent, we believe the potential for abuse here
> > > > is limited, especially given the existing CAP_SYS_ADMIN requirement.
> > > > But, even with root access compromised, this tool presents a smaller
> > > > attack surface than alternatives like crash/drgn. It exposes less
> > > > sensitive information, unlike crash/drgn, which could potentially
> > > > allow reading all of kernel memory.
> > >
> > > The problem here is with using dmesg for output. No security-sensitive
> > > information should go there. Even exposing raw kernel pointers is not
> > > considered safe.
> >
> > I am OK with writing the output to a debugfs file in the next version;
> > the only concern I have is that it implies that dump_page() would need to
> > be basically duplicated, as it now outputs everything via printk's.
> 
> Perhaps you can refactor the code in dump_page() to use a seq_buf,
> then have dump_page() printk that seq_buf using seq_buf_do_printk(),
> and have page detective output that seq_buf to the debugfs file?
> 
> We do something very similar with memory_stat_format(). We use the
> same function to generate the memcg stats in a seq_buf, then we use
> that seq_buf to output the stats to memory.stat as well as the OOM
> log.

+1

Thanks!
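
A minimal sketch of the seq_buf approach suggested above, assuming
dump_page() can be split into one generator and two consumers;
__dump_page_to_buf(), dump_page_printk() and page_detective_show() are
hypothetical names, not actual kernel code:

    #include <linux/seq_buf.h>
    #include <linux/seq_file.h>
    #include <linux/mm.h>

    /* Single generator: format page state into a caller-provided seq_buf. */
    static void __dump_page_to_buf(struct page *page, struct seq_buf *sb)
    {
            seq_buf_printf(sb, "page: pfn:%lx refcount:%d\n",
                           page_to_pfn(page), page_ref_count(page));
            /* ... flags, mapping, memcg, etc. ... */
    }

    /* Existing consumer: dmesg, via seq_buf_do_printk(). */
    void dump_page_printk(struct page *page)
    {
            char buf[512];
            struct seq_buf sb;

            seq_buf_init(&sb, buf, sizeof(buf));
            __dump_page_to_buf(page, &sb);
            seq_buf_do_printk(&sb, KERN_WARNING);
    }

    /* New consumer: a debugfs file reusing the same generator. */
    static int page_detective_show(struct seq_file *m, void *v)
    {
            struct page *page = m->private; /* resolved earlier from user input */
            char buf[512];
            struct seq_buf sb;

            seq_buf_init(&sb, buf, sizeof(buf));
            __dump_page_to_buf(page, &sb);
            seq_printf(m, "%.*s", (int)seq_buf_used(&sb), buf);
            return 0;
    }

This mirrors the memory_stat_format() pattern: one formatter, multiple sinks.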



Re: [RFCv1 0/6] Page Detective

2024-11-19 Thread Roman Gushchin
On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote:
> On Mon, Nov 18, 2024 at 8:09 PM Greg KH  wrote:
> >
> > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote:
> > > Additionally, using crash/drgn is not feasible for us at this time; it
> > > requires keeping external tools on our hosts, and it also requires
> > > approval and a security review for each script before deployment in
> > > our fleet.
> >
> > So it's ok to add a totally insecure kernel feature to your fleet
> > instead?  You might want to reconsider that policy decision :)
> 
> Hi Greg,
> 
> While some risk is inherent, we believe the potential for abuse here
> is limited, especially given the existing CAP_SYS_ADMIN requirement.
> But, even with root access compromised, this tool presents a smaller
> attack surface than alternatives like crash/drgn. It exposes less
> sensitive information, unlike crash/drgn, which could potentially
> allow reading all of kernel memory.

The problem here is with using dmesg for output. No security-sensitive
information should go there. Even exposing raw kernel pointers is not
considered safe.

I'm also not sure about what presents a bigger attack surface. Yes,
drgn allows reading more, but it uses /proc/kcore, so the in-kernel
code is much simpler. But I don't think that discussion is relevant:
if a malicious user has root access, there are better options than
either drgn or page detective.



Re: [RFCv1 0/6] Page Detective

2024-11-18 Thread Roman Gushchin
On Sat, Nov 16, 2024 at 05:59:16PM +, Pasha Tatashin wrote:
> Page Detective is a new kernel debugging tool that provides detailed
> information about the usage and mapping of physical memory pages.
> 
> It is often known that a particular page is corrupted, but it is hard to
> extract more information about such a page from a live system. Examples
> are:
> 
> - Checksum failure during live migration
> - Filesystem journal failure
> - dump_page warnings on the console log
> - Unexpected segfaults
> 
> Page Detective helps to extract more information from the kernel, so it
> can be used by developers to root cause the associated problem.
> 
> It operates through the Linux debugfs interface, with two files: "virt"
> and "phys".
> 
> The "virt" file takes a virtual address and PID and outputs information
> about the corresponding page.
> 
> The "phys" file takes a physical address and outputs information about
> that page.
> 
> The output is presented via kernel log messages (can be accessed with
> dmesg), and includes information such as the page's reference count,
> mapping, flags, and memory cgroup. It also shows whether the page is
> mapped in the kernel page table, and if so, how many times.

This looks questionable both from the security and convenience points of view.
Given the request-response nature of the interface, the output can be
provided using a "normal" seq-based pseudo-file.

But I have a more generic question:
doesn't it make sense to implement it as a set of drgn scripts instead
of kernel code? This provides more flexibility, is safer (even if it's buggy,
you won't crash the host), and should, at least in theory, be equally
powerful.

Thanks!
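
On the mechanics: a rough sketch of what the proposed "virt" lookup could
do internally, resolving a (pid, virtual address) pair to a struct page and
handing it to dump_page(). This is illustrative only, not the RFC code;
note that GUP faults the page in, while the real tool would presumably walk
the page tables without side effects:

    static void page_detective_virt(int pid, unsigned long vaddr)
    {
            struct task_struct *task;
            struct mm_struct *mm;
            struct page *page;
            int locked = 1;
            long ret;

            task = find_get_task_by_vpid(pid);
            if (!task)
                    return;
            mm = get_task_mm(task);
            put_task_struct(task);
            if (!mm)
                    return;

            mmap_read_lock(mm);
            /* Caveat: faults the page in if it isn't currently mapped. */
            ret = get_user_pages_remote(mm, vaddr, 1, 0, &page, &locked);
            if (locked)
                    mmap_read_unlock(mm);
            mmput(mm);

            if (ret == 1) {
                    dump_page(page, "page detective");
                    put_page(page);
            }
    }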



Re: [PATCH v2 00/39] Memory allocation profiling

2023-10-24 Thread Roman Gushchin
On Tue, Oct 24, 2023 at 06:45:57AM -0700, Suren Baghdasaryan wrote:
> Updates since the last version [1]:
> - Simplified allocation tagging macros;
> - Runtime enable/disable sysctl switch (/proc/sys/vm/mem_profiling)
> instead of kernel command-line option;
> - CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT to select default enable state;
> - Changed the user-facing API from debugfs to procfs (/proc/allocinfo);
> - Removed context capture support to make patch incremental;
> - Renamed uninstrumented allocation functions to use _noprof suffix;
> - Added __GFP_LAST_BIT to make the code cleaner;
> - Removed lazy per-cpu counters; it turned out the memory savings were
> minimal and not worth the performance impact;

Hello Suren,

> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below is performance
> comparison between the baseline kernel, profiling when enabled, profiling
> when disabled and (for comparison purposes) baseline with
> CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> 
>                      kmalloc             pgalloc
> (1 baseline)         12.041s             49.190s
> (2 default disabled) 14.970s (+24.33%)   49.684s (+1.00%)
> (3 default enabled)  16.859s (+40.01%)   56.287s (+14.43%)
> (4 runtime enabled)  16.983s (+41.04%)   55.760s (+13.36%)
> (5 memcg)            33.831s (+180.96%)  51.433s (+4.56%)

some recent changes [1] to the kmem accounting should have made it quite a bit
faster. Would be great if you could provide new numbers for the comparison.
Maybe with the next revision?

And btw thank you (and Kent): your numbers inspired me to do this kmemcg
performance work. I expect it to still be ~twice as expensive as your
stuff, because on the memcg side we handle charging and statistics separately,
but hopefully the difference will be smaller.

Thank you!

[1]:
  patches from next tree, so no stable hashes:
mm: kmem: reimplement get_obj_cgroup_from_current()
percpu: scoped objcg protection
mm: kmem: scoped objcg protection
mm: kmem: make memcg keep a reference to the original objcg
mm: kmem: add direct objcg pointer to task_struct
mm: kmem: optimize get_obj_cgroup_from_current()
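
The in-kernel test described above can be approximated by a trivial module
like this sketch (illustrative only; the size and iteration count are
arbitrary, and the real test also covered get_free_page()/free_page() and
swept sizes from 8 to 240 bytes):

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/ktime.h>

    static int __init alloc_bench_init(void)
    {
            const int iters = 1000000;
            ktime_t start = ktime_get();
            int i;

            /* Time a tight kmalloc/kfree loop for one size class. */
            for (i = 0; i < iters; i++) {
                    void *p = kmalloc(64, GFP_KERNEL);

                    if (!p)
                            return -ENOMEM;
                    kfree(p);
            }
            pr_info("kmalloc/kfree x%d (64 bytes): %lld ns\n", iters,
                    ktime_to_ns(ktime_sub(ktime_get(), start)));
            return 0;
    }

    static void __exit alloc_bench_exit(void) { }

    module_init(alloc_bench_init);
    module_exit(alloc_bench_exit);
    MODULE_LICENSE("GPL");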



Re: [PATCH] docs: fix memory.low description in cgroup-v2.rst

2019-09-25 Thread Roman Gushchin
On Wed, Sep 25, 2019 at 12:56:04PM -0700, Jon Haslam wrote:
> The current cgroup-v2.rst file contains an incorrect description of when
> memory is reclaimed from a cgroup that is using the 'memory.low'
> mechanism. This fix simply corrects the text to reflect the actual
> implementation.
> 
> Fixes: 7854207fe954 ("mm/docs: describe memory.low refinements")
> Signed-off-by: Jon Haslam 

Acked-by: Roman Gushchin 

Thanks, Jon!

> ---
>  Documentation/admin-guide/cgroup-v2.rst | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> b/Documentation/admin-guide/cgroup-v2.rst
> index 0fa8c0e615c2..26d1cde6b34a 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1117,8 +1117,8 @@ PAGE_SIZE multiple when read back.
>  
>   Best-effort memory protection.  If the memory usage of a
>   cgroup is within its effective low boundary, the cgroup's
> - memory won't be reclaimed unless memory can be reclaimed
> - from unprotected cgroups.
> + memory won't be reclaimed unless there is no reclaimable
> + memory available in unprotected cgroups.
>  
>   Effective low boundary is limited by memory.low values of
>   all ancestor cgroups. If there is memory.low overcommitment
> @@ -1914,7 +1914,7 @@ Cpuset Interface Files
>  
>  It accepts only the following input values when written to.
>  
> -"root"   - a paritition root
> +"root"   - a partition root
>  "member" - a non-root member of a partition
>  
>   When set to be a partition root, the current cgroup is the
> -- 
> 2.17.1
> 


Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches

2019-07-02 Thread Roman Gushchin
On Tue, Jul 02, 2019 at 02:37:30PM -0400, Waiman Long wrote:
> Currently, a value of '1' is written to the /sys/kernel/slab//shrink
> file to shrink the slab by flushing all the per-cpu slabs and free
> slabs in partial lists. This applies only to the root caches, though.
> 
> Extend this capability by shrinking all the child memcg caches and
> the root cache when a value of '2' is written to the shrink sysfs file.
> 
> On a 4-socket 112-core 224-thread x86-64 system after a parallel kernel
> build, the amount of memory occupied by slabs before shrinking
> was:
> 
>  # grep task_struct /proc/slabinfo
>  task_struct     7114   7296   7744    4    8 : tunables    0    0    0 : slabdata   1824   1824      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:1310444 kB
>  SReclaimable: 377604 kB
>  SUnreclaim:   932840 kB
> 
> After shrinking slabs:
> 
>  # grep "^S[lRU]" /proc/meminfo
>  Slab: 695652 kB
>  SReclaimable: 322796 kB
>  SUnreclaim:   372856 kB
>  # grep task_struct /proc/slabinfo
>  task_struct     2262   2572   7744    4    8 : tunables    0    0    0 : slabdata    643    643      0
> 
> Signed-off-by: Waiman Long 

Acked-by: Roman Gushchin 

Thanks, Waiman!
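
The semantics being acked, as a condensed sketch; kmem_cache_shrink_all()
stands in for the patch's helper that also walks the memcg child caches
(illustrative, not the exact patch code):

    static ssize_t shrink_store(struct kmem_cache *s,
                                const char *buf, size_t length)
    {
            switch (buf[0]) {
            case '1':
                    kmem_cache_shrink(s);           /* root cache only */
                    break;
            case '2':
                    kmem_cache_shrink_all(s);       /* root cache + memcg children */
                    break;
            default:
                    return -EINVAL;
            }
            return length;
    }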


Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs

2019-06-28 Thread Roman Gushchin
On Fri, Jun 28, 2019 at 10:16:13AM -0700, Yang Shi wrote:
> On Fri, Jun 28, 2019 at 8:32 AM Christopher Lameter  wrote:
> >
> > On Thu, 27 Jun 2019, Roman Gushchin wrote:
> >
> > > so that objects belonging to different memory cgroups can share the same 
> > > page
> > > and kmem_caches.
> > >
> > > It's a fairly big change though.
> >
> > Could this be done at another level? Put a cgroup pointer into the
> > corresponding structures and then go back to just a single kmem_cache for
> > the system as a whole? You can still account them per cgroup and there
> > will be no cleanup problem anymore. You could scan through a slab cache
> > to remove the objects of a certain cgroup and then the fragmentation
> > problem that cgroups create here will be handled by the slab allocators in
> > the traditional way. The duplication of the kmem_cache was not designed
> > into the allocators but bolted on later.
> 
> I'm afraid this may bring in another problem for memcg page reclaim.
> When shrinking the slabs, the shrinker may end up scanning a very long
> list to find the slabs for a specific memcg. Particularly for the
> count operation, it may have to scan the list from the beginning all
> the way down to the end. It may take unbounded time.
> 
> When I worked on the THP deferred split shrinker problem, I did it
> like this, but it turned out it may take milliseconds to count the
> objects on the list while only needing to reclaim a few of them.

I don't think the shrinker mechanism should be altered. Shrinker lists
already contain individual objects, and I don't see any reason why
these objects can't reside on a shared set of pages.

What we're discussing is that it's way too costly (under some conditions)
to have many sets of kmem_caches if each of them contains only a
few objects.

Thanks!


Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs

2019-06-28 Thread Roman Gushchin
On Fri, Jun 28, 2019 at 03:32:28PM +, Christopher Lameter wrote:
> On Thu, 27 Jun 2019, Roman Gushchin wrote:
> 
> > so that objects belonging to different memory cgroups can share the same 
> > page
> > and kmem_caches.
> >
> > It's a fairly big change though.
> 
> Could this be done at another level? Put a cgroup pointer into the
> corresponding structures and then go back to just a single kmem_cache for
> the system as a whole?
> You can still account them per cgroup and there
> will be no cleanup problem anymore. You could scan through a slab cache
> to remove the objects of a certain cgroup and then the fragmentation
> problem that cgroups create here will be handled by the slab allocators in
> the traditional way. The duplication of the kmem_cache was not designed
> into the allocators but bolted on later.
> 

Yeah, this is exactly what I'm talking about. I don't know how big the performance
penalty will be for small and short-lived objects; it should be measured.
But for long-lived objects it will be much better for sure...

Thanks!


Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs

2019-06-27 Thread Roman Gushchin
On Thu, Jun 27, 2019 at 04:57:50PM -0400, Waiman Long wrote:
> On 6/26/19 4:19 PM, Roman Gushchin wrote:
> >>  
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> >> +  void __maybe_unused *arg)
> >> +{
> >> +  struct kmem_cache *s;
> >> +
> >> +  if (memcg == root_mem_cgroup)
> >> +  return;
> >> +  mutex_lock(&slab_mutex);
> >> +  list_for_each_entry(s, &memcg->kmem_caches,
> >> +  memcg_params.kmem_caches_node) {
> >> +  kmem_cache_shrink(s);
> >> +  }
> >> +  mutex_unlock(&slab_mutex);
> >> +  cond_resched();
> >> +}
> > A couple of questions:
> > 1) how about skipping already offlined kmem_caches? They are already shrunk,
> >so you probably won't get much out of them. Or isn't it true?
> 
> I have been thinking about that. This patch is based on the linux tree
> and so doesn't have an easy way to find out if the kmem caches have been
> shrunk. Rebasing this on top of linux-next, I can use the
> SLAB_DEACTIVATED flag as a marker for skipping the shrink.
> 
> With all the latest patches, I am still seeing 121 out of a total of 726
> memcg kmem caches (1/6) that are deactivated caches after system bootup
> on one of the test systems. My system is still using cgroup v1 and so the
> number may be different in a v2 setup. The next step is probably to
> figure out why those deactivated caches are still there.

It's not a secret: these kmem_caches are holding objects which are still in use.
It's a drawback of the current slab accounting implementation: every
object holds a whole page and the corresponding kmem_cache. It's optimized
for a large number of objects, which are created and destroyed within
the life of the cgroup (e.g. task_structs), and it works worse for long-lived
objects like the vfs cache.

Long-term I think we need a different implementation for long-lived objects,
so that objects belonging to different memory cgroups can share the same page
and kmem_caches.

It's a fairly big change though.

> 
> > 2) what's your long-term vision here? do you think that we need to shrink
>kmem_caches periodically, depending on memory pressure? how will a user
>use this new sysctl?
> Shrinking the kmem caches under extreme memory pressure can be one way
> to free up extra pages, but the effect will probably be temporary.
> > What's the problem you're trying to solve in general?
> 
> At least for the slub allocator, shrinking the caches allows the number
> of active objects reported in slabinfo to be more accurate. In addition,
> this allows us to know the real slab memory consumption. I have been working
> on a BZ about continuous memory leaks with container-based workloads.
> The ability to shrink caches allows us to get a more accurate memory
> consumption picture. Another alternative is to turn on slub_debug, which
> disables all the per-cpu slabs.

I see... I agree with Michal here that extending the drop_caches sysctl isn't
the best idea. Isn't it possible to achieve the same effect using slub sysfs?

Thanks!


Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs

2019-06-26 Thread Roman Gushchin
On Mon, Jun 24, 2019 at 01:42:19PM -0400, Waiman Long wrote:
> With the slub memory allocator, the numbers of active slab objects
> reported in /proc/slabinfo are not real because they include objects
> that are held by the per-cpu slab structures whether they are actually
> used or not.  The problem gets worse the more CPUs a system has. For
> instance, looking at the reported number of active task_struct objects,
> one will wonder where all the missing tasks have gone.
> 
> I know it is hard and costly to get a real count of active objects. So
> I am not advocating for that. Instead, this patch extends the
> /proc/sys/vm/drop_caches sysctl parameter by using a new bit (bit 3)
> to shrink all the kmem slabs which will flush out all the slabs in the
> per-cpu structures and give a more accurate view of how much memory is
> really used up by the active slab objects. This is a costly operation,
> of course, but it gives a way to have a clearer picture of the actual
> number of slab objects used, if the need arises.
> 
> The upper range of the drop_caches sysctl parameter is increased to 15
> to allow all possible combinations of the lowest 4 bits.
> 
> On a 2-socket 64-core 256-thread ARM64 system with 64k page size after
> a parallel kernel build, the amount of memory occupied by slabs before
> and after echoing to drop_caches were:
> 
>  # grep task_struct /proc/slabinfo
>  task_struct    48376  48434   4288   61    4 : tunables    0    0    0 : slabdata    794    794      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:3419072 kB
>  SReclaimable: 354688 kB
>  SUnreclaim:  3064384 kB
>  # echo 3 > /proc/sys/vm/drop_caches
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:3351680 kB
>  SReclaimable: 316096 kB
>  SUnreclaim:  3035584 kB
>  # echo 8 > /proc/sys/vm/drop_caches
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:1008192 kB
>  SReclaimable: 126912 kB
>  SUnreclaim:   881280 kB
>  # grep task_struct /proc/slabinfo
>  task_struct     2601   6588   4288   61    4 : tunables    0    0    0 : slabdata    108    108      0
> 
> Shrinking the slabs saves more than 2GB of memory in this case. This
> new feature certainly fulfills the promise of dropping caches.
> 
> Unlike counting objects in the per-node caches done by /proc/slabinfo
> which is rather lightweight, iterating all the per-cpu caches and
> shrinking them is much more heavyweight.
> 
> For this particular instance, the time taken to shrink all the root
> caches was about 30.2ms. There were 73 memory cgroups and the longest
> time taken for shrinking the largest one was about 16.4ms. The total
> shrinking time was about 101ms.
> 
> Because of the potentially long time to shrink all the caches, the
> slab_mutex was taken multiple times - once for all the root caches
> and once for each memory cgroup. This is to reduce the slab_mutex hold
> time to minimize the impact on other running applications that may need to
> acquire the mutex.
> 
> The slab shrinking feature is only available when CONFIG_MEMCG_KMEM is
> defined as the code needs to access slab_root_caches to iterate all the
> root caches.
> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/sysctl/vm.txt | 11 --
>  fs/drop_caches.c|  4 
>  include/linux/slab.h|  1 +
>  kernel/sysctl.c |  4 ++--
>  mm/slab_common.c| 44 +
>  5 files changed, 60 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 749322060f10..b643ac8968d2 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -207,8 +207,8 @@ Setting this to zero disables periodic writeback 
> altogether.
>  drop_caches
>  
>  Writing to this will cause the kernel to drop clean caches, as well as
> -reclaimable slab objects like dentries and inodes.  Once dropped, their
> -memory becomes free.
> +reclaimable slab objects like dentries and inodes.  It can also be used
> +to shrink the slabs.  Once dropped, their memory becomes free.
>  
>  To free pagecache:
>   echo 1 > /proc/sys/vm/drop_caches
> @@ -216,6 +216,8 @@ To free reclaimable slab objects (includes dentries and 
> inodes):
>   echo 2 > /proc/sys/vm/drop_caches
>  To free slab objects and pagecache:
>   echo 3 > /proc/sys/vm/drop_caches
> +To shrink the slabs:
> + echo 8 > /proc/sys/vm/drop_caches
>  
>  This is a non-destructive operation and will not free any dirty objects.
>  To increase the number of objects freed by this operation, the user may run
> @@ -223,6 +225,11 @@ To increase the number of objects freed by this 
> operation, the user may run
>  number of dirty objects on the system and create more candidates to be
>  dropped.
>  
> +Shrinking the slabs can reduce the memory footprint used by the slabs.
> +It also makes the number of active objects reported in /proc/slabinfo
> +more representative of the actual number of objects used.
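
For orientation, the described bit layout would slot into the existing
handler in fs/drop_caches.c roughly as in this sketch;
kmem_cache_shrink_all_caches() is an illustrative stand-in for the patch's
new shrinking routine:

    int drop_caches_sysctl_handler(struct ctl_table *table, int write,
                                   void __user *buffer, size_t *length,
                                   loff_t *ppos)
    {
            int ret = proc_dointvec_minmax(table, write, buffer, length, ppos);

            if (ret || !write)
                    return ret;
            if (sysctl_drop_caches & 1)     /* bit 0: clean page cache */
                    iterate_supers(drop_pagecache_sb, NULL);
            if (sysctl_drop_caches & 2)     /* bit 1: reclaimable slab objects */
                    drop_slab();
            if (sysctl_drop_caches & 8)     /* bit 3: flush per-cpu slabs (new) */
                    kmem_cache_shrink_all_caches();
            return 0;
    }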

[PATCH v8 7/7] cgroup: document cgroup v2 freezer interface

2019-02-19 Thread Roman Gushchin
Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin 
Reviewed-by: Mike Rapoport 
Cc: Tejun Heo 
Cc: linux-doc@vger.kernel.org
Cc: kernel-t...@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +
 1 file changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 61f8bbb0a1b2..78f078ddbe9c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -864,6 +864,8 @@ All cgroup core files are prefixed with "cgroup."
  populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+   1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
A read-write single value files.  The default is "max".
@@ -897,6 +899,31 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+   A read-write single value file which exists on non-root cgroups.
+   Allowed values are "0" and "1". The default is "0".
+
+   Writing "1" to the file causes freezing of the cgroup and all
+   descendant cgroups. This means that all belonging processes will
+   be stopped and will not run until the cgroup will be explicitly
+   unfrozen. Freezing of the cgroup may take some time; when this action
+   is completed, the "frozen" value in the cgroup.events control file
+   will be updated to "1" and the corresponding notification will be
+   issued.
+
+   A cgroup can be frozen either by its own settings, or by settings
+   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+   cgroup will remain frozen.
+
+   Processes in the frozen cgroup can be killed by a fatal signal.
+   They also can enter and leave a frozen cgroup: either by an explicit
+   move by a user, or if freezing of the cgroup races with fork().
+   If a process is moved to a frozen cgroup, it stops. If a process is
+   moved out of a frozen cgroup, it becomes running.
+
+   Frozen status of a cgroup doesn't affect any cgroup tree operations:
+   it's possible to delete a frozen (and empty) cgroup, as well as
+   create new sub-cgroups.
 
 Controllers
 ===
-- 
2.20.1
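
An illustrative user-space snippet exercising the interface documented
above: write "1" to cgroup.freeze, then poll cgroup.events until it reports
"frozen 1" (the interface is asynchronous; inotify on cgroup.events works
too). The cgroup path is an example:

    #include <stdio.h>
    #include <unistd.h>

    static int cg_frozen(const char *events_path)
    {
            char line[256];
            int frozen = -1;
            FILE *f = fopen(events_path, "r");

            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f))
                    if (sscanf(line, "frozen %d", &frozen) == 1)
                            break;
            fclose(f);
            return frozen;
    }

    int main(void)
    {
            FILE *f = fopen("/sys/fs/cgroup/test/cgroup.freeze", "w");

            if (!f)
                    return 1;
            fputs("1", f);
            fclose(f);

            /* Freezing is asynchronous; wait for the "frozen" event. */
            while (cg_frozen("/sys/fs/cgroup/test/cgroup.events") != 1)
                    usleep(10000);

            puts("cgroup is frozen");
            return 0;
    }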



[PATCH v7 7/7] cgroup: document cgroup v2 freezer interface

2019-02-12 Thread Roman Gushchin
Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin 
Reviewed-by: Mike Rapoport 
Cc: Tejun Heo 
Cc: linux-doc@vger.kernel.org
Cc: kernel-t...@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +
 1 file changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 61f8bbb0a1b2..78f078ddbe9c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -864,6 +864,8 @@ All cgroup core files are prefixed with "cgroup."
  populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+   1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
A read-write single value files.  The default is "max".
@@ -897,6 +899,31 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+   A read-write single value file which exists on non-root cgroups.
+   Allowed values are "0" and "1". The default is "0".
+
+   Writing "1" to the file causes freezing of the cgroup and all
+   descendant cgroups. This means that all belonging processes will
+   be stopped and will not run until the cgroup will be explicitly
+   unfrozen. Freezing of the cgroup may take some time; when this action
+   is completed, the "frozen" value in the cgroup.events control file
+   will be updated to "1" and the corresponding notification will be
+   issued.
+
+   A cgroup can be frozen either by its own settings, or by settings
+   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+   cgroup will remain frozen.
+
+   Processes in the frozen cgroup can be killed by a fatal signal.
+   They also can enter and leave a frozen cgroup: either by an explicit
+   move by a user, or if freezing of the cgroup races with fork().
+   If a process is moved to a frozen cgroup, it stops. If a process is
+   moved out of a frozen cgroup, it becomes running.
+
+   Frozen status of a cgroup doesn't affect any cgroup tree operations:
+   it's possible to delete a frozen (and empty) cgroup, as well as
+   create new sub-cgroups.
 
 Controllers
 ===
-- 
2.20.1



[PATCH v6 7/7] cgroup: document cgroup v2 freezer interface

2018-12-21 Thread Roman Gushchin
Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin 
Reviewed-by: Mike Rapoport 
Cc: Tejun Heo 
Cc: linux-doc@vger.kernel.org
Cc: kernel-t...@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +
 1 file changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 07e06136a550..f8335e26b362 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -864,6 +864,8 @@ All cgroup core files are prefixed with "cgroup."
  populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+   1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
A read-write single value files.  The default is "max".
@@ -897,6 +899,31 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+   A read-write single value file which exists on non-root cgroups.
+   Allowed values are "0" and "1". The default is "0".
+
+   Writing "1" to the file causes freezing of the cgroup and all
+   descendant cgroups. This means that all belonging processes will
+   be stopped and will not run until the cgroup will be explicitly
+   unfrozen. Freezing of the cgroup may take some time; when this action
+   is completed, the "frozen" value in the cgroup.events control file
+   will be updated to "1" and the corresponding notification will be
+   issued.
+
+   A cgroup can be frozen either by its own settings, or by settings
+   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+   cgroup will remain frozen.
+
+   Processes in the frozen cgroup can be killed by a fatal signal.
+   They also can enter and leave a frozen cgroup: either by an explicit
+   move by a user, or if freezing of the cgroup races with fork().
+   If a process is moved to a frozen cgroup, it stops. If a process is
+   moved out of a frozen cgroup, it becomes running.
+
+   Frozen status of a cgroup doesn't affect any cgroup tree operations:
+   it's possible to delete a frozen (and empty) cgroup, as well as
+   create new sub-cgroups.
 
 Controllers
 ===
-- 
2.19.2



Re: [PATCH v5 4/7] cgroup: cgroup v2 freezer

2018-12-20 Thread Roman Gushchin
On Thu, Dec 20, 2018 at 05:16:50PM +0100, Oleg Nesterov wrote:
> On 12/18, Roman Gushchin wrote:
> >
> > > > > > --- a/kernel/freezer.c
> > > > > > +++ b/kernel/freezer.c
> > > > > > @@ -134,7 +134,7 @@ bool freeze_task(struct task_struct *p)
> > > > > > return false;
> > > > > >
> > > > > > spin_lock_irqsave(&freezer_lock, flags);
> > > > > > -   if (!freezing(p) || frozen(p)) {
> > > > > > +   if (!freezing(p) || frozen(p) || cgroup_task_frozen(p)) {
> > > > > > spin_unlock_irqrestore(&freezer_lock, flags);
> > > > > > return false;
> > > > > > }
> > > > > >
> > > > > > --
> > > > > >
> > > > > > If the task is already frozen by the cgroup freezer, we don't have 
> > > > > > to do
> > > > > > anything additionally.
> > > > >
> > > > > I don't think so. A cgroup_task_frozen() task can be killed after
> > > > > try_to_freeze_tasks() succeeds, and the exiting task can close files,
> > > > > do IO, etc. Or it can be thawed by cgroup_freeze_task(false).
> > > > >
> > > > > In short, if try_to_freeze_tasks() succeeds, the caller has all rights
> > > > > to assume that nobody can escape from __refrigerator().
> > > >
> > > > But this is what we do with stopped and ptraced tasks, isn't it?
> > >
> > > No,
> > >
> > > > We do use freezable_schedule() and the system freezer just ignores such 
> > > > tasks.
> > >
> > >   static inline void freezable_schedule(void)
> > >   {
> > >   freezer_do_not_count();
> > >   schedule();
> > >   freezer_count();
> > >   }
> > >
> > > and note that freezer_count() calls try_to_freeze().
> > >
> > > IOW, the task sleeping in freezable_schedule() doesn't really differ from 
> > > the
> > > task sleeping in __refrigerator(). It is not that "the system freezer just
> > > ignores such tasks", it ignores them because it can safely count them as 
> > > frozen.
> >
> > Right, so the task is sleeping peacefully, and we know that it won't get
> > anywhere, because we'll catch it in freezer_count(). We allow it to sleep
> > there, we don't force it to __refrigerator(), and we treat it as frozen.
> >
> > How's that different to cgroup v2 freezer? If the task is frozen by cgroup 
> > v2
> > freezer, let it sleep there, and catch if it tries to escape. Exactly as it
> > works for SIGSTOP.
> >
> > Am I missing something?
> 
> Roman, perhaps we misunderstood each other...
> 
> I still think that the cgroup_task_frozen() check in freeze_task() you 
> proposed
> a) is not right, and b) it is not what we do with the STOPPED/TRACED tasks 
> which
> call freezable_schedule(). This is what I tried to say.
> 
> If you meant that freezer v2 can too use freezable_schedule() - I agree.

Sorry for the confusion. Yeah, what I'm saying is that freezable_schedule()
will work for v2 as well.

> 
> > So, you think that v2 freezer should follow the same approach, and allow 
> > tasks
> > sleeping on SIGSTOP, for instance, to be treated as frozen?
> > Hm, maybe. I have to think more here.
> 
> I think this would be nice. Otherwise, say, CGRP_FREEZE can never be reported
> if I read this code correctly. And this looks "symmetrical" with the fact that
> a ->frozen task reacts to SIGSTOP and it is still treated as frozen after 
> that.

Yeah, looks so. I'll try to implement this in v6.

Thanks!


Re: [PATCH v5 4/7] cgroup: cgroup v2 freezer

2018-12-18 Thread Roman Gushchin
On Tue, Dec 18, 2018 at 06:12:30PM +0100, Oleg Nesterov wrote:
> On 12/18, Roman Gushchin wrote:
> >
> > On Wed, Dec 12, 2018 at 06:49:02PM +0100, Oleg Nesterov wrote:
> > > > > and btw what about suspend? try_to_freeze_tasks() will obviously 
> > > > > fail
> > > > > if there is a ->frozen thread?
> > > >
> > > > I have to think a bit more here, but something like this will probably 
> > > > work:
> > > >
> > > > diff --git a/kernel/freezer.c b/kernel/freezer.c
> > > > index b162b74611e4..590ac4d10b02 100644
> > > > --- a/kernel/freezer.c
> > > > +++ b/kernel/freezer.c
> > > > @@ -134,7 +134,7 @@ bool freeze_task(struct task_struct *p)
> > > > return false;
> > > >
> > > > spin_lock_irqsave(&freezer_lock, flags);
> > > > -   if (!freezing(p) || frozen(p)) {
> > > > +   if (!freezing(p) || frozen(p) || cgroup_task_frozen(p)) {
> > > > spin_unlock_irqrestore(&freezer_lock, flags);
> > > > return false;
> > > > }
> > > >
> > > > --
> > > >
> > > > If the task is already frozen by the cgroup freezer, we don't have to do
> > > > anything additionally.
> > >
> > > I don't think so. A cgroup_task_frozen() task can be killed after
> > > try_to_freeze_tasks() succeeds, and the exiting task can close files,
> > > do IO, etc. Or it can be thawed by cgroup_freeze_task(false).
> > >
> > > In short, if try_to_freeze_tasks() succeeds, the caller has all rights
> > > to assume that nobody can escape from __refrigerator().
> >
> > But this is what we do with stopped and ptraced tasks, isn't it?
> 
> No,
> 
> > We do use freezable_schedule() and the system freezer just ignores such 
> > tasks.
> 
>   static inline void freezable_schedule(void)
>   {
>   freezer_do_not_count();
>   schedule();
>   freezer_count();
>   }
> 
> and note that freezer_count() calls try_to_freeze().
> 
> IOW, the task sleeping in freezable_schedule() doesn't really differ from the
> task sleeping in __refrigerator(). It is not that "the system freezer just
> ignores such tasks", it ignores them because it can safely count them as 
> frozen.

Right, so the task is sleeping peacefully, and we know that it won't get
anywhere, because we'll catch it in freezer_count(). We allow it to sleep
there, we don't force it to __refrigerator(), and we treat it as frozen.

How's that different to cgroup v2 freezer? If the task is frozen by cgroup v2
freezer, let it sleep there, and catch if it tries to escape. Exactly as it
works for SIGSTOP.

Am I missing something?

> 
> > > And what about TASK_STOPPED/TASK_TRACED tasks? They can not be frozen
> > > or thawed, right? This doesn't look good, and this differs from the
> > > current freezer controller...
> >
> > Good question!
> >
> > It looks like cgroup v1 freezer just ignores them, treating them as already
> > frozen,
> > which doesn't look nice.
> 
> Not sure I understand you, but see above... cgroup v1 freezer looks fine wrt
> stopped/traced tasks.

So, you think that v2 freezer should follow the same approach, and allow tasks
sleeping on SIGSTOP, for instance, to be treated as frozen?
Hm, maybe. I have to think more here.

Thank you!


Re: [PATCH v5 4/7] cgroup: cgroup v2 freezer

2018-12-17 Thread Roman Gushchin
On Wed, Dec 12, 2018 at 06:49:02PM +0100, Oleg Nesterov wrote:
> On 12/11, Roman Gushchin wrote:
> >
> > On Tue, Dec 11, 2018 at 05:26:32PM +0100, Oleg Nesterov wrote:
> > > On 12/07, Roman Gushchin wrote:
> > > >
> > > > Cgroup v2 freezer tries to put tasks into a state similar to jobctl
> > > > stop. This means that tasks can be killed, ptraced (using
> > > > PTRACE_SEIZE*), and interrupted. It is possible to attach to
> > > > a frozen task, get some information (e.g. read registers) and detach.
> > >
> > > I fail to understand how this all supposed to work.
> > >
> > > > @@ -368,6 +369,8 @@ static inline int signal_pending_state(long state, 
> > > > struct task_struct *p)
> > > > return 0;
> > > > if (!signal_pending(p))
> > > > return 0;
> > > > +   if (unlikely(cgroup_task_frozen(p) && p->jobctl == 
> > > > JOBCTL_TRAP_FREEZE))
> > > > +   return __fatal_signal_pending(p);
> > >
> > > I think I will never agree with this change ;) and I don't think it 
> > > actually helps.
> >
> > See below.
> >
> > >
> > > > +void cgroup_enter_frozen(void)
> > > > +{
> > > > +   if (!current->frozen) {
> > > > +   spin_lock_irq(&css_set_lock);
> > > > +   current->frozen = true;
> > > > +   cgroup_inc_frozen_cnt(task_dfl_cgroup(current), false, 
> > > > true);
> > > > +   spin_unlock_irq(&css_set_lock);
> > > > +   }
> > > > +
> > > > +   __set_current_state(TASK_INTERRUPTIBLE);
> > > > +   schedule();
> > >
> > > So once again, suppose it races with PTRACE_INTERRUPT, or SIGSTOP, or 
> > > something
> > > else which should be handled by get_signal() before do_freezer_trap().
> > >
> > > If (say) PTRACE_INTERRUPT comes before schedule it will be lost. Otherwise
> > > the frozen task will react. This can't be right. Or I am totally confused.
> >
> > Why?
> > PTRACE_INTERRUPT will set JOBCTL_TRAP_STOP, so signal_pending_state()
> > will return true, schedule() will return immediately, and we'll handle the 
> > trap.
> 
> OK, I misread the JOBCTL_TRAP_FREEZE check as "jobctl & JOBCTL_TRAP_FREEZE".
> 
> But p->jobctl == JOBCTL_TRAP_FREEZE doesn't look right either. For example,
> JOBCTL_STOP_DEQUEUED can be set. You probably need something like
> 
>   (jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) == JOBCTL_TRAP_FREEZE
> 
> And you need a barrier in between, iow you need 
> set_current_state(TASK_INTERRUPTIBLE).
> 
> But this doesn't really matter. I don't think you need to modify 
> signal_pending_state()
> and penalize schedule(). You can do something like
> 
>   spin_lock_irq(siglock);
>   if ((jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) == JOBCTL_TRAP_FREEZE &&
>       !__fatal_signal_pending(current))
>   {
>   __set_current_state(TASK_INTERRUPTIBLE);
>   clear_thread_flag(TIF_SIGPENDING);
>   }
>   spin_unlock_irq(siglock);
> 
>   schedule();
>   // recalc_sigpending() is not needed
> 
> in cgroup_enter_frozen() with the same effect. Which looks equally ugly and
> suboptimal, but at least this doesn't touch the sched code.

Gotcha. Will follow this approach in v6.

> 
> > > and btw what about suspend? try_to_freeze_tasks() will obviously fail
> > > if there is a ->frozen thread?
> >
> > I have to think a bit more here, but something like this will probably work:
> >
> > diff --git a/kernel/freezer.c b/kernel/freezer.c
> > index b162b74611e4..590ac4d10b02 100644
> > --- a/kernel/freezer.c
> > +++ b/kernel/freezer.c
> > @@ -134,7 +134,7 @@ bool freeze_task(struct task_struct *p)
> > return false;
> >
> > spin_lock_irqsave(&freezer_lock, flags);
> > -   if (!freezing(p) || frozen(p)) {
> > +   if (!freezing(p) || frozen(p) || cgroup_task_frozen(p)) {
> > spin_unlock_irqrestore(&freezer_lock, flags);
> > return false;
> > }
> >
> > --
> >
> > If the task is already frozen by the cgroup freezer, we don't have to do
> > anything additionally.
> 
> I don't think so. A cgroup_task_frozen() task can be killed after
> try_to_freeze_tasks() succeeds, and the exiting task can close files,
> do IO, etc. Or it can be thawed by cgroup_freeze_task(false).

Re: [PATCH v5 4/7] cgroup: cgroup v2 freezer

2018-12-11 Thread Roman Gushchin
On Tue, Dec 11, 2018 at 05:26:32PM +0100, Oleg Nesterov wrote:
> On 12/07, Roman Gushchin wrote:
> >
> > Cgroup v2 freezer tries to put tasks into a state similar to jobctl
> > stop. This means that tasks can be killed, ptraced (using
> > PTRACE_SEIZE*), and interrupted. It is possible to attach to
> > a frozen task, get some information (e.g. read registers) and detach.
> 
> I fail to understand how this all supposed to work.
> 
> > @@ -368,6 +369,8 @@ static inline int signal_pending_state(long state, 
> > struct task_struct *p)
> > return 0;
> > if (!signal_pending(p))
> > return 0;
> > +   if (unlikely(cgroup_task_frozen(p) && p->jobctl == JOBCTL_TRAP_FREEZE))
> > +   return __fatal_signal_pending(p);
> 
> I think I will never agree with this change ;) and I don't think it actually 
> helps.

See below.

> 
> > +void cgroup_enter_frozen(void)
> > +{
> > +   if (!current->frozen) {
> > +   spin_lock_irq(&css_set_lock);
> > +   current->frozen = true;
> > +   cgroup_inc_frozen_cnt(task_dfl_cgroup(current), false, true);
> > +   spin_unlock_irq(&css_set_lock);
> > +   }
> > +
> > +   __set_current_state(TASK_INTERRUPTIBLE);
> > +   schedule();
> 
> So once again, suppose it races with PTRACE_INTERRUPT, or SIGSTOP, or 
> something
> else which should be handled by get_signal() before do_freezer_trap().
> 
> If (say) PTRACE_INTERRUPT comes before schedule it will be lost. Otherwise
> the frozen task will react. This can't be right. Or I am totally confused.

Why?
PTRACE_INTERRUPT will set JOBCTL_TRAP_STOP, so signal_pending_state()
will return true, schedule() will return immediately, and we'll handle the trap.

> 
> Perhaps you can split this patch? start with cgroup_enter_frozen() using
> TASK_KILLABLE, then teach it to handle ptrace/stop/etc? I think this way it
> would be simpler to discuss the necessary changes and document what exactly
> are you trying to do.
> 
> and btw what about suspend? try_to_freeze_tasks() will obviously fail
> if there is a ->frozen thread?

I have to think a bit more here, but something like this will probably work:

diff --git a/kernel/freezer.c b/kernel/freezer.c
index b162b74611e4..590ac4d10b02 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -134,7 +134,7 @@ bool freeze_task(struct task_struct *p)
return false;
 
spin_lock_irqsave(&freezer_lock, flags);
-   if (!freezing(p) || frozen(p)) {
+   if (!freezing(p) || frozen(p) || cgroup_task_frozen(p)) {
spin_unlock_irqrestore(&freezer_lock, flags);
return false;
}

--

If the task is already frozen by the cgroup freezer, we don't have to do
anything additionally.

Thanks!


[PATCH v5 2/7] cgroup: implement __cgroup_task_count() helper

2018-12-07 Thread Roman Gushchin
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.

Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
---
 kernel/cgroup/cgroup-internal.h |  1 +
 kernel/cgroup/cgroup-v1.c   | 16 
 kernel/cgroup/cgroup.c  | 33 +
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c950864016e2..a195328431ce 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -226,6 +226,7 @@ int cgroup_rmdir(struct kernfs_node *kn);
 int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 struct kernfs_root *kf_root);
 
+int __cgroup_task_count(const struct cgroup *cgrp);
 int cgroup_task_count(const struct cgroup *cgrp);
 
 /*
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 51063e7a93c2..6134fef07d57 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -336,22 +336,6 @@ static struct cgroup_pidlist 
*cgroup_pidlist_find_create(struct cgroup *cgrp,
return l;
 }
 
-/**
- * cgroup_task_count - count the number of tasks in a cgroup.
- * @cgrp: the cgroup in question
- */
-int cgroup_task_count(const struct cgroup *cgrp)
-{
-   int count = 0;
-   struct cgrp_cset_link *link;
-
-   spin_lock_irq(&css_set_lock);
-   list_for_each_entry(link, &cgrp->cset_links, cset_link)
-   count += link->cset->nr_tasks;
-   spin_unlock_irq(&css_set_lock);
-   return count;
-}
-
 /*
  * Load a cgroup's pidarray with either procs' tgids or tasks' pids
  */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e06994fd4e34..7519a4307021 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -563,6 +563,39 @@ static void cgroup_get_live(struct cgroup *cgrp)
css_get(&cgrp->self);
 }
 
+/**
+ * __cgroup_task_count - count the number of tasks in a cgroup. The caller
+ * is responsible for taking the css_set_lock.
+ * @cgrp: the cgroup in question
+ */
+int __cgroup_task_count(const struct cgroup *cgrp)
+{
+   int count = 0;
+   struct cgrp_cset_link *link;
+
+   lockdep_assert_held(&css_set_lock);
+
+   list_for_each_entry(link, &cgrp->cset_links, cset_link)
+   count += link->cset->nr_tasks;
+
+   return count;
+}
+
+/**
+ * cgroup_task_count - count the number of tasks in a cgroup.
+ * @cgrp: the cgroup in question
+ */
+int cgroup_task_count(const struct cgroup *cgrp)
+{
+   int count;
+
+   spin_lock_irq(&css_set_lock);
+   count = __cgroup_task_count(cgrp);
+   spin_unlock_irq(&css_set_lock);
+
+   return count;
+}
+
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
-- 
2.17.2



[PATCH v5 0/7] freezer for cgroup v2

2018-12-07 Thread Roman Gushchin
This patchset implements freezer for cgroup v2.

It provides similar functionality as v1 freezer, but the interface
conforms to the cgroup v2 interface design principles, and it
provides a better user experience: tasks can be killed, ptrace works,
there is no separate controller, which has to be enabled, etc.

Patches (1), (2) and (3) are preparatory work, patch (4) contains
the implementation, patch (5) is a small cgroup kselftest fix,
patch (6) adds 6 new kselftests covering the freezer
functionality, and patch (7) adds the corresponding docs.

v5->v4:
  - reworked cgroup state transition code (suggested by Tejun Heo)
  - look at JOBCTL_TRAP_FREEZE instead of task->frozen in
recalc_sigpending(), check for task->frozen and JOBCTL_TRAP_FREEZE
in signal_pending_state() (suggested by Oleg Nesterov)
  - some cosmetic changes in signal.c (suggested by Oleg Nesterov)
  - cleaned up comments

v4->v3:
  - reading nr_descendants doesn't require taking css_set_lock anymore
  - fixed docs based on Mike Rapoport's feedback
  - fixed double irq lock found by Dan Carpenter

v3->v2:
  - dropped TASK_FROZEN for now, frozen tasks are put into TASK_INTERRUPTIBLE
  state; it's probably not the final version, but the API question can be
  discussed separately
  - don't clear TIF_SIGPENDING before going to sleep, instead add
  task->frozen check in signal_pending_state() and recalc_sigpending()
  - cgroup-level counter are now synchronized using css_set_lock,
  which simplified the whole code (e.g. per-cgroup works were removed)
  - the amount of comments increased significantly
  - many other improvements incorporating feedback from Tejun and Oleg

v2->v1:
  - fixed locking aroung calling cgroup_freezer_leave()
  - added docs

Roman Gushchin (7):
  cgroup: rename freezer.c into legacy_freezer.c
  cgroup: implement __cgroup_task_count() helper
  cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
  cgroup: cgroup v2 freezer
  kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
  kselftests: cgroup: add freezer controller self-tests
  cgroup: document cgroup v2 freezer interface

 Documentation/admin-guide/cgroup-v2.rst   |  27 +
 include/linux/cgroup-defs.h   |  33 +
 include/linux/cgroup.h|  42 ++
 include/linux/sched.h |   2 +
 include/linux/sched/jobctl.h  |   2 +
 include/linux/sched/signal.h  |   3 +
 kernel/cgroup/Makefile|   4 +-
 kernel/cgroup/cgroup-internal.h   |   1 +
 kernel/cgroup/cgroup-v1.c |  16 -
 kernel/cgroup/cgroup.c| 145 +++-
 kernel/cgroup/freezer.c   | 634 ++--
 kernel/cgroup/legacy_freezer.c| 481 
 kernel/signal.c   |  55 +-
 tools/testing/selftests/cgroup/.gitignore |   1 +
 tools/testing/selftests/cgroup/Makefile   |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  85 ++-
 tools/testing/selftests/cgroup/cgroup_util.h  |   7 +
 tools/testing/selftests/cgroup/test_freezer.c | 685 ++
 18 files changed, 1794 insertions(+), 431 deletions(-)
 create mode 100644 kernel/cgroup/legacy_freezer.c
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

-- 
2.17.2



[PATCH v5 6/7] kselftests: cgroup: add freezer controller self-tests

2018-12-07 Thread Roman Gushchin
This patch implements six tests for the freezer controller for
cgroup v2:
1) a simple test, which aims to freeze and unfreeze a cgroup with 100
processes
2) a more complicated tree test, which creates a hierarchy of cgroups,
puts some processes in some cgroups, and tries to freeze and unfreeze
different parts of the subtree
3) a forkbomb test: the test aims to freeze a forkbomb running in a
cgroup, kill all tasks in the cgroup and remove the cgroup without
unfreezing it.
4) rmdir test: the test creates two nested cgroups, freezes the parent
one, checks that the child can be successfully removed, and a new
child can be created
5) migration tests: the test checks migration of a task between
frozen cgroups: from a frozen to a running, from a running to a
frozen, and from a frozen to a frozen.
6) ptrace test: the test checks that it's possible to attach to
a process in a frozen cgroup, get some information and detach, and
the cgroup will remain frozen.

Expected output:

  $ ./test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_rmdir
  ok 5 test_cgfreezer_migrate
  ok 6 test_cgfreezer_ptrace

Signed-off-by: Roman Gushchin 
Cc: Shuah Khan 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux-kselft...@vger.kernel.org
---
 tools/testing/selftests/cgroup/.gitignore |   1 +
 tools/testing/selftests/cgroup/Makefile   |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  81 ++-
 tools/testing/selftests/cgroup/cgroup_util.h  |   7 +
 tools/testing/selftests/cgroup/test_freezer.c | 685 ++
 5 files changed, 775 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

diff --git a/tools/testing/selftests/cgroup/.gitignore 
b/tools/testing/selftests/cgroup/.gitignore
index adacda50a4b2..7f9835624793 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -1,2 +1,3 @@
 test_memcontrol
 test_core
+test_freezer
diff --git a/tools/testing/selftests/cgroup/Makefile 
b/tools/testing/selftests/cgroup/Makefile
index 23fbaa4a9630..8d369b6a2069 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -5,8 +5,10 @@ all:
 
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_core
+TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c
 $(OUTPUT)/test_core: cgroup_util.c
+$(OUTPUT)/test_freezer: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c 
b/tools/testing/selftests/cgroup/cgroup_util.c
index eba06f94433b..e9cdad673901 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -74,6 +74,16 @@ char *cg_name_indexed(const char *root, const char *name, 
int index)
return ret;
 }
 
+char *cg_control(const char *cgroup, const char *control)
+{
+   size_t len = strlen(cgroup) + strlen(control) + 2;
+   char *ret = malloc(len);
+
+   snprintf(ret, len, "%s/%s", cgroup, control);
+
+   return ret;
+}
+
 int cg_read(const char *cgroup, const char *control, char *buf, size_t len)
 {
char path[PATH_MAX];
@@ -196,7 +206,59 @@ int cg_create(const char *cgroup)
return mkdir(cgroup, 0644);
 }
 
-static int cg_killall(const char *cgroup)
+int cg_for_all_procs(const char *cgroup, int (*fn)(int pid, void *arg),
+void *arg)
+{
+   char buf[PAGE_SIZE];
+   char *ptr = buf;
+   int ret;
+
+   if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+   return -1;
+
+   while (ptr < buf + sizeof(buf)) {
+   int pid = strtol(ptr, &ptr, 10);
+
+   if (pid == 0)
+   break;
+   if (*ptr)
+   ptr++;
+   else
+   break;
+   ret = fn(pid, arg);
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
+int cg_wait_for_proc_count(const char *cgroup, int count)
+{
+   char buf[10 * PAGE_SIZE] = {0};
+   int attempts;
+   char *ptr;
+
+   for (attempts = 10; attempts >= 0; attempts--) {
+   int nr = 0;
+
+   if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+   break;
+
+   for (ptr = buf; *ptr; ptr++)
+   if (*ptr == '\n')
+   nr++;
+
+   if (nr >= count)
+   return 0;
+
+   usleep(10);
+   }
+
+   return -1;
+}
+
+int cg_killall(const char *cgroup)
 {
char buf[PAGE_SIZE];
char *ptr = buf;
@@ -238,6 +300,14 @@ int cg_destroy(const char *cgroup)
return ret;
 }
 
+int cg_enter(const char *cgroup, int pid)
+{
+   char pidbuf[64];
+
+   snprintf(pidbuf, sizeof(pidbuf), "%d", pid);
+   return 

[PATCH v5 3/7] cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock

2018-12-07 Thread Roman Gushchin
The number of descendant cgroups and the number of dying
descendant cgroups are currently synchronized using the cgroup_mutex.

The number of descendant cgroups will be required by the cgroup v2
freezer, which will use it to determine if a cgroup is frozen
(depending on total number of descendants and number of frozen
descendants). It's not always acceptable to grab the cgroup_mutex,
especially from quite hot paths (e.g. exit()).

To avoid this, let's additionally synchronize these counters using
the css_set_lock.

So, it's safe to read these counters with either cgroup_mutex or
css_set_lock locked, and to change them both locks should be acquired.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
---
 include/linux/cgroup-defs.h | 5 +
 kernel/cgroup/cgroup.c  | 6 ++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8fcbae1b8db0..03355d7008ff 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -348,6 +348,11 @@ struct cgroup {
 * Dying cgroups are cgroups which were deleted by a user,
 * but are still existing because someone else is holding a reference.
 * max_descendants is a maximum allowed number of descent cgroups.
+*
+* nr_descendants and nr_dying_descendants are protected
+* by cgroup_mutex and css_set_lock. It's fine to read them holding
+* any of cgroup_mutex and css_set_lock; for writing both locks
+* should be held.
 */
int nr_descendants;
int nr_dying_descendants;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7519a4307021..f89dde50f693 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4723,9 +4723,11 @@ static void css_release_work_fn(struct work_struct *work)
if (cgroup_on_dfl(cgrp))
cgroup_rstat_flush(cgrp);
 
+   spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
 tcgrp = cgroup_parent(tcgrp))
tcgrp->nr_dying_descendants--;
+   spin_unlock_irq(&css_set_lock);
 
cgroup_idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
cgrp->id = -1;
@@ -4943,12 +4945,14 @@ static struct cgroup *cgroup_create(struct cgroup 
*parent)
if (ret)
goto out_psi_free;
 
+   spin_lock_irq(&css_set_lock);
for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
 
if (tcgrp != cgrp)
tcgrp->nr_descendants++;
}
+   spin_unlock_irq(&css_set_lock);
 
if (notify_on_release(parent))
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
@@ -5233,10 +5237,12 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
if (parent && cgroup_is_threaded(cgrp))
parent->nr_threaded_children--;
 
+   spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) {
tcgrp->nr_descendants--;
tcgrp->nr_dying_descendants++;
}
+   spin_unlock_irq(&css_set_lock);
 
cgroup1_check_for_release(parent);
 
-- 
2.17.2
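
The locking rule stated in the comment, spelled out as a sketch
(illustrative helpers, not patch code): readers may hold either lock,
writers must hold both:

    /* Reader: either cgroup_mutex or css_set_lock suffices. */
    static int read_nr_dying(struct cgroup *cgrp)
    {
            int nr;

            spin_lock_irq(&css_set_lock);
            nr = cgrp->nr_dying_descendants;
            spin_unlock_irq(&css_set_lock);
            return nr;
    }

    /* Writer: both locks must be held (cgroup_mutex already taken here). */
    static void inc_nr_dying(struct cgroup *cgrp)
    {
            lockdep_assert_held(&cgroup_mutex);

            spin_lock_irq(&css_set_lock);
            cgrp->nr_dying_descendants++;
            spin_unlock_irq(&css_set_lock);
    }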



[PATCH v5 4/7] cgroup: cgroup v2 freezer

2018-12-07 Thread Roman Gushchin
Cgroup v1 implements the freezer controller, which provides the ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.

This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
mostly tried to imitate the system-wide freezer. While uninterruptible
sleep is fine when all tasks are going to be frozen (the hibernation case),
it's not an acceptable state when only a subset of the system is frozen.

The cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH does not work, because non-fatal signal delivery
is blocked in the frozen state.

There are some interface differences between the cgroup v1 and cgroup v2
freezers too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using the cgroup.events control file ("frozen" field). There are no
dedicated transitional states (see the usage sketch after this list).
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
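
A minimal userspace sketch of points (2) and (3) above, assuming
cgroup v2 is mounted at /sys/fs/cgroup and a cgroup named "test"
already exists (both are assumptions for the example, not part of the
patch): write "1" to cgroup.freeze, then re-read cgroup.events until
the "frozen 1" line appears.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/test/cgroup.freeze", "w");
	char line[64];

	if (!f)
		return 1;
	fputs("1", f);
	fclose(f);

	/* The interface is asynchronous: wait until "frozen 1" shows up. */
	for (;;) {
		int frozen = 0;

		f = fopen("/sys/fs/cgroup/test/cgroup.events", "r");
		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strcmp(line, "frozen 1\n"))
				frozen = 1;
		fclose(f);
		if (frozen)
			break;
		usleep(10000);	/* a real consumer would poll()/inotify-wait */
	}
	printf("cgroup is frozen\n");
	return 0;
}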

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: Oleg Nesterov 
Cc: kernel-t...@fb.com
---
 include/linux/cgroup-defs.h  |  28 
 include/linux/cgroup.h   |  42 +
 include/linux/sched.h|   2 +
 include/linux/sched/jobctl.h |   2 +
 include/linux/sched/signal.h |   3 +
 kernel/cgroup/Makefile   |   2 +-
 kernel/cgroup/cgroup.c   | 106 +++-
 kernel/cgroup/freezer.c  | 313 +++
 kernel/signal.c  |  55 +-
 9 files changed, 544 insertions(+), 9 deletions(-)
 create mode 100644 kernel/cgroup/freezer.c

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 03355d7008ff..e2e3e2f4c692 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -64,6 +64,12 @@ enum {
 * specified at mount time and thus is implemented here.
 */
CGRP_CPUSET_CLONE_CHILDREN,
+
+   /* Control group has to be frozen. */
+   CGRP_FREEZE,
+
+   /* Cgroup is frozen. */
+   CGRP_FROZEN,
 };
 
 /* cgroup_root->flags */
@@ -316,6 +322,25 @@ struct cgroup_rstat_cpu {
struct cgroup *updated_next;/* NULL iff not on the list */
 };
 
+struct cgroup_freezer_state {
+   /* Should the cgroup and its descendants be frozen. */
+   bool freeze;
+
+   /* Should the cgroup actually be frozen? */
+   int e_freeze;
+
+   /* Fields below are protected by css_set_lock */
+
+   /* Number of frozen descendant cgroups */
+   int nr_frozen_descendants;
+
+   /* Number of tasks to freeze */
+   int nr_tasks_to_freeze;
+
+   /* Number of frozen tasks */
+   int nr_frozen_tasks;
+};
+
 struct cgroup {
/* self css with NULL ->ss, points back to this cgroup */
struct cgroup_subsys_state self;
@@ -452,6 +477,9 @@ struct cgroup {
/* If there is block congestion on this cgroup. */
atomic_t congestion_count;
 
+   /* Used to store internal freezer state */
+   struct cgroup_freezer_state freezer;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d12757a65b0..3405a9d476ff 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -877,4 +877,46 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
free_cgroup_ns(ns);
 }
 
+#ifdef CONFIG_CGROUPS
+
+void cgroup_enter_frozen(void);
+void cgroup_leave_frozen(void);
+void cgroup_dec_tasks_to_freeze(struct cgroup *cgrp);
+void cgroup_freeze(struct cgroup *cgrp, bool freeze);
+void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
+struct cgroup *dst);
+static inline bool cgroup_task_freeze(struct task_struct *task)
+{
+   bool ret;
+
+   if (task->flags & PF_KTHREAD)
+   return false;
+
+   rcu_read_lock();
+   ret = test_

[PATCH v5 5/7] kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()

2018-12-07 Thread Roman Gushchin
If the cgroup destruction races with the exit() of a belonging
process (or processes), cg_kill_all() may fail. That's not a good reason to make
cg_destroy() fail and leave the cgroup in place, potentially causing
subsequent test runs to fail.

Signed-off-by: Roman Gushchin 
Cc: Shuah Khan 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux-kselft...@vger.kernel.org
---
 tools/testing/selftests/cgroup/cgroup_util.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 14c9fe284806..eba06f94433b 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -227,9 +227,7 @@ int cg_destroy(const char *cgroup)
 retry:
ret = rmdir(cgroup);
if (ret && errno == EBUSY) {
-   ret = cg_killall(cgroup);
-   if (ret)
-   return ret;
+   cg_killall(cgroup);
usleep(100);
goto retry;
}
-- 
2.17.2



[PATCH v5 7/7] cgroup: document cgroup v2 freezer interface

2018-12-07 Thread Roman Gushchin
Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin 
Reviewed-by: Mike Rapoport 
Cc: Tejun Heo 
Cc: linux-doc@vger.kernel.org
Cc: kernel-t...@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +
 1 file changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 07e06136a550..f8335e26b362 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -864,6 +864,8 @@ All cgroup core files are prefixed with "cgroup."
  populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+   1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
A read-write single value files.  The default is "max".
@@ -897,6 +899,31 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+   A read-write single value file which exists on non-root cgroups.
+   Allowed values are "0" and "1". The default is "0".
+
+   Writing "1" to the file causes freezing of the cgroup and all
+   descendant cgroups. This means that all belonging processes will
+   be stopped and will not run until the cgroup will be explicitly
+   unfrozen. Freezing of the cgroup may take some time; when this action
+   is completed, the "frozen" value in the cgroup.events control file
+   will be updated to "1" and the corresponding notification will be
+   issued.
+
+   A cgroup can be frozen either by its own settings, or by settings
+   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+   cgroup will remain frozen.
+
+   Processes in the frozen cgroup can be killed by a fatal signal.
+   They also can enter and leave a frozen cgroup: either by an explicit
+   move by a user, or if freezing of the cgroup races with fork().
+   If a process is moved to a frozen cgroup, it stops. If a process is
+   moved out of a frozen cgroup, it becomes running.
+
+   Frozen status of a cgroup doesn't affect any cgroup tree operations:
+   it's possible to delete a frozen (and empty) cgroup, as well as
+   create new sub-cgroups.
 
 Controllers
 ===
-- 
2.17.2
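
To make the "can be killed by a fatal signal" paragraph above concrete,
here is a hedged userspace sketch (the cgroup path and the pre-frozen
"frozen-grp" cgroup are assumptions for the example): a sleeping child
is moved into a frozen cgroup and is still reaped after SIGKILL.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();
	FILE *f;

	if (pid < 0)
		return 1;
	if (pid == 0) {
		pause();	/* the child just sits there */
		_exit(0);
	}

	/* Move the child into an already frozen cgroup (hypothetical path). */
	f = fopen("/sys/fs/cgroup/frozen-grp/cgroup.procs", "w");
	if (f) {
		fprintf(f, "%d\n", pid);
		fclose(f);
	}

	/* A fatal signal is delivered even in the frozen state. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	printf("frozen child reaped after SIGKILL\n");
	return 0;
}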



[PATCH v5 1/7] cgroup: rename freezer.c into legacy_freezer.c

2018-12-07 Thread Roman Gushchin
Freezer.c will contain an implementation of cgroup v2 freezer,
so let's rename the v1 freezer to avoid naming conflicts.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
---
 kernel/cgroup/Makefile| 2 +-
 kernel/cgroup/{freezer.c => legacy_freezer.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename kernel/cgroup/{freezer.c => legacy_freezer.c} (100%)

diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index bfcdae896122..8d5689ca94b9 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
 
-obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
+obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/legacy_freezer.c
similarity index 100%
rename from kernel/cgroup/freezer.c
rename to kernel/cgroup/legacy_freezer.c
-- 
2.17.2



Re: [PATCH v4 4/7] cgroup: cgroup v2 freezer

2018-12-03 Thread Roman Gushchin
On Mon, Dec 03, 2018 at 03:47:18PM +0100, Oleg Nesterov wrote:
> To be honest, I fail to understand this patch. At least after a quick glance,
> I will try to read it again tomorrow but so far I do not even understand the
> desired semantics wrt signals/ptrace.
> 
> On 11/30, Roman Gushchin wrote:
> >
> > @@ -368,6 +369,8 @@ static inline int signal_pending_state(long state, struct task_struct *p)
> > return 0;
> > if (!signal_pending(p))
> > return 0;
> > +   if (unlikely(cgroup_task_frozen(p)))
> > +   return __fatal_signal_pending(p);
> 
> Oh, this is not nice. And doesn't look right.
> 
> > +/*
> > + * Entry path into frozen state.
> > + * If the task was not frozen before, counters are updated and the cgroup state
> > + * is revisited. Otherwise, the task is put into the TASK_KILLABLE sleep.
> > + */
> > +void cgroup_enter_frozen(void)
> > +{
> > +   if (!current->frozen) {
> > +   struct cgroup *cgrp;
> > +
> > +   spin_lock_irq(&css_set_lock);
> > +   current->frozen = true;
> > +   cgrp = task_dfl_cgroup(current);
> > +   cgrp->freezer.nr_frozen_tasks++;
> > +   WARN_ON_ONCE(cgrp->freezer.nr_frozen_tasks >
> > +cgrp->freezer.nr_tasks_to_freeze);
> > +   cgroup_update_frozen(cgrp, true);
> > +   spin_unlock_irq(&css_set_lock);
> > +   }
> > +
> > +   __set_current_state(TASK_INTERRUPTIBLE);
> > +   schedule();
> 
> The comment above says TASK_KILLABLE, very confusing.

Sorry, it's a leftover from one of the previous versions. Fixed.

> 
> Probably this pairs with the change in signal_pending_state() above. So this
> schedule() should actually "work" in that it won't return if signal_pending().
> 
> But this can't protect from another signal_wake_up(). Yes, iiuc in this case
> cgroup_enter_frozen() will be called again "soon" but this all looks strange.

So, the idea here is to keep ptrace traps and fatal signals working, while
non-fatal signals aren't delivered.

As long as the frozen task is looping in the signal delivery loop, it's fine;
it's not going anywhere.

Without the change above, the task gets out of schedule() immediately
if any signal is pending (including non-fatal ones), so it becomes a busy loop.
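
A userspace analogue of that loop (pthreads; the names are illustrative
and this is not the kernel code): the "frozen" thread tolerates
non-fatal wakeups by re-checking its condition and going back to sleep;
if any pending wakeup ended the sleep for good, the loop would spin.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool fatal_pending;	/* plays the role of __fatal_signal_pending() */

static void *frozen_thread(void *arg)
{
	pthread_mutex_lock(&mu);
	/* Non-fatal wakeups just loop back into the "sleep". */
	while (!fatal_pending)
		pthread_cond_wait(&cv, &mu);
	pthread_mutex_unlock(&mu);
	printf("leaving the frozen state on a fatal event\n");
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, frozen_thread, NULL);

	/* A non-fatal wakeup: the thread re-checks and keeps sleeping. */
	pthread_cond_signal(&cv);

	/* A "fatal" one: now the condition allows it to leave. */
	pthread_mutex_lock(&mu);
	fatal_pending = true;
	pthread_cond_signal(&cv);
	pthread_mutex_unlock(&mu);

	pthread_join(t, NULL);
	return 0;
}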

> 
> > --- a/kernel/ptrace.c
> > +++ b/kernel/ptrace.c
> > @@ -410,6 +410,13 @@ static int ptrace_attach(struct task_struct *task, long request,
> >
> > spin_lock(&task->sighand->siglock);
> >
> > +   /*
> > +* If the process is frozen, let's wake it up to give it a chance
> > +* to enter the ptrace trap.
> > +*/
> > +   if (cgroup_task_frozen(task))
> > +   wake_up_process(task);
> 
> And why this can't race with cgroup_enter_frozen() ?
> 
> Or think of PTRACE_INTERRUPT. It can race with cgroup_enter_frozen() too, the
> tracee can miss this request because of that change in signal_pending_state().

It's a good point. So I need additional synchronization around
checking/setting JOBCTL_TRAP_FREEZE?

> 
> 
> >  static void do_jobctl_trap(void)
> >  {
> > +   struct sighand_struct *sighand = current->sighand;
> > struct signal_struct *signal = current->signal;
> > int signr = current->jobctl & JOBCTL_STOP_SIGMASK;
> >  
> > -   if (current->ptrace & PT_SEIZED) {
> > -   if (!signal->group_stop_count &&
> > -   !(signal->flags & SIGNAL_STOP_STOPPED))
> > -   signr = SIGTRAP;
> > -   WARN_ON_ONCE(!signr);
> > -   ptrace_do_notify(signr, signr | (PTRACE_EVENT_STOP << 8),
> > -CLD_STOPPED);
> > -   } else {
> > -   WARN_ON_ONCE(!signr);
> > -   ptrace_stop(signr, CLD_STOPPED, 0, NULL);
> > -   current->exit_code = 0;
> > +   if (current->jobctl & (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)) {
> > +   if (current->ptrace & PT_SEIZED) {
> > +   if (!signal->group_stop_count &&
> > +   !(signal->flags & SIGNAL_STOP_STOPPED))
> > +   signr = SIGTRAP;
> > +   WARN_ON_ONCE(!signr);
> > +   ptrace_do_notify(signr,
> > +signr | (PTRACE_EVENT_STOP << 8),
> > +CLD_STOPPED);

Re: [PATCH v4 3/7] cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock

2018-12-03 Thread Roman Gushchin
On Mon, Dec 03, 2018 at 08:17:06AM -0800, Tejun Heo wrote:
> On Fri, Nov 30, 2018 at 03:47:41PM -0800, Roman Gushchin wrote:
> > +* nr_descendants and nr_dying_descendants are protected
> > +* by cgroup_mutex and css_set_lock.
> 
> Can you be a bit more specific - hold both for writes, either for
> reads.

Sure. Thanks!


[PATCH v4 0/7] freezer for cgroup v2

2018-11-30 Thread Roman Gushchin
freezer for cgroup v2

This patchset implements freezer for cgroup v2.

It provides functionality similar to the v1 freezer's, but the interface
conforms to the cgroup v2 interface design principles, and it
provides a better user experience: tasks can be killed, ptrace works,
there is no separate controller that has to be enabled, etc.

Patches (1), (2) and (3) are some preparatory work, patch (4) contains
the implementation, patch (5) is a small cgroup kselftest fix,
patch (6) adds 6 new kselftests covering the freezer
functionality. Patch (7) adds the corresponding docs.

v4->v3:
  - fixed cgroup state transitions on task migration (by Tejun Heo)
  - reading nr_descendants doesn't require taking css_set_lock anymore
  - fixed docs based on Mike Rapoport's feedback
  - fixed double irq lock found by Dan Carpenter

v3->v2:
  - dropped TASK_FROZEN for now, frozen tasks are put into TASK_INTERRUPTIBLE
  state; it's probably not the final version, but the API question can be
  discussed separately
  - don't clear TIF_SIGPENDING before going to sleep, instead add
  task->frozen check in signal_pending_state() and recalc_sigpending()
  - cgroup-level counters are now synchronized using css_set_lock,
  which simplified the whole code (e.g. per-cgroup works were removed)
  - the amount of comments increased significantly
  - many other improvements incorporating feedback from Tejun and Oleg

v2->v1:
  - fixed locking aroung calling cgroup_freezer_leave()
  - added docs

Roman Gushchin (7):
  cgroup: rename freezer.c into legacy_freezer.c
  cgroup: implement __cgroup_task_count() helper
  cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
  cgroup: cgroup v2 freezer
  kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
  kselftests: cgroup: add freezer controller self-tests
  cgroup: document cgroup v2 freezer interface

 Documentation/admin-guide/cgroup-v2.rst   |  27 +
 include/linux/cgroup-defs.h   |  31 +
 include/linux/cgroup.h|  42 ++
 include/linux/sched.h |   2 +
 include/linux/sched/jobctl.h  |   2 +
 include/linux/sched/signal.h  |   3 +
 kernel/cgroup/Makefile|   4 +-
 kernel/cgroup/cgroup-internal.h   |   1 +
 kernel/cgroup/cgroup-v1.c |  16 -
 kernel/cgroup/cgroup.c| 151 +++-
 kernel/cgroup/freezer.c   | 653 +++--
 kernel/cgroup/legacy_freezer.c| 481 
 kernel/ptrace.c   |   7 +
 kernel/signal.c   |  58 +-
 tools/testing/selftests/cgroup/.gitignore |   1 +
 tools/testing/selftests/cgroup/Makefile   |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  85 ++-
 tools/testing/selftests/cgroup/cgroup_util.h  |   7 +
 tools/testing/selftests/cgroup/test_freezer.c | 685 ++
 19 files changed, 1811 insertions(+), 447 deletions(-)
 create mode 100644 kernel/cgroup/legacy_freezer.c
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

-- 
2.17.2



[PATCH v4 1/7] cgroup: rename freezer.c into legacy_freezer.c

2018-11-30 Thread Roman Gushchin
Freezer.c will contain an implementation of cgroup v2 freezer,
so let's rename the v1 freezer to avoid naming conflicts.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
---
 kernel/cgroup/Makefile| 2 +-
 kernel/cgroup/{freezer.c => legacy_freezer.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename kernel/cgroup/{freezer.c => legacy_freezer.c} (100%)

diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index bfcdae896122..8d5689ca94b9 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o
 
-obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
+obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/legacy_freezer.c
similarity index 100%
rename from kernel/cgroup/freezer.c
rename to kernel/cgroup/legacy_freezer.c
-- 
2.17.2



[PATCH v4 4/7] cgroup: cgroup v2 freezer

2018-11-30 Thread Roman Gushchin
Cgroup v1 implements the freezer controller, which provides the ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.

This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
which mostly tried to imitate the system-wide freezer. While
uninterruptible sleep is fine when all tasks are going to be frozen
(the hibernation case), it's not an acceptable state for just a subset
of the system.

The cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH does not work, because non-fatal signal delivery
is blocked in the frozen state.

There are some interface differences between the cgroup v1 and cgroup v2
freezers too, which are required to conform to the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: Oleg Nesterov 
Cc: kernel-t...@fb.com
---
 include/linux/cgroup-defs.h  |  28 +++
 include/linux/cgroup.h   |  42 +
 include/linux/sched.h|   2 +
 include/linux/sched/jobctl.h |   2 +
 include/linux/sched/signal.h |   3 +
 kernel/cgroup/Makefile   |   2 +-
 kernel/cgroup/cgroup.c   | 112 +++-
 kernel/cgroup/freezer.c  | 318 +++
 kernel/ptrace.c  |   7 +
 kernel/signal.c  |  58 +--
 10 files changed, 556 insertions(+), 18 deletions(-)
 create mode 100644 kernel/cgroup/freezer.c

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4327fd6e8121..23a99be114b8 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -64,6 +64,12 @@ enum {
 * specified at mount time and thus is implemented here.
 */
CGRP_CPUSET_CLONE_CHILDREN,
+
+   /* Control group has to be frozen. */
+   CGRP_FREEZE,
+
+   /* Cgroup is frozen. */
+   CGRP_FROZEN,
 };
 
 /* cgroup_root->flags */
@@ -316,6 +322,25 @@ struct cgroup_rstat_cpu {
struct cgroup *updated_next;/* NULL iff not on the list */
 };
 
+struct cgroup_freezer_state {
+   /* Should the cgroup and its descendants be frozen. */
+   bool freeze;
+
+   /* Should the cgroup actually be frozen? */
+   int e_freeze;
+
+   /* Fields below are protected by css_set_lock */
+
+   /* Number of frozen descendant cgroups */
+   int nr_frozen_descendants;
+
+   /* Number of tasks to freeze */
+   int nr_tasks_to_freeze;
+
+   /* Number of frozen tasks */
+   int nr_frozen_tasks;
+};
+
 struct cgroup {
/* self css with NULL ->ss, points back to this cgroup */
struct cgroup_subsys_state self;
@@ -450,6 +475,9 @@ struct cgroup {
/* If there is block congestion on this cgroup. */
atomic_t congestion_count;
 
+   /* Used to store internal freezer state */
+   struct cgroup_freezer_state freezer;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9d12757a65b0..dc25b321ae1c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -877,4 +877,46 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
free_cgroup_ns(ns);
 }
 
+#ifdef CONFIG_CGROUPS
+
+void cgroup_enter_frozen(void);
+void cgroup_leave_frozen(void);
+void cgroup_freeze(struct cgroup *cgrp, bool freeze);
+void cgroup_update_frozen(struct cgroup *cgrp, bool frozen);
+void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
+struct cgroup *dst);
+static inline bool cgroup_task_freeze(struct task_struct *task)
+{
+   bool ret;
+
+   if (task->flags & PF_KTHREAD)
+   return false;

[PATCH v4 2/7] cgroup: implement __cgroup_task_count() helper

2018-11-30 Thread Roman Gushchin
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.

Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
---
 kernel/cgroup/cgroup-internal.h |  1 +
 kernel/cgroup/cgroup-v1.c   | 16 
 kernel/cgroup/cgroup.c  | 33 +
 3 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c950864016e2..a195328431ce 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -226,6 +226,7 @@ int cgroup_rmdir(struct kernfs_node *kn);
 int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 struct kernfs_root *kf_root);
 
+int __cgroup_task_count(const struct cgroup *cgrp);
 int cgroup_task_count(const struct cgroup *cgrp);
 
 /*
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 51063e7a93c2..6134fef07d57 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -336,22 +336,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
return l;
 }
 
-/**
- * cgroup_task_count - count the number of tasks in a cgroup.
- * @cgrp: the cgroup in question
- */
-int cgroup_task_count(const struct cgroup *cgrp)
-{
-   int count = 0;
-   struct cgrp_cset_link *link;
-
-   spin_lock_irq(&css_set_lock);
-   list_for_each_entry(link, &cgrp->cset_links, cset_link)
-   count += link->cset->nr_tasks;
-   spin_unlock_irq(&css_set_lock);
-   return count;
-}
-
 /*
  * Load a cgroup's pidarray with either procs' tgids or tasks' pids
  */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e06994fd4e34..7519a4307021 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -563,6 +563,39 @@ static void cgroup_get_live(struct cgroup *cgrp)
css_get(&cgrp->self);
 }
 
+/**
+ * __cgroup_task_count - count the number of tasks in a cgroup. The caller
+ * is responsible for taking the css_set_lock.
+ * @cgrp: the cgroup in question
+ */
+int __cgroup_task_count(const struct cgroup *cgrp)
+{
+   int count = 0;
+   struct cgrp_cset_link *link;
+
+   lockdep_assert_held(&css_set_lock);
+
+   list_for_each_entry(link, &cgrp->cset_links, cset_link)
+   count += link->cset->nr_tasks;
+
+   return count;
+}
+
+/**
+ * cgroup_task_count - count the number of tasks in a cgroup.
+ * @cgrp: the cgroup in question
+ */
+int cgroup_task_count(const struct cgroup *cgrp)
+{
+   int count;
+
+   spin_lock_irq(&css_set_lock);
+   count = __cgroup_task_count(cgrp);
+   spin_unlock_irq(&css_set_lock);
+
+   return count;
+}
+
 struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
 {
struct cgroup *cgrp = of->kn->parent->priv;
-- 
2.17.2



[PATCH v4 3/7] cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock

2018-11-30 Thread Roman Gushchin
The number of descendant cgroups and the number of dying
descendant cgroups are currently synchronized using the cgroup_mutex.

The number of descendant cgroups will be required by the cgroup v2
freezer, which will use it to determine if a cgroup is frozen
(depending on total number of descendants and number of frozen
descendants). It's not always acceptable to grab the cgroup_mutex,
especially from quite hot paths (e.g. exit()).

To avoid this, let's additionally synchronize these counters using
the css_set_lock.

So, it's safe to read these counters with either cgroup_mutex or
css_set_lock locked; to change them, both locks must be acquired.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
---
 include/linux/cgroup-defs.h | 3 +++
 kernel/cgroup/cgroup.c  | 6 ++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8fcbae1b8db0..4327fd6e8121 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -348,6 +348,9 @@ struct cgroup {
 * Dying cgroups are cgroups which were deleted by a user,
 * but are still existing because someone else is holding a reference.
 * max_descendants is a maximum allowed number of descent cgroups.
+*
+* nr_descendants and nr_dying_descendants are protected
+* by cgroup_mutex and css_set_lock.
 */
int nr_descendants;
int nr_dying_descendants;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7519a4307021..f89dde50f693 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4723,9 +4723,11 @@ static void css_release_work_fn(struct work_struct *work)
if (cgroup_on_dfl(cgrp))
cgroup_rstat_flush(cgrp);
 
+   spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
 tcgrp = cgroup_parent(tcgrp))
tcgrp->nr_dying_descendants--;
+   spin_unlock_irq(&css_set_lock);
 
cgroup_idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
cgrp->id = -1;
@@ -4943,12 +4945,14 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
if (ret)
goto out_psi_free;
 
+   spin_lock_irq(&css_set_lock);
for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;
 
if (tcgrp != cgrp)
tcgrp->nr_descendants++;
}
+   spin_unlock_irq(&css_set_lock);
 
if (notify_on_release(parent))
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
@@ -5233,10 +5237,12 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
if (parent && cgroup_is_threaded(cgrp))
parent->nr_threaded_children--;
 
+   spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp; tcgrp = cgroup_parent(tcgrp)) {
tcgrp->nr_descendants--;
tcgrp->nr_dying_descendants++;
}
+   spin_unlock_irq(&css_set_lock);
 
cgroup1_check_for_release(parent);
 
-- 
2.17.2



[PATCH v4 5/7] kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()

2018-11-30 Thread Roman Gushchin
If the cgroup destruction races with the exit() of a belonging
process (or processes), cg_kill_all() may fail. That's not a good reason to make
cg_destroy() fail and leave the cgroup in place, potentially causing
subsequent test runs to fail.

Signed-off-by: Roman Gushchin 
Cc: Shuah Khan 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux-kselft...@vger.kernel.org
---
 tools/testing/selftests/cgroup/cgroup_util.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 14c9fe284806..eba06f94433b 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -227,9 +227,7 @@ int cg_destroy(const char *cgroup)
 retry:
ret = rmdir(cgroup);
if (ret && errno == EBUSY) {
-   ret = cg_killall(cgroup);
-   if (ret)
-   return ret;
+   cg_killall(cgroup);
usleep(100);
goto retry;
}
-- 
2.17.2



[PATCH v4 6/7] kselftests: cgroup: add freezer controller self-tests

2018-11-30 Thread Roman Gushchin
This patch implements six tests for the freezer controller for
cgroup v2:
1) a simple test, which aims to freeze and unfreeze a cgroup with 100
processes
2) a more complicated tree test, which creates a hierarchy of cgroups,
puts some processes in some cgroups, and tries to freeze and unfreeze
different parts of the subtree
3) a forkbomb test: the test aims to freeze a forkbomb running in a
cgroup, kill all tasks in the cgroup and remove the cgroup without
unfreezing it.
4) rmdir test: the test creates two nested cgroups, freezes the parent
one, checks that the child can be successfully removed, and that a new
child can be created
5) migration tests: the test checks migration of a task between
frozen cgroups: from a frozen to a running, from a running to a
frozen, and from a frozen to a frozen.
6) ptrace test: the test checks that it's possible to attach to
a process in a frozen cgroup, get some information and detach, and
the cgroup will remain frozen.

Expected output:

  $ ./test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_rmdir
  ok 5 test_cgfreezer_migrate
  ok 6 test_cgfreezer_ptrace

Signed-off-by: Roman Gushchin 
Cc: Shuah Khan 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux-kselft...@vger.kernel.org
---
 tools/testing/selftests/cgroup/.gitignore |   1 +
 tools/testing/selftests/cgroup/Makefile   |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  81 ++-
 tools/testing/selftests/cgroup/cgroup_util.h  |   7 +
 tools/testing/selftests/cgroup/test_freezer.c | 685 ++
 5 files changed, 775 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/cgroup/test_freezer.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index adacda50a4b2..7f9835624793 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -1,2 +1,3 @@
 test_memcontrol
 test_core
+test_freezer
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 23fbaa4a9630..8d369b6a2069 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -5,8 +5,10 @@ all:
 
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_core
+TEST_GEN_PROGS += test_freezer
 
 include ../lib.mk
 
 $(OUTPUT)/test_memcontrol: cgroup_util.c
 $(OUTPUT)/test_core: cgroup_util.c
+$(OUTPUT)/test_freezer: cgroup_util.c
diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index eba06f94433b..e9cdad673901 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -74,6 +74,16 @@ char *cg_name_indexed(const char *root, const char *name, int index)
return ret;
 }
 
+char *cg_control(const char *cgroup, const char *control)
+{
+   size_t len = strlen(cgroup) + strlen(control) + 2;
+   char *ret = malloc(len);
+
+   snprintf(ret, len, "%s/%s", cgroup, control);
+
+   return ret;
+}
+
 int cg_read(const char *cgroup, const char *control, char *buf, size_t len)
 {
char path[PATH_MAX];
@@ -196,7 +206,59 @@ int cg_create(const char *cgroup)
return mkdir(cgroup, 0644);
 }
 
-static int cg_killall(const char *cgroup)
+int cg_for_all_procs(const char *cgroup, int (*fn)(int pid, void *arg),
+void *arg)
+{
+   char buf[PAGE_SIZE];
+   char *ptr = buf;
+   int ret;
+
+   if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+   return -1;
+
+   while (ptr < buf + sizeof(buf)) {
+   int pid = strtol(ptr, &ptr, 10);
+
+   if (pid == 0)
+   break;
+   if (*ptr)
+   ptr++;
+   else
+   break;
+   ret = fn(pid, arg);
+   if (ret)
+   return ret;
+   }
+
+   return 0;
+}
+
+int cg_wait_for_proc_count(const char *cgroup, int count)
+{
+   char buf[10 * PAGE_SIZE] = {0};
+   int attempts;
+   char *ptr;
+
+   for (attempts = 10; attempts >= 0; attempts--) {
+   int nr = 0;
+
+   if (cg_read(cgroup, "cgroup.procs", buf, sizeof(buf)))
+   break;
+
+   for (ptr = buf; *ptr; ptr++)
+   if (*ptr == '\n')
+   nr++;
+
+   if (nr >= count)
+   return 0;
+
+   usleep(10);
+   }
+
+   return -1;
+}
+
+int cg_killall(const char *cgroup)
 {
char buf[PAGE_SIZE];
char *ptr = buf;
@@ -238,6 +300,14 @@ int cg_destroy(const char *cgroup)
return ret;
 }
 
+int cg_enter(const char *cgroup, int pid)
+{
+   char pidbuf[64];
+
+   snprintf(pidbuf, sizeof(pidbuf), "%d", pid);
+   return 

[PATCH v4 7/7] cgroup: document cgroup v2 freezer interface

2018-11-30 Thread Roman Gushchin
Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: linux-doc@vger.kernel.org
Cc: kernel-t...@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 27 +
 1 file changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 07e06136a550..f8335e26b362 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -864,6 +864,8 @@ All cgroup core files are prefixed with "cgroup."
  populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+   1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
A read-write single value files.  The default is "max".
@@ -897,6 +899,31 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+   A read-write single value file which exists on non-root cgroups.
+   Allowed values are "0" and "1". The default is "0".
+
+   Writing "1" to the file causes freezing of the cgroup and all
+   descendant cgroups. This means that all belonging processes will
+   be stopped and will not run until the cgroup will be explicitly
+   unfrozen. Freezing of the cgroup may take some time; when this action
+   is completed, the "frozen" value in the cgroup.events control file
+   will be updated to "1" and the corresponding notification will be
+   issued.
+
+   A cgroup can be frozen either by its own settings, or by settings
+   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+   cgroup will remain frozen.
+
+   Processes in the frozen cgroup can be killed by a fatal signal.
+   They also can enter and leave a frozen cgroup: either by an explicit
+   move by a user, or if freezing of the cgroup races with fork().
+   If a process is moved to a frozen cgroup, it stops. If a process is
+   moved out of a frozen cgroup, it becomes running.
+
+   Frozen status of a cgroup doesn't affect any cgroup tree operations:
+   it's possible to delete a frozen (and empty) cgroup, as well as
+   create new sub-cgroups.
 
 Controllers
 ===
-- 
2.17.2



Re: [PATCH v3 7/7] cgroup: document cgroup v2 freezer interface

2018-11-19 Thread Roman Gushchin
On Sat, Nov 17, 2018 at 12:02:28AM -0800, Mike Rapoport wrote:
> Hi,
> 
> On Fri, Nov 16, 2018 at 04:38:30PM -0800, Roman Gushchin wrote:
> > Describe cgroup v2 freezer interface in the cgroup v2 admin guide.
> > 
> > Signed-off-by: Roman Gushchin 
> > Cc: Tejun Heo 
> > Cc: linux-doc@vger.kernel.org
> > Cc: kernel-t...@fb.com
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 26 +
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 184193bcb262..a065c0bed88c 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -862,6 +862,8 @@ All cgroup core files are prefixed with "cgroup."
> >   populated
> > 1 if the cgroup or its descendants contains any live
> > processes; otherwise, 0.
> > + frozen
> > +   1 if the cgroup is frozen; otherwise, 0.
> > 
> >cgroup.max.descendants
> > A read-write single value files.  The default is "max".
> > @@ -895,6 +897,30 @@ All cgroup core files are prefixed with "cgroup."
> > A dying cgroup can consume system resources not exceeding
> > limits, which were active at the moment of cgroup deletion.
> > 
> > +  cgroup.freeze
> > +   A read-write single value file which exists on non-root cgroups.
> > +   Allowed values are "0" and "1". The default is "0".
> > +
> > +   Writing "1" to the file causes freezing of the cgroup and all
> > +   descendant cgroups. This means that all belonging processes will
> > +   be stopped and will not run until the cgroup will be explicitly
> > +   unfrozen. Freezing of the cgroup may take some time; when the process
> 
> "when the process is complete" sounds somewhat ambiguous, it's unclear
> whether freezing is complete or the process that's being frozen is
> complete.
> 
> Maybe "when this action is completed"?
> 
> > +   is complete, the "frozen" value in the cgroup.events control file
> > +   will be updated and the corresponding notification will be issued.
> 
> Can you please clarify how exactly cgroup.events would be updated?
> 
> > +   Cgroup can be frozen either by its own settings, either by settings
> 
>   ^ A cgroup ... and maybe there are more "a" and "the" that should be
> fixed, it's hard for me to tell.
> 
> Also, I believe "either ..., or ..." sounds better than "either ...,
> either ..."
> 
> > +   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
> > +   cgroup will remain frozen.
> > +
> > +   Processes in the frozen cgroup can be killed by a fatal signal.
> > +   They also can enter and leave a frozen cgroup: either by an explicit
> > +   move by a user, either if freezing of the cgroup races with fork().
> 
> ditto
> 
> > +   If a cgroup is moved to a frozen cgroup, it stops. If a process is
> 
> ^ process?
> 
> > +   moving out of a frozen cgroup, it becomes running.
> 
>^ moved

Hello, Mike!

Thanks for the review! I agree with all comments above; fixes queued for v4.

> 
> > +   Frozen status of a cgroup doesn't affect any cgroup tree operations:
> > +   it's possible to delete a frozen (and empty) cgroup, as well as
> > +   create new sub-cgroups.
> 
> Maybe it's also worth adding that freezing a process has no effect on its
> memory consumption, at least directly.

Hm, isn't it the expected behavior?

In any case, I assume that the cgroup.freeze knob description is not the best
place for such explanations. Maybe it's better to add a standalone paragraph with
the description of the frozen process state: what's expected to work, what's
not, etc. I'll return to this question a bit later, once we agree on the user
interface and the implementation.

Thanks!


[PATCH v3 7/7] cgroup: document cgroup v2 freezer interface

2018-11-16 Thread Roman Gushchin
Describe cgroup v2 freezer interface in the cgroup v2 admin guide.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: linux-doc@vger.kernel.org
Cc: kernel-t...@fb.com
---
 Documentation/admin-guide/cgroup-v2.rst | 26 +
 1 file changed, 26 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..a065c0bed88c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -862,6 +862,8 @@ All cgroup core files are prefixed with "cgroup."
  populated
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ frozen
+   1 if the cgroup is frozen; otherwise, 0.
 
   cgroup.max.descendants
A read-write single value files.  The default is "max".
@@ -895,6 +897,30 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
 
+  cgroup.freeze
+   A read-write single value file which exists on non-root cgroups.
+   Allowed values are "0" and "1". The default is "0".
+
+   Writing "1" to the file causes freezing of the cgroup and all
+   descendant cgroups. This means that all belonging processes will
+   be stopped and will not run until the cgroup will be explicitly
+   unfrozen. Freezing of the cgroup may take some time; when the process
+   is complete, the "frozen" value in the cgroup.events control file
+   will be updated and the corresponding notification will be issued.
+
+   Cgroup can be frozen either by its own settings, either by settings
+   of any ancestor cgroups. If any of ancestor cgroups is frozen, the
+   cgroup will remain frozen.
+
+   Processes in the frozen cgroup can be killed by a fatal signal.
+   They also can enter and leave a frozen cgroup: either by an explicit
+   move by a user, either if freezing of the cgroup races with fork().
+   If a cgroup is moved to a frozen cgroup, it stops. If a process is
+   moving out of a frozen cgroup, it becomes running.
+
+   Frozen status of a cgroup doesn't affect any cgroup tree operations:
+   it's possible to delete a frozen (and empty) cgroup, as well as
+   create new sub-cgroups.
 
 Controllers
 ===
-- 
2.17.2



[PATCH] cgroup, docs: add a note about returning EBUSY in some cases

2018-05-22 Thread Roman Gushchin
Explicitly document the EBUSY returned by writing into cgroup.procs
if controllers are enabled, and by writing into cgroup.subtree_control
if there are attached processes.

The return code might be slightly surprising, and because there is
nothing obviously better, let's document it at least.
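
A small sketch of what the documented failure looks like from userspace
(the cgroup path is hypothetical, and the target cgroup is assumed to
have controllers enabled via cgroup.subtree_control): the write to
cgroup.procs fails and errno is set to EBUSY.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/fs/cgroup/test/cgroup.procs", O_WRONLY);
	char buf[16];
	int n;

	if (fd < 0)
		return 1;
	n = snprintf(buf, sizeof(buf), "%d", getpid());
	if (write(fd, buf, n) < 0 && errno == EBUSY)
		printf("write failed with EBUSY, as documented\n");
	close(fd);
	return 0;
}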

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/cgroup-v2.txt | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..57302f88a4ad 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -799,6 +799,9 @@ All cgroup core files are prefixed with "cgroup."
When delegating a sub-hierarchy, write access to this file
should be granted along with the containing directory.
 
+   If the target cgroup has enabled controllers, writing to this
+   file will fail with EBUSY.
+
In a threaded cgroup, reading this file fails with EOPNOTSUPP
as all the processes belong to the thread root.  Writing is
supported and moves every thread of the process to the cgroup.
@@ -850,6 +853,9 @@ All cgroup core files are prefixed with "cgroup."
the last one is effective.  When multiple enable and disable
operations are specified, either all succeed or all fail.
 
+   If the cgroup has attached tasks, writing to this file will
+   fail with EBUSY.
+
   cgroup.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
-- 
2.14.3



[PATCH v3 4/4] mm/docs: describe memory.low refinements

2018-04-05 Thread Roman Gushchin
Refine cgroup v2 docs after latest memory.low changes.

Signed-off-by: Roman Gushchin 
Cc: Andrew Morton 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux...@kvack.org
Cc: cgro...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
---
 Documentation/cgroup-v2.txt | 28 +---
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index f728e55602b2..7ee462b8a6ac 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1006,10 +1006,17 @@ PAGE_SIZE multiple when read back.
A read-write single value file which exists on non-root
cgroups.  The default is "0".
 
-   Best-effort memory protection.  If the memory usages of a
-   cgroup and all its ancestors are below their low boundaries,
-   the cgroup's memory won't be reclaimed unless memory can be
-   reclaimed from unprotected cgroups.
+   Best-effort memory protection.  If the memory usage of a
+   cgroup is within its effective low boundary, the cgroup's
+   memory won't be reclaimed unless memory can be reclaimed
+   from unprotected cgroups.
+
+   Effective low boundary is limited by memory.low values of
+   all ancestor cgroups. If there is memory.low overcommitment
+   (child cgroup or cgroups are requiring more protected memory
+   than the parent will allow), then each child cgroup will get
+   the part of the parent's protection proportional to its
+   actual memory usage below memory.low.
 
Putting more memory than generally available under this
protection is discouraged.
@@ -2008,17 +2015,8 @@ system performance due to overreclaim, to the point where the feature
 becomes self-defeating.
 
 The memory.low boundary on the other hand is a top-down allocated
-reserve.  A cgroup enjoys reclaim protection when it and all its
-ancestors are below their low boundaries, which makes delegation of
-subtrees possible.  Secondly, new cgroups have no reserve per default
-and in the common case most cgroups are eligible for the preferred
-reclaim pass.  This allows the new low boundary to be efficiently
-implemented with just a minor addition to the generic reclaim code,
-without the need for out-of-band data structures and reclaim passes.
-Because the generic reclaim code considers all cgroups except for the
-ones running low in the preferred first reclaim pass, overreclaim of
-individual groups is eliminated as well, resulting in much better
-overall workload performance.
+reserve.  A cgroup enjoys reclaim protection when it's within its low,
+which makes delegation of subtrees possible.
 
 The original high boundary, the hard limit, is defined as a strict
 limit that can not budge, even if the OOM killer has to be called.
-- 
2.14.3
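
A toy calculation of the proportional distribution described in the
first hunk above, with made-up numbers (it mirrors the documented rule,
not the kernel implementation): two children overcommit a parent's 10G
memory.low, so each gets a share proportional to its usage below its
own memory.low.

#include <stdio.h>

int main(void)
{
	double parent_low = 10.0;		/* parent's memory.low, GiB */
	double low[2]   = { 8.0, 12.0 };	/* children's memory.low */
	double usage[2] = { 6.0, 12.0 };	/* children's actual usage */
	double prot[2], sum = 0.0;
	int i;

	for (i = 0; i < 2; i++) {
		/* Only usage below memory.low counts as protected. */
		prot[i] = usage[i] < low[i] ? usage[i] : low[i];
		sum += prot[i];
	}
	for (i = 0; i < 2; i++) {
		double elow = sum > parent_low ?
			parent_low * prot[i] / sum : prot[i];
		printf("child %d: effective low ~= %.2f GiB\n", i, elow);
	}
	return 0;	/* prints ~3.33 GiB and ~6.67 GiB */
}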



Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-30 Thread Roman Gushchin
On Tue, Jan 30, 2018 at 01:08:52PM +0100, Michal Hocko wrote:
> On Tue 30-01-18 11:58:51, Roman Gushchin wrote:
> > On Tue, Jan 30, 2018 at 09:54:45AM +0100, Michal Hocko wrote:
> > > On Mon 29-01-18 11:11:39, Tejun Heo wrote:
> > 
> > Hello, Michal!
> > 
> > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > index 2eaed1e2243d..67bdf19f8e5b 100644
> > > --- a/Documentation/cgroup-v2.txt
> > > +++ b/Documentation/cgroup-v2.txt
> > > @@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> > >  the memory controller considers only cgroups belonging to the sub-tree
> > >  of the OOM'ing cgroup.
> > >  
> > > -The root cgroup is treated as a leaf memory cgroup, so it's compared
> > > -with other leaf memory cgroups and cgroups with oom_group option set.
> >   ^
> > IMO, this statement is important. Isn't it?
> > 
> > > +Leaf cgroups are compared based on their cumulative memory usage. The
> > > +root cgroup is treated as a leaf memory cgroup as well, so it's
> > > +compared with other leaf memory cgroups. Due to internal implementation
> > > +restrictions the size of the root cgroup is a cumulative sum of
> > > +oom_badness of all its tasks (in other words oom_score_adj of each task
> > > +is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN)
> > > +can lead to overestimating of the root cgroup consumption and it is
> > 
> > Hm, and underestimating too. Also OOM_SCORE_ADJ_MIN isn't any different
> > in this case. Say, all tasks except a small one have OOM_SCORE_ADJ set to
> > -999, this means the root cgroup has extremely low chances to be elected.
> > 
> > > +therefore discouraged. This might change in the future, though.
> > 
> > Other than that looks very good to me.
> 
> This?
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 2eaed1e2243d..34ad80ee90f2 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1291,8 +1291,15 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
>  the memory controller considers only cgroups belonging to the sub-tree
>  of the OOM'ing cgroup.
>  
> -The root cgroup is treated as a leaf memory cgroup, so it's compared
> -with other leaf memory cgroups and cgroups with oom_group option set.
> +Leaf cgroups and cgroups with oom_group option set are compared based
> +on their cumulative memory usage. The root cgroup is treated as a
> +leaf memory cgroup as well, so it's compared with other leaf memory
> +cgroups. Due to internal implementation restrictions the size of
> +the root cgroup is a cumulative sum of oom_badness of all its tasks
> +(in other words oom_score_adj of each task is obeyed). Relying on
> +oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
> +underestimating of the root cgroup consumption and it is therefore
> +discouraged. This might change in the future, though.

Acked-by: Roman Gushchin 

Thank you!


Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-01-30 Thread Roman Gushchin
On Tue, Jan 30, 2018 at 09:54:45AM +0100, Michal Hocko wrote:
> On Mon 29-01-18 11:11:39, Tejun Heo wrote:

Hello, Michal!

> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 2eaed1e2243d..67bdf19f8e5b 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1291,8 +1291,14 @@ This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
>  the memory controller considers only cgroups belonging to the sub-tree
>  of the OOM'ing cgroup.
>  
> -The root cgroup is treated as a leaf memory cgroup, so it's compared
> -with other leaf memory cgroups and cgroups with oom_group option set.
  ^
IMO, this statement is important. Isn't it?

> +Leaf cgroups are compared based on their cumulative memory usage. The
> +root cgroup is treated as a leaf memory cgroup as well, so it's
> +compared with other leaf memory cgroups. Due to internal implementation
> +restrictions the size of the root cgroup is a cumulative sum of
> +oom_badness of all its tasks (in other words oom_score_adj of each task
> +is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN)
> +can lead to overestimating of the root cgroup consumption and it is

Hm, and underestimating too. Also OOM_SCORE_ADJ_MIN isn't any different
in this case. Say, all tasks except a small one have OOM_SCORE_ADJ set to
-999, this means the root cgroup has extremely low chances to be elected.

> +therefore discouraged. This might change in the future, though.

Other than that looks very good to me.

Thank you!


Re: [patch -mm 0/4] mm, memcg: introduce oom policies

2018-01-17 Thread Roman Gushchin
On Tue, Jan 16, 2018 at 06:14:58PM -0800, David Rientjes wrote:
> There are three significant concerns about the cgroup aware oom killer as
> it is implemented in -mm:
> 
>  (1) allows users to evade the oom killer by creating subcontainers or
>  using other controllers since scoring is done per cgroup and not
>  hierarchically,
> 
>  (2) does not allow the user to influence the decisionmaking, such that
>  important subtrees cannot be preferred or biased, and
> 
>  (3) unfairly compares the root mem cgroup using completely different
>  criteria than leaf mem cgroups and allows wildly inaccurate results
>  if oom_score_adj is used.
> 
> This patchset aims to fix (1) completely and, by doing so, introduces a
> completely extensible user interface that can be expanded in the future.
> 
> It eliminates the mount option for the cgroup aware oom killer entirely
> since it is now enabled through the root mem cgroup's oom policy.
> 
> It eliminates a pointless tunable, memory.oom_group, that unnecessarily
> pollutes the mem cgroup v2 filesystem and is invalid when cgroup v2 is
> mounted with the "groupoom" option.

You're introducing a new oom_policy knob, which has two separate sets
of possible values for the root and non-root cgroups. I don't think
it aligns with the existing cgroup v2 design.

For the root cgroup it works exactly like a mount option, and both "none"
and "cgroup" values have no meaning outside of the root cgroup. We can
discuss if a knob on root cgroup is better than a mount option, or not
(I don't think so), but it has nothing to do with oom policy as you
define it for non-root cgroups.

For non-root cgroups you're introducing "all" and "tree", and the _only_
difference is that in the "all" mode all processes will be killed, rather
than just the biggest one, as in the "tree" mode. I find these names confusing;
in reality it's more "evaluate together and kill all" versus "evaluate
together and kill one".

So, it's not really the fully hierarchical approach which I thought
you were arguing for. You can easily do the same by adding a third
value to the memory.groupoom knob, as I've suggested earlier (say, "disable",
"kill" and "evaluate"), and it will be much less confusing.

Thanks!

Roman


Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-11 Thread Roman Gushchin
On Thu, Jan 11, 2018 at 10:08:09AM +0100, Michal Hocko wrote:
> On Wed 10-01-18 11:33:45, Andrew Morton wrote:
> > On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin  wrote:
> >
> > > The per-process oom_score_adj interface is not the nicest one, and I'm not
> > > sure we want to replicate it on cgroup level as is. If you have an idea 
> > > of how
> > > it should look like, please, propose a patch; otherwise it's hard to 
> > > discuss
> > > it without the code.
> > 
> > It does make sense to have some form of per-cgroup tunability.  Why is
> > the oom_score_adj approach inappropriate and what would be better?
> 
> oom_score_adj is basically unusable for any fine tuning on the process
> level for most setups except for very specialized ones. The only
> reasonable usage I've seen so far was to disable OOM killer for a
> process or make it a prime candidate. Using the same limited concept for
> cgroups sounds like repeating the same error to me.

My 2c here: the current oom_score_adj semantics is really non-trivial for
cgroups. It defines an addition/subtraction in 1/1000s of total memory or
of the OOMing cgroup's memory limit, depending on the scope of the OOM event.
This is totally out of a user's control, as they may even have no
idea about the limit of an upper cgroup in the hierarchy. I've provided
an example earlier, in which one or another of two processes in the
same cgroup can be killed, depending on the scope of the OOM event.
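
To make the scale dependence concrete, a back-of-the-envelope sketch
(the numbers are made up; the adjustment follows the oom_badness()
formula, which adds oom_score_adj * totalpages / 1000 to the score):
the same oom_score_adj value is worth wildly different amounts of
memory depending on the scope of the OOM event.

#include <stdio.h>

int main(void)
{
	long adj = 500;			/* the task's oom_score_adj */
	long scopes[2] = {
		(64L << 30) >> 12,	/* system-wide: 64 GiB in 4K pages */
		(1L << 30) >> 12,	/* memcg-wide: a 1 GiB limit */
	};
	int i;

	for (i = 0; i < 2; i++) {
		/* Pages added to the task's badness score. */
		long bonus = adj * (scopes[i] / 1000);

		printf("scope of %ld pages: adj %ld adds %ld pages (~%ld MiB)\n",
		       scopes[i], adj, bonus, bonus * 4096 / (1L << 20));
	}
	return 0;
}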

> 
> > How hard is it to graft such a thing onto the -mm patchset?
> 
> I think this should be thought through very well before we add another
> tuning. Moreover the current usecase doesn't seem to require it so I am
> not really sure why we should implement something right away and later
> suffer from API mistakes.
>  
> > > > I proposed a solution in 
> > > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > > > responded to, for all of these issues.  The idea is to do hierarchical 
> > > > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > > > comparing total usage at each level to select target cgroups.  Admins 
> > > > and 
> > > > users can use memory.oom_score_adj to influence that decisionmaking at 
> > > > each level.
> > > > 
> > > > This solves #1 because mem cgroups can be compared based on the same 
> > > > classes of memory and the root mem cgroup's usage can be fairly 
> > > > compared 
> > > > by subtracting top-level mem cgroup usage from system usage.  All of 
> > > > the 
> > > > criteria used to evaluate a leaf mem cgroup has a reasonable 
> > > > system-wide 
> > > > counterpart that can be used to do the simple subtraction.
> > > > 
> > > > This solves #2 because evaluation is done hierarchically so that 
> > > > distributing processes over a set of child cgroups either intentionally 
> > > > or unintentionally no longer evades the oom killer.  Total usage is 
> > > > always 
> > > > accounted to the parent and there is no escaping this criteria for 
> > > > users.
> > > > 
> > > > This solves #3 because it allows admins to protect important processes 
> > > > in 
> > > > cgroups that are supposed to use, for example, 75% of system memory 
> > > > without it unconditionally being selected for oom kill but still oom 
> > > > kill 
> > > > if it exceeds a certain threshold.  In this sense, the cgroup aware oom 
> > > > killer, as currently implemented, is selling mem cgroups short by 
> > > > requiring the user to accept that the important process will be oom 
> > > > killed 
> > > > iff it uses mem cgroups and isn't attached to root.  It also allows 
> > > > users 
> > > > to actually volunteer to be oom killed first without majority usage.
> > > > 
> > > > It has come up time and time again that this support can be introduced 
> > > > on 
> > > > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > > > admins and users to have control over decisionmaking, it needs a 
> > > > oom_score_adj type tunable that cannot change semantics from kernel 
> > > > version to kernel version and without polluting the mem cgroup 
> > > > filesystem.  
> > > > That, in my suggestion, is an adjustment on the amount of total 
> > > > hierarchical usage of each mem cg

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-01-10 Thread Roman Gushchin
Hello, David!

On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> On Thu, 30 Nov 2017, Andrew Morton wrote:
> 
> > > This patchset makes the OOM killer cgroup-aware.
> > 
> > Thanks, I'll grab these.
> > 
> > There has been controversy over this patchset, to say the least.  I
> > can't say that I followed it closely!  Could those who still have
> > reservations please summarise their concerns and hopefully suggest a
> > way forward?
> > 
> 
> Yes, I'll summarize what my concerns have been in the past and what they 
> are wrt the patchset as it stands in -mm.  None of them originate from my 
> current usecase or anticipated future usecase of the oom killer for 
> system-wide or memcg-constrained oom conditions.  They are based purely on 
> the patchset's use of an incomplete and unfair heuristic for deciding 
> which cgroup to target.
> 
> I'll also suggest simple changes to the patchset, which I have in the 
> past, that can be made to address all of these concerns.
> 
> 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> 
> The patchset uses two different heuristics to compare root and leaf mem 
> cgroups and scores them based on number of pages.  For the root mem 
> cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> different independent of /proc/pid/oom_score_adj, but the most obvious 
> unfairness comes from users who tune oom_score_adj.
> 
> An example: start a process that faults 1GB of anonymous memory and leave 
> it attached to the root mem cgroup.  Start six more processes that each 
> fault 1GB of anonymous memory and attached them to a leaf mem cgroup.  Set 
> all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> will always kill the 1GB process attached to the root mem cgroup.  It's 
> because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> it.
> 
> In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> 12,652,907, eight times larger even though its usage is six times smaller.
> 
> This is caused by the patchset disregarding oom_score_adj entirely for 
> leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> the complete opposite result of what the cgroup aware oom killer 
> advertises.
> 
> It also works the other way, if a large memory hog is attached to the root 
> mem cgroup but has a negative oom_score_adj it is never killed and random 
> processes are nuked solely because they happened to be attached to a leaf 
> mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> so I can't presume that it is either known nor tested.
> 
> Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> the same criteria by doing hierarchical accounting of usage and 
> subtracting from total system usage to find root usage.

I find this problem quite minor, because I haven't seen any practical problems
caused by the accounting of the root cgroup memory.
If it's a serious problem for you, it can be solved without switching to
hierarchical accounting: it's possible to sum up all leaf cgroup stats and
subtract them from the global values. So, it can be a relatively small
enhancement on top of the current mm tree. This has nothing to do with the
global victim selection approach.
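
A rough sketch of such an enhancement (illustrative only: the two
footprint helpers are invented names, not functions from the tree):

/*
 * Estimate root cgroup usage by subtracting summed leaf-cgroup stats
 * from the global counters. memcg_footprint() and global_footprint()
 * are invented names standing in for the respective stat sums.
 */
static unsigned long root_usage_estimate(void)
{
	unsigned long leaves = 0;
	struct mem_cgroup *iter;

	for_each_mem_cgroup_tree(iter, root_mem_cgroup)
		if (!memcg_has_children(iter))
			leaves += memcg_footprint(iter);

	return global_footprint() - leaves;
}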

> 
> 2. Evading the oom killer by attaching processes to child cgroups
> 
> Any cgroup on the system can attach all their processes to individual 
> child cgroups.  This is functionally the same as doing
> 
>   for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; 
> done
> 
> without the no internal process constraint introduced with cgroup v2.  All 
> child cgroups are evaluated based on their own usage: all anon, 
> unevictable, and unreclaimable slab as described previously.  It requires 
> an individual cgroup to be the single largest consumer to be targeted by 
> the oom killer.
> 
> An example: allow users to manage two different mem cgroup hierarchies 
> limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> memory in their respective hierarchies.  On a system oom condition, we'd 
> expect at least one process from user B's hierarchy would always be oom 
> killed with the cgroup aware oom killer.  In fact, the changelog 
> explicitly states it solves an issue where "1) There is no fairness 
> between containers. A small container with few large processes will be 
> chosen over a large one with huge number of small processes."
> 
> The opposite becomes true, however, if user B creates child cgroups and 
> distributes its processes such that each ch

[PATCH] cgroup, docs: document cgroup v2 device controller

2017-12-13 Thread Roman Gushchin
Add the corresponding section in cgroup v2 documentation.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: Alexei Starovoitov 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/cgroup-v2.txt | 33 +
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2cddab7efb20..d6efabb487e3 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,10 +53,11 @@ v1 is available under Documentation/cgroup-v1/.
5-3-2. Writeback
  5-4. PID
5-4-1. PID Interface Files
- 5-5. RDMA
-   5-5-1. RDMA Interface Files
- 5-6. Misc
-   5-6-1. perf_event
+ 5-5. Device
+ 5-6. RDMA
+   5-6-1. RDMA Interface Files
+ 5-7. Misc
+   5-7-1. perf_event
6. Namespace
  6-1. Basics
  6-2. The Root and Views
@@ -1429,6 +1430,30 @@ through fork() or clone(). These will return -EAGAIN if 
the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Device controller
+-----------------
+
+The device controller manages access to device files. It includes both
+the creation of new device files (using mknod) and access to
+existing device files.
+
+The cgroup v2 device controller has no interface files and is implemented
+on top of cgroup BPF. To control access to device files, a user may
+create BPF programs of the BPF_CGROUP_DEVICE type and attach them
+to cgroups. On an attempt to access a device file, the corresponding
+BPF programs will be executed, and depending on the return value
+the attempt will succeed or fail with -EPERM.
+
+A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
+structure, which describes the device access attempt: access type
+(mknod/read/write) and device (type, major and minor numbers).
+If the program returns 0, the attempt fails with -EPERM, otherwise
+it succeeds.
+
+An example of a BPF_CGROUP_DEVICE program can be found in the kernel
+source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
+
+
 RDMA
 
 
-- 
2.14.3
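
For reference, a minimal BPF_CGROUP_DEVICE program in the spirit of the
dev_cgroup.c selftest might look like the sketch below (the whitelisted
devices are arbitrary; only the context layout and the return convention
follow the description in the patch):

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(); shipped with the selftests */

SEC("cgroup/dev")
int allow_basic_devices(struct bpf_cgroup_dev_ctx *ctx)
{
	/* the low 16 bits of access_type encode the device type */
	short type = ctx->access_type & 0xFFFF;

	/* consider only char devices with major 1 (memory devices) */
	if (type != BPF_DEVCG_DEV_CHAR || ctx->major != 1)
		return 0;			/* -EPERM */

	switch (ctx->minor) {
	case 3:		/* 1:3  /dev/null */
	case 5:		/* 1:5  /dev/zero */
		return 1;			/* allow */
	}
	return 0;
}

char _license[] SEC("license") = "GPL";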



Re: [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer

2017-12-07 Thread Roman Gushchin
On Wed, Dec 06, 2017 at 05:24:13PM -0800, Andrew Morton wrote:
> 
> As a result of the "stalled MM patches" discussion I've dropped these
> three patches:
> 
> mm,oom: move last second allocation to inside the OOM killer
> mm,oom: use ALLOC_OOM for OOM victim's last second allocation
> mm,oom: remove oom_lock serialization from the OOM reaper
> 
> and I had to rework this patch as a result.  Please carefully check (and
> preferable test) my handiwork in out_of_memory()?

Hi, Andrew!

Reviewed and tested, looks good to me. Thank you!

A couple of small nits below.

> 
> 
> 
> From: Roman Gushchin 
> Subject: mm, oom: cgroup-aware OOM killer
> 
> Traditionally, the OOM killer is operating on a process level.  Under oom
> conditions, it finds a process with the highest oom score and kills it.
> 
> This behavior doesn't suit well the system with many running containers:
> 
> 1) There is no fairness between containers.  A small container with few
>large processes will be chosen over a large one with huge number of
>small processes.
> 
> 2) Containers often do not expect that some random process inside will
>be killed.  In many cases much safer behavior is to kill all tasks in
>the container.  Traditionally, this was implemented in userspace, but
>doing it in the kernel has some advantages, especially in a case of a
>system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> This patch introduces the core functionality: an ability to select a
> memory cgroup as an OOM victim.  Under OOM conditions the OOM killer looks
> for the biggest leaf memory cgroup and kills the biggest task belonging to
> it.
> 
> The following patches will extend this functionality to consider non-leaf
> memory cgroups as OOM victims, and also provide an ability to kill all
> tasks belonging to the victim cgroup.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score is
> compared with other leaf memory cgroups.  Due to memcg statistics
> implementation a special approximation is used for estimating oom_score of
> root memory cgroup: we sum oom_score of the belonging processes (or, to be
> more precise, tasks owning their mm structures).
> 
> Link: http://lkml.kernel.org/r/20171130152824.1591-4-g...@fb.com
> Signed-off-by: Roman Gushchin 
> Cc: Michal Hocko 
> Cc: Johannes Weiner 
> Cc: Vladimir Davydov 
> Cc: Tetsuo Handa 
> Cc: David Rientjes 
> Cc: Tejun Heo 
> Cc: Michal Hocko 
> Signed-off-by: Andrew Morton 
> ---
> 
>  include/linux/memcontrol.h |   17 +++
>  include/linux/oom.h|   12 ++
>  mm/memcontrol.c|  181 +++
>  mm/oom_kill.c  |   72 ++---
>  4 files changed, 262 insertions(+), 20 deletions(-)
> 
> diff -puN include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer 
> include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h~mm-oom-cgroup-aware-oom-killer
> +++ a/include/linux/memcontrol.h
> @@ -35,6 +35,7 @@ struct mem_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> +struct oom_control;
>  
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
> @@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(s
>   return css ? container_of(css, struct mem_cgroup, css) : NULL;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> + css_put(&memcg->css);
> +}
> +
>  #define mem_cgroup_from_counter(counter, member) \
>   container_of(counter, struct mem_cgroup, member)
>  
> @@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(str
>  
>  bool mem_cgroup_oom_synchronize(bool wait);
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc);
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  extern int do_swap_account;
>  #endif
> @@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(st
>   return true;
>  }
>  
> +static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> +{
> +}
> +
>  static inline struct mem_cgroup *
>  mem_cgroup_iter(struct mem_cgroup *root,
>   struct mem_cgroup *prev,
> @@ -973,6 +985,11 @@ static inline
>  void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  {
>  }
> +
> +static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + return false;
> +}
>  #endif /* CONFIG_MEMCG */
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
> diff -puN include/linux/oom.h~mm-oom-cgroup-aware-oom-killer 
> include/linux/oom.h
> --- a/include/linux/oom.h~mm-oom-cgroup-aware-oom-k

Re: [PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer

2017-12-01 Thread Roman Gushchin
On Fri, Dec 01, 2017 at 09:41:54AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:23, Roman Gushchin wrote:
> > @@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may 
> > make sense to use
> >  POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> >  belonging to the affected files to ensure correct memory ownership.
> >  
> > +OOM Killer
> > +~~~~~~~~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats cgroups as first class OOM entities.
> 
> This should mention groupoom mount option to enable this functionality.
> 
> Other than that looks ok to me
> Acked-by: Michal Hocko 
> -- 
> Michal Hocko
> SUSE Labs


From 1d9c87128897ee7f27f9651d75b80f73985373e8 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Fri, 1 Dec 2017 15:34:59 +
Subject: [PATCH] mm, oom, docs: document groupoom mount option

Add a note that cgroup-aware OOM logic is disabled by default
and describe how to enable it.

Signed-off-by: Roman Gushchin 
Cc: Andrew Morton 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/cgroup-v2.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c80a147f94b7..ff8e92db636d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1049,6 +1049,9 @@ PAGE_SIZE multiple when read back.
and will never kill the unkillable task, even if memory.oom_group
is set.
 
+   If the cgroup-aware OOM killer is not enabled, an ENOTSUPP error
+   is returned on any attempt to access the file.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1258,6 +1261,12 @@ OOM Killer
 Cgroup v2 memory controller implements a cgroup-aware OOM killer.
 It means that it treats cgroups as first class OOM entities.
 
+Cgroup-aware OOM logic is turned off by default and requires
+passing the "groupoom" option when mounting cgroupfs. It can also
+be enabled by remounting cgroupfs with the following command::
+
+  # mount -o remount,groupoom $MOUNT_POINT
+
 Under OOM conditions the memory controller tries to make the best
 choice of a victim, looking for a memory cgroup with the largest
 memory footprint, considering leaf cgroups and cgroups with the
-- 
2.14.3



Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-12-01 Thread Roman Gushchin
On Fri, Dec 01, 2017 at 02:31:45PM +0100, Michal Hocko wrote:
> On Fri 01-12-17 13:15:38, Roman Gushchin wrote:
> [...]
> > So, maybe we just need to return -EAGAIN (or may be -ENOTSUP) on any 
> > read/write
> > attempt if option is not enabled?
> 
> Yes, that would work as well. ENOTSUP sounds better to me.
> -- 
> Michal Hocko
> SUSE Labs

From 78bf2c00abf450bcd993d02a7dc1783144005fbd Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Fri, 1 Dec 2017 14:30:14 +
Subject: [PATCH] mm, oom: return error on access to memory.oom_group if
 groupoom is disabled

The cgroup-aware OOM killer depends on a cgroup mount option and is turned
off by default, even though the user interface (the memory.oom_group file)
is always present. As this might be confusing to a user, let's return
-ENOTSUPP on any attempt to access memory.oom_group if groupoom is not
enabled globally.

Example:
  $ cd /sys/fs/cgroup/user.slice/
  $ cat memory.oom_group
cat: memory.oom_group: Unknown error 524
  $ echo 1 > memory.oom_group
-bash: echo: write error: Unknown error 524
  $ mount -o remount,groupoom /sys/fs/cgroup
  $ echo 1 > memory.oom_group
  $ cat memory.oom_group
1

Signed-off-by: Roman Gushchin 
Cc: Andrew Morton 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
---
 mm/memcontrol.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c76d5fb55c5c..b709ee4f914b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5482,6 +5482,9 @@ static int memory_oom_group_show(struct seq_file *m, void 
*v)
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
bool oom_group = memcg->oom_group;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return -ENOTSUPP;
+
seq_printf(m, "%d\n", oom_group);
 
return 0;
@@ -5495,6 +5498,9 @@ static ssize_t memory_oom_group_write(struct 
kernfs_open_file *of,
int oom_group;
int err;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return -ENOTSUPP;
+
err = kstrtoint(strstrip(buf), 0, &oom_group);
if (err)
return err;
-- 
2.14.3



Re: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling

2017-12-01 Thread Roman Gushchin
Hi, Michal!

I totally agree that the out_of_memory() function deserves some refactoring.

But I think there is an issue with your patch (see below):

On Fri, Dec 01, 2017 at 10:14:25AM +0100, Michal Hocko wrote:
> Recently added alloc_pages_before_oomkill gained new caller with this
> patchset and I think it just grown to deserve a simpler code flow.
> What do you think about this on top of the series?
> 
> ---
> From f1f6035ea0df65e7619860b013f2fabdda65233e Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 1 Dec 2017 10:05:25 +0100
> Subject: [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling
> 
> alloc_pages_before_oomkill is the last attempt to allocate memory before
> we go and try to kill a process or a memcg. It's success path always has
> to properly clean up the oc state (namely victim reference count). Let's
> pull this into alloc_pages_before_oomkill directly rather than risk
> somebody will forget to do it in future. Also document that we _know_
> alloc_pages_before_oomkill violates proper layering and that is a
> pragmatic decision.
> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/oom.h |  2 +-
>  mm/oom_kill.c   | 21 +++--
>  mm/page_alloc.c | 24 ++--
>  3 files changed, 26 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 10f495c8454d..7052e0a20e13 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -121,7 +121,7 @@ extern void oom_killer_enable(void);
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> -extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
> +extern bool alloc_pages_before_oomkill(struct oom_control *oc);
>  
>  extern int oom_evaluate_task(struct task_struct *task, void *arg);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4678468bae17..5c2cd299757b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1102,8 +1102,7 @@ bool out_of_memory(struct oom_control *oc)
>   if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>   current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
>   current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> - oc->page = alloc_pages_before_oomkill(oc);
> - if (oc->page)
> + if (alloc_pages_before_oomkill(oc))
>   return true;
>   get_task_struct(current);
>   oc->chosen_task = current;
> @@ -1112,13 +,8 @@ bool out_of_memory(struct oom_control *oc)
>   }
>  
>   if (mem_cgroup_select_oom_victim(oc)) {
> - oc->page = alloc_pages_before_oomkill(oc);
> - if (oc->page) {
> - if (oc->chosen_memcg &&
> - oc->chosen_memcg != INFLIGHT_VICTIM)
> - mem_cgroup_put(oc->chosen_memcg);

You're removing the chosen_memcg release here, but I don't see where you
do it instead. And I'm not sure that putting mem_cgroup_put() into
alloc_pages_before_oomkill() is a way towards simpler code.

I was thinking about a somewhat larger refactoring: splitting out_of_memory()
into the following parts (defined as separate functions): victim selection
(per-process, memcg-aware or just the allocating task), the last allocation
attempt, and the OOM action (kill a process, kill a memcg, panic). Hopefully
that can simplify things, but I don't have code yet.
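
A hypothetical shape for that split, just to illustrate the idea (all
three helper names are invented; this is not code from any tree):

bool out_of_memory(struct oom_control *oc)
{
	/* 1) victim selection: per-process, memcg-aware or current task */
	if (!oom_select_victim(oc))
		return false;

	/* 2) the last allocation attempt before killing anything */
	if (oom_try_last_allocation(oc)) {
		oom_release_victim(oc);	/* drop task/memcg references */
		return true;
	}

	/* 3) the OOM action: kill a process, kill a memcg, or panic */
	return oom_execute_action(oc);
}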

Thanks!



Re: [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-12-01 Thread Roman Gushchin
On Fri, Dec 01, 2017 at 09:41:13AM +0100, Michal Hocko wrote:
> On Thu 30-11-17 15:28:22, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> > 
> > The behavior can be changed dynamically by remounting the cgroupfs.
> 
> Is it ok to create oom_group if the option is not enabled? This looks
> confusing. I forgot all the details about how cgroup core creates file
> so I do not have a good idea how to fix this.

I don't think we show/hide interface files dynamically.
Even for things like socket memory, which can be disabled by a boot option,
we don't hide the corresponding stats entry.

So, maybe we just need to return -EAGAIN (or maybe -ENOTSUP) on any read/write
attempt if the option is not enabled?

Thanks!


[PATCH v13 4/7] mm, oom: introduce memory.oom_group

2017-11-30 Thread Roman Gushchin
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which are not tolerant of such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of a system-wide OOM or
a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed, because
this has been a long-established way to protect a particular process
from an unexpected SIGKILL from the OOM killer. Ignoring this
user-defined configuration might lead to data corruption or other
misbehavior.

The default value is 0.

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h | 17 +++
 mm/memcontrol.c| 75 +++---
 mm/oom_kill.c  | 47 +++--
 3 files changed, 126 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cb4db659a8b5..7b8bcdf6571d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -203,6 +203,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /*
+* Treat the sub-tree as an indivisible memory consumer,
+* kill all belonging tasks if the memory cgroup is selected
+* as an OOM victim.
+*/
+   bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -490,6 +497,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -990,6 +1002,11 @@ static inline bool mem_cgroup_select_oom_victim(struct 
oom_control *oc)
 {
return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 592ffb1c98a7..5d27a4bbd478 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2779,19 +2779,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
*oc)
 {
-   struct mem_cgroup *iter;
+   struct mem_cgroup *iter, *group = NULL;
+   long group_score = 0;
 
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
 
+   /*
+* If OOM is memcg-wide, and the memcg has the oom_group flag set,
+* all tasks belonging to the memcg should be killed.
+* So, we mark the memcg as a victim.
+*/
+   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+   oc->chosen_memcg = oc->memcg;
+   css_get(&oc->chosen_memcg->css);
+   return;
+   }
+
/*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
+* Non-leaf oom_group cgroups accumulate the score of descendant
+* leaf memory cgroups.
 */
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
 
+   /*
+* We don't consider non-leaf non-oom_group memory cgroups
+* as OOM victims.
+*/
+   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+   !mem_cgroup_oom_group(iter))
+   continue;
+
+   /*
+* If group is not set or we've run out of the group's sub-tree,
+* we should set group and reset group_score.
+*/
+   if (!group || group

[PATCH v13 6/7] mm, oom, docs: describe the cgroup-aware OOM killer

2017-11-30 Thread Roman Gushchin
Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 58 +
 1 file changed, 58 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 779211fbb69f..c80a147f94b7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1026,6 +1027,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   If set, the OOM killer will consider the memory cgroup as an
+   indivisible memory consumer and compare it with other memory
+   consumers by its memory footprint.
+   If such a memory cgroup is selected as an OOM victim, all
+   processes belonging to it or its descendants will be killed.
+
+   This applies both to system-wide OOM conditions and to reaching
+   the hard memory limit of the cgroup or any of its ancestors.
+   If the OOM condition happens in a descendant cgroup with its own
+   memory limit, that memory cgroup can't be considered
+   as an OOM victim, and the OOM killer will not kill all belonging
+   tasks.
+
+   Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000
+   and will never kill such an unkillable task, even if memory.oom_group
+   is set.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1229,6 +1252,41 @@ to be accessed repeatedly by other cgroups, it may make 
sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+Cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are considered to be indivisible
+memory consumers.
+
+By default, OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
+Please note that memory charges are not migrated when tasks
+are moved between different memory cgroups. Moving tasks with
+a significant memory footprint may affect the OOM victim selection logic.
+If that is the case, please consider creating a common ancestor for
+the source and destination memory cgroups and enabling oom_group
+at the ancestor level.
+
 
 IO
 --
-- 
2.14.3



[PATCH v13 7/7] cgroup: list groupoom in cgroup features

2017-11-30 Thread Roman Gushchin
List groupoom in the cgroup features list (exported via
/sys/kernel/cgroup/features), which can be used by userspace
apps (most likely systemd) to get an idea of which cgroup features
are supported by the kernel.

Signed-off-by: Roman Gushchin 
Cc: Tejun Heo 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 kernel/cgroup/cgroup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7338e12979e1..693443282fc1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5922,7 +5922,8 @@ static struct kobj_attribute cgroup_delegate_attr = 
__ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 char *buf)
 {
-   return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
+   return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
+   "groupoom\n");
 }
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
 
-- 
2.14.3



[PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-11-30 Thread Roman Gushchin
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c  | 10 ++
 mm/memcontrol.c |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 8b7fd8eeccee..9fb99e25d654 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,6 +81,11 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+   /*
+* Enable cgroup-aware OOM killer.
+*/
+   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0b1ffe147f24..7338e12979e1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1731,6 +1731,9 @@ static int parse_cgroup_root_flags(char *data, unsigned 
int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+   } else if (!strcmp(token, "groupoom")) {
+   *root_flags |= CGRP_GROUP_OOM;
+   continue;
}
 
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1747,6 +1750,11 @@ static void apply_cgroup_root_flags(unsigned int 
root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+   if (root_flags & CGRP_GROUP_OOM)
+   cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+   else
+   cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
 }
 
@@ -1754,6 +1762,8 @@ static int cgroup_show_options(struct seq_file *seq, 
struct kernfs_root *kf_root
 {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+   if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+   seq_puts(seq, ",groupoom");
return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d27a4bbd478..c76d5fb55c5c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2869,6 +2869,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return false;
+
if (oc->memcg)
root = oc->memcg;
else
-- 
2.14.3



[PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-11-30 Thread Roman Gushchin
Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup, so that the cgroup-aware OOM killer can use this
function to look for an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change, as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4aac306ebe3..55fbda60cef6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -888,7 +888,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup 
*dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  int (*fn)(struct task_struct *, void *), void *arg)
@@ -896,8 +897,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
 
-   BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -906,7 +905,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
-   if (ret) {
+   if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
-- 
2.14.3



[PATCH v13 0/7] cgroup-aware OOM killer

2017-11-30 Thread Roman Gushchin
This patchset makes the OOM killer cgroup-aware.

v13:
  - Reverted fallback to per-process OOM as in v11 (asked by Michal)
  - Added entry in cgroup features list
  - Added a note about charge migration
  - Rebase

v12:
  - Root memory cgroup is evaluated based on sum of the oom scores
of belonging tasks
  - Do not fall back to the per-process behavior if
    it wasn't possible to kill a memcg victim
  - Rebase on top of mm tree

v11:
  - Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task first
    if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and its
    memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed
    the way OOM victims get access to the memory
    reserves. Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
    synchronization is based on finding tasks with the OOM victim marker,
    rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: cgro...@vger.kernel.org
Cc: linux...@kvack.org

Roman Gushchin (7):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer
  cgroup: list groupoom in cgroup features

 Documentation/cgroup-v2.txt |  58 ++
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++
 include/linux/oom.h |  12 ++-
 kernel/cgroup/cgroup.c  |  13 ++-
 mm/memcontrol.c | 258 +++-
 mm/oom_kill.c   | 224 +-
 7 files changed, 525 insertions(+), 79 deletions(-)

-- 
2.14.3



[PATCH v13 3/7] mm, oom: cgroup-aware OOM killer

2017-11-30 Thread Roman Gushchin
Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.

The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with those of other leaf memory cgroups.
Due to the memcg statistics implementation, a special approximation
is used for estimating the oom_score of the root memory cgroup: we sum
the oom_score of the belonging processes (or, to be more precise, of
tasks owning their mm structures).
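
Roughly, the approximation amounts to the following (a sketch of the idea
built on mem_cgroup_scan_tasks() from the previous patch, not the exact
code from this one):

/* Callback for mem_cgroup_scan_tasks(): accumulate per-task scores. */
static int accumulate_task_score(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;

	/* oom_badness() returns 0 for unkillable tasks */
	oc->chosen_points += oom_badness(task, NULL, oc->nodemask,
					 oc->totalpages);
	return 0;	/* continue iterating */
}

static unsigned long root_memcg_oom_score(struct oom_control *oc)
{
	oc->chosen_points = 0;
	/* for the root memcg this iterates only directly attached tasks */
	mem_cgroup_scan_tasks(root_mem_cgroup, accumulate_task_score, oc);
	return oc->chosen_points;
}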

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  17 +
 include/linux/oom.h|  12 ++-
 mm/memcontrol.c| 181 +
 mm/oom_kill.c  |  84 +++--
 4 files changed, 272 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 882046863581..cb4db659a8b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -344,6 +345,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -482,6 +488,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -781,6 +789,10 @@ static inline bool task_in_mem_cgroup(struct task_struct 
*task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -973,6 +985,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 27cd36b762b5..10f495c8454d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -10,6 +10,13 @@
 #include  /* MMF_* */
 #include  /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -51,7 +58,8 @@ struct oom_control {
 
/* Used by oom implementation, do not set */
unsigned long totalpages;
-   struct task_struct *chosen;
+   struct task_struct *chosen_task;
+   struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
 };
 
@@ -115,6 +123,8 @@ extern struct task_struct *find_lock_task_mm(struct 
task_struct *p);
 
 extern struct page *alloc_pages_before_oomkill(const struct oom_control *oc);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 55fbda60cef6..592ffb1c98a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2664,6 +2664,187 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
 

[PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function

2017-11-30 Thread Roman Gushchin
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children as
potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention to re-use the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, or play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 123 +++---
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3b0d0fed8480..f041534d77d3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -814,68 +814,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-   struct task_struct *p = oc->chosen;
-   unsigned int points = oc->chosen_points;
-   struct task_struct *victim = p;
-   struct task_struct *child;
-   struct task_struct *t;
+   struct task_struct *p;
struct mm_struct *mm;
-   unsigned int victim_points = 0;
-   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
 
-   /*
-* If the task is already exiting, don't alarm the sysadmin or kill
-* its children or threads, just give it access to memory reserves
-* so it can die quickly
-*/
-   task_lock(p);
-   if (task_will_free_mem(p)) {
-   mark_oom_victim(p);
-   wake_oom_reaper(p);
-   task_unlock(p);
-   put_task_struct(p);
-   return;
-   }
-   task_unlock(p);
-
-   if (__ratelimit(&oom_rs))
-   dump_header(oc, p);
-
-   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-
-   /*
-* If any of p's children has a different mm and is eligible for kill,
-* the one with the highest oom_badness() score is sacrificed for its
-* parent.  This attempts to lose the minimal amount of work done while
-* still freeing memory.
-*/
-   read_lock(&tasklist_lock);
-   for_each_thread(p, t) {
-   list_for_each_entry(child, &t->children, sibling) {
-   unsigned int child_points;
-
-   if (process_shares_mm(child, p->mm))
-   continue;
-   /*
-* oom_badness() returns 0 if the thread is unkillable
-*/
-   child_points = oom_badness(child,
-   oc->memcg, oc->nodemask, oc->totalpages);
-   if (child_points > victim_points) {
-   put_task_struct(victim);
-   victim = child;
-   victim_points = child_points;
-   get_task_struct(victim);
-   }
-   }
-   }
-   read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -949,6 +893,69 @@ static void oom_kill_process(struct oom_control *oc, const 
char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+   struct task_struct *p = oc->chosen;
+   unsigned int points = oc->chosen_points;
+   struct task_struct *victim = p;
+   struct task_struct *child;
+   struct task_struct *t;
+   unsigned int victim_points = 0;
+   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   /*
+* If the task is already exiting, don't alarm the sysadmin or kill
+* its children or threads, just give it access to memory reserves
+* so it can die quickly
+*/
+   task_lock(p);
+   if (task_will_free_mem(p)) {
+   mark_oom_victim(p);
+  

Re: [RESEND v12 0/6] cgroup-aware OOM killer

2017-10-27 Thread Roman Gushchin
On Thu, Oct 26, 2017 at 02:03:41PM -0700, David Rientjes wrote:
> On Thu, 26 Oct 2017, Johannes Weiner wrote:
> 
> > > The nack is for three reasons:
> > > 
> > >  (1) unfair comparison of root mem cgroup usage to bias against that mem 
> > >  cgroup from oom kill in system oom conditions,
> > > 
> > >  (2) the ability of users to completely evade the oom killer by attaching
> > >  all processes to child cgroups either purposefully or unpurposefully,
> > >  and
> > > 
> > >  (3) the inability of userspace to effectively control oom victim  
> > >  selection.
> > 
> > My apologies if my summary was too reductionist.
> > 
> > That being said, the arguments you repeat here have come up in
> > previous threads and been responded to. This doesn't change my
> > conclusion that your NAK is bogus.
> > 
> 
> They actually haven't been responded to, Roman was working through v11 and 
> made a change on how the root mem cgroup usage was calculated that was 
> better than previous iterations but still not an apples to apples 
> comparison with other cgroups.  The problem is that it the calculation for 
> leaf cgroups includes additional memory classes, so it biases against 
> processes that are moved to non-root mem cgroups.  Simply creating mem 
> cgroups and attaching processes should not independently cause them to 
> become more preferred: it should be a fair comparison between the root mem 
> cgroup and the set of leaf mem cgroups as implemented.  That is very 
> trivial to do with hierarchical oom cgroup scoring.
> 
> Since the ability of userspace to control oom victim selection is not 
> addressed whatsoever by this patchset, and the suggested method cannot be 
> implemented on top of this patchset as you have argued because it requires 
> a change to the heuristic itself, the patchset needs to become complete 
> before being mergeable.

Hi David!

The thing is that the hierarchical approach (as in v8), which you are
pushing, has its own limitations, which we've discussed in detail earlier.
There are reasons why v12 is different, and we can't really simply go back.
I mean, if there are better ideas on how to resolve the concerns raised in
the discussions around v8, let me know, but ignoring them is not an option.

From my point of view, the idea of selecting the biggest memcg tree-wide is
perfectly fine, as long as it's possible to group memcgs into OOM domains.
As in v12, this can be done by setting the memory.oom_group knob.
It's perfectly possible to extend this by adding an ability to continue OOM
victim selection inside the selected memcg instead of killing all belonging
tasks, if a practical need arises.

The way we evaluate the root memory cgroup isn't as important as the
question of how we compare cgroups in the hierarchy. So even if the
hierarchical approach allows a fairer comparison, that's not a reason to
choose it, because there are more serious concerns, discussed earlier.

Thanks!


[RESEND v12 4/6] mm, oom: introduce memory.oom_group

2017-10-19 Thread Roman Gushchin
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which are not tolerant of such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of a system-wide OOM or
a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed, because
this has been a long-established way to protect a particular process
from an unexpected SIGKILL from the OOM killer. Ignoring this
user-defined configuration might lead to data corruption or other
misbehavior.

The default value is 0.

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h | 17 +++
 mm/memcontrol.c| 75 +++---
 mm/oom_kill.c  | 49 +++---
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /*
+* Treat the sub-tree as an indivisible memory consumer,
+* kill all belonging tasks if the memory cgroup selected
+* as OOM victim.
+*/
+   bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 {
return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f364bfed745f..ad10dbdf723b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2785,19 +2785,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
-   struct mem_cgroup *iter;
+   struct mem_cgroup *iter, *group = NULL;
+   long group_score = 0;
 
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
 
/*
+* If OOM is memcg-wide, and the memcg has the oom_group flag set,
+* all tasks belonging to the memcg should be killed.
+* So, we mark the memcg as a victim.
+*/
+   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+   oc->chosen_memcg = oc->memcg;
+   css_get(&oc->chosen_memcg->css);
+   return;
+   }
+
+   /*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
+* Non-leaf oom_group cgroups accumulate the scores of their
+* descendant leaf memory cgroups.
 */
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
 
+   /*
+* We don't consider non-leaf non-oom_group memory cgroups
+* as OOM victims.
+*/
+   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+   !mem_cgroup_oom_group(iter))
+   continue;
+
+   /*
+* If group is not set or we've run out of the group's sub-tree,
+* we should set group and reset group_score.
+*/
+   if (!group || group

[RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
Traditionally, the OOM killer operates on a process level.
Under OOM conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

This patch introduces the core functionality: an ability to select
a memory cgroup as an OOM victim. Under OOM conditions the OOM killer
looks for the biggest leaf memory cgroup and kills the biggest
task belonging to it.

The following patches will extend this functionality to consider
non-leaf memory cgroups as OOM victims, and also provide an ability
to kill all tasks belonging to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with other leaf memory cgroups.
Due to the memcg statistics implementation, a special approximation
is used for estimating the oom_score of the root memory cgroup: we sum
the oom_score of the belonging processes (or, to be more precise,
tasks owning their mm structures).
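
In other words, the approximation boils down to the following sketch
(simplified; the real hunk is truncated below, and locking and
eligibility details are omitted):

static long root_memcg_oom_score(struct oom_control *oc)
{
	struct css_task_iter it;
	struct task_struct *task;
	long score = 0;

	/* Iterates only tasks attached directly to the root cgroup. */
	css_task_iter_start(&root_mem_cgroup->css, 0, &it);
	while ((task = css_task_iter_next(&it))) {
		/* Count each mm once: skip threads not owning their mm. */
		if (task->mm && task == task->mm->owner)
			score += oom_badness(task, NULL, oc->nodemask,
					     oc->totalpages);
	}
	css_task_iter_end(&it);

	return score;
}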

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  17 +
 include/linux/oom.h|  12 ++-
 mm/memcontrol.c| 181 +
 mm/oom_kill.c  |  72 +-
 4 files changed, 262 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include  /* MMF_* */
 #include  /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are in-flight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
/* Used by oom implementation, do not set */
unsigned long totalpages;
-   struct task_struct *chosen;
+   struct task_struct *chosen_task;
+   struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d30a45a4bbe..f364bfed745f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,187 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
 }
 
+static long 

[RESEND v12 1/6] mm, oom: refactor the oom_kill_process() function

2017-10-19 Thread Roman Gushchin
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering task's children as
a potential victim and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to reuse the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, or play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().
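
In outline, the result of the split looks like this (the diff below is
truncated by the archive before the new oom_kill_process() body ends):

static void __oom_kill_process(struct task_struct *victim)
{
	/*
	 * Mechanism half: mark the victim, send SIGKILL to it and to
	 * every process sharing its mm, and hand it to the OOM reaper.
	 */
}

static void oom_kill_process(struct oom_control *oc, const char *message)
{
	/*
	 * Policy half: bail out if the victim is already exiting,
	 * rate-limit and dump the OOM report, possibly sacrifice a
	 * child with a higher oom_badness() score, then call
	 * __oom_kill_process() on the final choice.
	 */
	__oom_kill_process(oc->chosen);
}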

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 123 +++---
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26add8a0d1f7..0b9f36117989 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -842,68 +842,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-   struct task_struct *p = oc->chosen;
-   unsigned int points = oc->chosen_points;
-   struct task_struct *victim = p;
-   struct task_struct *child;
-   struct task_struct *t;
+   struct task_struct *p;
struct mm_struct *mm;
-   unsigned int victim_points = 0;
-   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
 
-   /*
-* If the task is already exiting, don't alarm the sysadmin or kill
-* its children or threads, just give it access to memory reserves
-* so it can die quickly
-*/
-   task_lock(p);
-   if (task_will_free_mem(p)) {
-   mark_oom_victim(p);
-   wake_oom_reaper(p);
-   task_unlock(p);
-   put_task_struct(p);
-   return;
-   }
-   task_unlock(p);
-
-   if (__ratelimit(&oom_rs))
-   dump_header(oc, p);
-
-   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-
-   /*
-* If any of p's children has a different mm and is eligible for kill,
-* the one with the highest oom_badness() score is sacrificed for its
-* parent.  This attempts to lose the minimal amount of work done while
-* still freeing memory.
-*/
-   read_lock(&tasklist_lock);
-   for_each_thread(p, t) {
-   list_for_each_entry(child, &t->children, sibling) {
-   unsigned int child_points;
-
-   if (process_shares_mm(child, p->mm))
-   continue;
-   /*
-* oom_badness() returns 0 if the thread is unkillable
-*/
-   child_points = oom_badness(child,
-   oc->memcg, oc->nodemask, oc->totalpages);
-   if (child_points > victim_points) {
-   put_task_struct(victim);
-   victim = child;
-   victim_points = child_points;
-   get_task_struct(victim);
-   }
-   }
-   }
-   read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -977,6 +921,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+   struct task_struct *p = oc->chosen;
+   unsigned int points = oc->chosen_points;
+   struct task_struct *victim = p;
+   struct task_struct *child;
+   struct task_struct *t;
+   unsigned int victim_points = 0;
+   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   /*
+* If the task is already exiting, don't alarm the sysadmin or kill
+* its children or threads, just give it access to memory reserves
+* so it can die quickly
+*/
+   task_lock(p);
+   if (task_will_free_mem(p)) {
+   mark_oom_victim(p);
+  

[RESEND v12 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-19 Thread Roman Gushchin
Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup, so the cgroup-aware OOM killer can use this
function to look for an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.
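
As a toy illustration of the iteration contract (the callback below is
made up; a non-zero return value from the callback stops the walk):

static int count_task(struct task_struct *task, void *arg)
{
	(*(int *)arg)++;
	return 0;	/* non-zero would break the iteration */
}

static int count_root_tasks(void)
{
	int nr = 0;

	/*
	 * With this patch, passing root_mem_cgroup is allowed and
	 * visits only tasks attached directly to the root cgroup.
	 */
	mem_cgroup_scan_tasks(root_mem_cgroup, count_task, &nr);
	return nr;
}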

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50e6906314f8..1d30a45a4bbe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
 
-   BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
-   if (ret) {
+   if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
-- 
2.13.6



[RESEND v12 0/6] cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
This patchset makes the OOM killer cgroup-aware.

v12:
  - Root memory cgroup is evaluated based on sum of the oom scores
of belonging tasks
  - Do not fall back to the per-process behavior if it wasn't
    possible to kill a memcg victim
  - Rebase on top of mm tree

v11:
  - Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task in the first order,
if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed
    the way OOM victims get access to the memory reserves.
    Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
synchronization is based on finding tasks with OOM victim marker,
    rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++
 include/linux/oom.h |  12 ++-
 kernel/cgroup/cgroup.c  |  10 ++
 mm/memcontrol.c | 258 +++-
 mm/oom_kill.c   | 212 
 7 files changed, 506 insertions(+), 76 deletions(-)

-- 
2.13.6



[RESEND v12 6/6] mm, oom, docs: describe the cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 0bbdc720dd7c..69db5bf9c580 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1031,6 +1032,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   If set, the OOM killer will consider the memory cgroup as an
+   indivisible memory consumer and compare it with other memory
+   consumers by its memory footprint.
+   If such a memory cgroup is selected as an OOM victim, all
+   processes belonging to it or its descendants will be killed.
+
+   This applies to system-wide OOM conditions and to reaching
+   the hard memory limit of the cgroup or any of its ancestors.
+   If the OOM condition happens in a descendant cgroup with its own
+   memory limit, the memory cgroup can't be considered
+   as an OOM victim, and the OOM killer will not kill all belonging
+   tasks.
+
+   Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000,
+   and will never kill such an unkillable task, even if memory.oom_group
+   is set.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1234,6 +1257,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are treated as indivisible
+memory consumers.
+
+By default, the OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer uses the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6



[RESEND v12 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-19 Thread Roman Gushchin
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM victim selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c  | 10 ++
 mm/memcontrol.c |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+   /*
+* Enable cgroup-aware OOM killer.
+*/
+   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c7086c8835da..0e1685ca1d7b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+   } else if (!strcmp(token, "groupoom")) {
+   *root_flags |= CGRP_GROUP_OOM;
+   continue;
}
 
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+   if (root_flags & CGRP_GROUP_OOM)
+   cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+   else
+   cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+   if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+   seq_puts(seq, ",groupoom");
return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ad10dbdf723b..eb1e15385782 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2875,6 +2875,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return false;
+
if (oc->memcg)
root = oc->memcg;
else
-- 
2.13.6



Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-13 Thread Roman Gushchin
On Thu, Oct 12, 2017 at 02:50:38PM -0700, David Rientjes wrote:
> On Wed, 11 Oct 2017, Roman Gushchin wrote:
> 
> Think about it in a different way: we currently compare per-process usage 
> and userspace has /proc/pid/oom_score_adj to adjust that usage depending 
> on priorities of that process and still oom kill if there's a memory leak.  
> Your heuristic compares per-cgroup usage, it's the cgroup-aware oom killer 
> after all.  We don't need a strict memory.oom_priority that outranks all 
> other sibling cgroups regardless of usage.  We need a memory.oom_score_adj 
> to adjust the per-cgroup usage.  The decisionmaking in your earlier 
> example would be under the control of C/memory.oom_score_adj and 
> D/memory.oom_score_adj.  Problem solved.
> 
> It also solves the problem of userspace being able to influence oom victim 
> selection so now they can protect important cgroups just like we can 
> protect important processes today.
> 
> And since this would be hierarchical usage, you can trivially infer root 
> mem cgroup usage by subtraction of top-level mem cgroup usage.
> 
> This is a powerful solution to the problem and gives userspace the control 
> they need so that it can work in all usecases, not a subset of usecases.

You're right that a per-cgroup oom_score_adj may resolve the issue with
the overly strict semantics of oom_priorities. But I believe nobody likes
the existing per-process oom_score_adj interface, and there are reasons for that.
Especially in the case of a memcg OOM, understanding how exactly oom_score_adj
will work is not trivial.
For example, earlier in this thread I've shown an example where the decision
which of two processes should be killed depends on whether it's a global or
a memcg-wide OOM, despite both belonging to a single cgroup!

Of course, it's technically trivial to implement some analog of oom_score_adj
for cgroups (and early versions of this patchset did that).
But the right question is: is this an interface we want to support
for many years to come? I'm not sure.


Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread Roman Gushchin
On Wed, Oct 11, 2017 at 01:21:47PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > We don't need a better approximation, we need a fair comparison.  The 
> > > heuristic that this patchset is implementing is based on the usage of 
> > > individual mem cgroups.  For the root mem cgroup to be considered 
> > > eligible, we need to understand its usage.  That usage is _not_ what is 
> > > implemented by this patchset, which is the largest rss of a single 
> > > attached process.  This, in fact, is not an "approximation" at all.  In 
> > > the example of 1 processes attached with 80MB rss each, the usage of 
> > > the root mem cgroup is _not_ 80MB.
> > 
> > It's hard to imagine a "healthy" setup with 1 process in the root
> > memory cgroup, and even if we kill 1 process we will still have the
> > remaining processes. I agree with you to some extent, but it's not
> > a real-world example.
> > 
> 
> It's an example that illustrates the problem with the unfair comparison 
> between the root mem cgroup and leaf mem cgroups.  It's unfair to compare 
> [largest rss of a single process attached to a cgroup] to
> [anon + unevictable + unreclaimable slab usage of a cgroup].  It's not an 
> approximation, as previously stated: the usage of the root mem cgroup is 
> not 100MB if there are 10 such processes attached to the root mem cgroup, 
> it's off by orders of magnitude.
> 
> For the root mem cgroup to be treated equally as a leaf mem cgroup as this 
> patchset proposes, it must have a fair comparison.  That can be done by 
> accounting memory to the root mem cgroup in the same way it is to leaf mem 
> cgroups.
> 
> But let's move the discussion forward to fix it.  To avoid necessarily 
> accounting memory to the root mem cgroup, have we considered if it is even 
> necessary to address the root mem cgroup?  For the users who opt-in to 
> this heuristic, would it be possible to discount the root mem cgroup from 
> the heuristic entirely so that oom kills originate from leaf mem cgroups?  
> Or, perhaps better, oom kill from non-memory.oom_group cgroups only if 
> the victim rss is greater than an eligible victim rss attached to the root 
> mem cgroup?

David, I'm not claiming to implement the best possible accounting
for the root memory cgroup, and I'm sure there is room for further
enhancement. But if it's not leading to some obviously stupid victim
selection (like ignoring a leaking task which consumes most of the memory),
I don't see why it should be treated as a blocker for the whole patchset.
I also doubt that any of us has such examples, and the best way to get
them is to gather some real usage feedback.

Ignoring oom_score_adj, subtracting the leaf usage sum from the system
usage, etc. are all perfectly good ideas which can be implemented on top
of this patchset.

> 
> > > For these reasons: unfair comparison of root mem cgroup usage to bias 
> > > against that mem cgroup from oom kill in system oom conditions, the 
> > > ability of users to completely evade the oom killer by attaching all 
> > > processes to child cgroups either purposefully or unpurposefully, and the 
> > > inability of userspace to effectively control oom victim selection:
> > > 
> > > Nacked-by: David Rientjes 
> > 
> > So, if we'll sum the oom_score of tasks belonging to the root memory cgroup,
> > will it fix the problem?
> > 
> > It might have some drawbacks as well (especially around oom_score_adj),
> > but it's doable, if we'll ignore tasks which are not owners of their's mm 
> > struct.
> > 
> 
> You would be required to discount oom_score_adj because the heuristic 
> doesn't account for oom_score_adj when comparing the anon + unevictable + 
> unreclaimable slab of leaf mem cgroups.  This wouldn't result in the 
> correct victim selection in real-world scenarios where processes attached 
> to the root mem cgroup are vital to the system and not part of any user 
> job, i.e. they are important system daemons and the "activity manager" 
> responsible for orchestrating the cgroup hierarchy.
> 
> It's also still unfair because it now compares
> [sum of rss of processes attached to a cgroup] to
> [anon + unevictable + unreclaimable slab usage of a cgroup].  RSS isn't 
> going to be a solution, regardless if its one process or all processes, if 
> it's being compared to more types of memory in leaf cgroups.
> 
> If we really don't want root mem cgroup accounting so this is a fair 
> comparison, I think the heuristic needs to

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-11 Thread Roman Gushchin
On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process 
> > > size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  
> > > Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats
> > for root memory cgroup. If we'll start gathering them at some point,
> > we can change this and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 1 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 1 processes of 80MB rss each is the most 
> likely contributor to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> the example of 1 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.
> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: 
> > > considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other leaf cgroups did not do this.
> > > 
> > > Effectively:
> > > 
> > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > > 
> > > will radically shift the heuristi

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread Roman Gushchin
On Tue, Oct 10, 2017 at 02:13:00PM -0700, David Rientjes wrote:
> On Tue, 10 Oct 2017, Roman Gushchin wrote:
> 
> > > This seems to unfairly bias the root mem cgroup depending on process 
> > > size.  
> > > It isn't treated fairly as a leaf mem cgroup if they are being compared 
> > > based on different criteria: the root mem cgroup as (mostly) the largest 
> > > rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> > > unreclaimable slab pages charged to it by all processes.
> > > 
> > > I imagine a configuration where the root mem cgroup has 100 processes 
> > > attached each with rss of 80MB, compared to a leaf cgroup with 100 
> > > processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> > > killing the processes of 1MB rss?
> > > 
> > > In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> > > quite fair, it can simply hide large processes from being selected.  
> > > Users 
> > > who configure cgroups in a unified hierarchy for other resource 
> > > constraints are penalized for this choice even though the mem cgroup with 
> > > 100 processes of 1MB rss each may not be limited itself.
> > > 
> > > I think for this comparison to be fair, it requires accounting for the 
> > > root mem cgroup itself or for a different accounting methodology for leaf 
> > > memory cgroups.
> > 
> > This is basically a workaround, because we don't have necessary stats
> > for root memory cgroup. If we'll start gathering them at some point,
> > we can change this and treat root memcg exactly as other leaf cgroups.
> > 
> 
> I understand why it currently cannot be an apples vs apples comparison 
> without, as I suggest in the last paragraph, that the same accounting is 
> done for the root mem cgroup, which is intuitive if it is to be considered 
> on the same basis as leaf mem cgroups.
> 
> I understand for the design to work that leaf mem cgroups and the root mem 
> cgroup must be compared if processes can be attached to the root mem 
> cgroup.  My point is that it is currently completely unfair as I've 
> stated: you can have 1 processes attached to the root mem cgroup with 
> rss of 80MB each and a leaf mem cgroup with 100 processes of 1MB rss each 
> and the oom killer is going to target the leaf mem cgroup as a result of 
> this apples vs oranges comparison.
> 
> In case it's not clear, the 1 processes of 80MB rss each is the most 
> likely contributor to a system-wide oom kill.  Unfortunately, the 
> heuristic introduced by this patchset is broken wrt a fair comparison of 
> the root mem cgroup usage.
> 
> > Or, if someone will come with an idea of a better approximation, it can be
> > implemented as a separate enhancement on top of the initial implementation.
> > This is more than welcome.
> > 
> 
> We don't need a better approximation, we need a fair comparison.  The 
> heuristic that this patchset is implementing is based on the usage of 
> individual mem cgroups.  For the root mem cgroup to be considered 
> eligible, we need to understand its usage.  That usage is _not_ what is 
> implemented by this patchset, which is the largest rss of a single 
> attached process.  This, in fact, is not an "approximation" at all.  In 
> the example of 1 processes attached with 80MB rss each, the usage of 
> the root mem cgroup is _not_ 80MB.

It's hard to imagine a "healthy" setup with 1 process in the root
memory cgroup, and even if we kill 1 process we will still have the
remaining processes. I agree with you to some extent, but it's not
a real-world example.

> 
> I'll restate that oom killing a process is a last resort for the kernel, 
> but it also must be able to make a smart decision.  Targeting dozens of 
> 1MB processes instead of 80MB processes because of a shortcoming in this 
> implementation is not the appropriate selection, it's the opposite of the 
> correct selection.
> 
> > > I'll reiterate what I did on the last version of the patchset: 
> > > considering 
> > > only leaf memory cgroups easily allows users to defeat this heuristic and 
> > > bias against all of their memory usage up to the largest process size 
> > > amongst the set of processes attached.  If the user creates N child mem 
> > > cgroups for their N processes and attaches one process to each child, the 
> > > _only_ thing this achieved is to defeat your heuristic and prefer other 
> > > leaf cgroups simply because those other 

Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-10 Thread Roman Gushchin
On Mon, Oct 09, 2017 at 02:52:53PM -0700, David Rientjes wrote:
> On Thu, 5 Oct 2017, Roman Gushchin wrote:
> 
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> 
> I'd move the second point to the changelog for the next patch since this 
> patch doesn't implement any support for memory.oom_group.

There is a special remark later in the changelog explaining that
this functionality will be added by the following patches. I thought
it was useful to have all the basic ideas in one place.

> 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest leaf memory cgroup
> > and kills the biggest task belonging to it. The following patches
> > will extend this functionality to consider non-leaf memory cgroups
> > as well, and also provide an ability to kill all tasks belonging
> > to the victim cgroup.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with leaf memory cgroups.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating it's oom_score: we define it as maximum
> > oom_score of the belonging tasks.
> > 
> 
> This seems to unfairly bias the root mem cgroup depending on process size.  
> It isn't treated fairly as a leaf mem cgroup if they are being compared 
> based on different criteria: the root mem cgroup as (mostly) the largest 
> rss of a single process vs leaf mem cgroups as all anon, unevictable, and 
> unreclaimable slab pages charged to it by all processes.
> 
> I imagine a configuration where the root mem cgroup has 100 processes 
> attached each with rss of 80MB, compared to a leaf cgroup with 100 
> processes of 1MB rss each.  How does this logic prevent repeatedly oom 
> killing the processes of 1MB rss?
> 
> In this case, "the root cgroup is treated as a leaf memory cgroup" isn't 
> quite fair, it can simply hide large processes from being selected.  Users 
> who configure cgroups in a unified hierarchy for other resource 
> constraints are penalized for this choice even though the mem cgroup with 
> 100 processes of 1MB rss each may not be limited itself.
> 
> I think for this comparison to be fair, it requires accounting for the 
> root mem cgroup itself or for a different accounting methodology for leaf 
> memory cgroups.

This is basically a workaround, because we don't have the necessary stats
for the root memory cgroup. If we start gathering them at some point, we can
change this and treat the root memcg exactly as other leaf cgroups.

Or, if someone comes up with an idea of a better approximation, it can be
implemented as a separate enhancement on top of the initial implementation.
This is more than welcome.

> 
> I'll reiterate what I did on the last version of the patchset: considering 
> only leaf memory cgroups easily allows users to defeat this heuristic and 
> bias against all of their memory usage up to the largest process size 
> amongst the set of processes attached.  If the user creates N child mem 
> cgroups for their N processes and attaches one process to each child, the 
> _only_ thing this achieved is to defeat your heuristic and prefer other 
> leaf cgroups simply because those other leaf cgroups did not do this.
> 
> Effectively:
> 
> for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> 
> will radically shift the heuristic from a score of all anonymous + 
> unevictable memory for all processes to a score of the largest anonymous +
> unevictable memory for a single process.  There is no downside or 
> ramifaction for the end user in doing this.  When comparing cgroups based 
> on usage, it only makes sense to compare the hierarchical usage of that 
> cgroup so that attaching processes to descendants or splitting the 
> implementation of a process into several smaller individual processes does 
> not allow this heuristic to be 

Re: [v11 4/6] mm, oom: introduce memory.oom_group

2017-10-06 Thread Roman Gushchin
On Thu, Oct 05, 2017 at 04:31:04PM +0200, Michal Hocko wrote:
> Btw. here is how I would do the recursive oom badness. The diff is not
> the nicest one because there is some code moving but the resulting code
> is smaller and imho easier to grasp. Only compile tested though

Thanks!

I'm not against this approach, and maybe it can lead to better code,
but the version you sent is just not there yet.

There are some problems with it:

1) If there are nested cgroups with oom_group set, you will calculate
the badness multiple times, and rely on the fact that the top memcg will
end up with the largest score. It can be optimized, of course, but that's
additional code.

2) cgroup_has_tasks() probably requires additional locking.
Maybe it's ok to read nr_populated_csets without explicit locking,
but it's not obvious to me.

3) Returning -1 from memcg_oom_badness() when eligible is equal to 0
is suspicious.

Right now your version has exactly the same amount of code
(not counting comments). I assume this approach just requires some
additional thinking/rework.

Anyway, thank you for sharing this!

> ---
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 085056e562b1..9cdba4682198 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -122,6 +122,11 @@ void cgroup_free(struct task_struct *p);
>  int cgroup_init_early(void);
>  int cgroup_init(void);
>  
> +static bool cgroup_has_tasks(struct cgroup *cgrp)
> +{
> + return cgrp->nr_populated_csets;
> +}
> +
>  /*
>   * Iteration helpers and macros.
>   */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 8dacf73ad57e..a2dd7e3ffe23 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -319,11 +319,6 @@ static void cgroup_idr_remove(struct idr *idr, int id)
>   spin_unlock_bh(&cgroup_idr_lock);
>  }
>  
> -static bool cgroup_has_tasks(struct cgroup *cgrp)
> -{
> - return cgrp->nr_populated_csets;
> -}
> -
>  bool cgroup_is_threaded(struct cgroup *cgrp)
>  {
>   return cgrp->dom_cgrp != cgrp;


Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> > 
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> > 
> > That's not good from an interface perspective.
> 
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

So, we would rely on the grub config to set up the OOM policy? Sounds weird.

We use boot options when it's hard to implement on-the-fly switching
(like turning socket memory accounting on/off), but that's not the case here.

> 
> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> > 
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
> 
> yes we have changed that without a deeper considerations. Some of those
> changes are arguable (e.g. child scarification). The oom badness
> heuristic rewrite has triggered quite some complains AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.
> 
> More importantly I do not think that the current (non-memcg aware) OOM
> policy is somehow obsolete and many people expect it to behave
> consistently. As I've said already, I have seen many complains that the
> OOM killer doesn't kill the right task. Most of them were just NUMA
> related issues where the oom report was not clear enough. I do not want
> to repeat that again now. Memcg awareness is certainly a useful
> heuristic but I do not see it universally applicable to all workloads.
> 
> > Let's not make the userspace interface crap because of some misguided
> > idea that the OOM heuristic is a hard promise to userspace. It's never
> > been, and nobody has complained about changes in the past.
> > 
> > This case is doubly silly, as the behavior change only applies to
> > cgroup2, which doesn't exactly have a large base of legacy users yet.
> 
> I agree on the interface part but I disagree with making it default just
> because v2 is not largerly adopted yet.

I believe that the only real regression can be caused by active use of
oom_score_adj. I really don't know how many cgroup v2 users are relying
on it (hopefully, 0).

So, personally, I would prefer to have an opt-out cgroup v2 mount option
(sane new behavior for most users, 100% backward compatibility for rare
strange setups), but I don't have a very strong opinion here.

Thanks!


[v11 0/6] cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
This patchset makes the OOM killer cgroup-aware.

v11:
  - Fixed an issue with skipping the root mem cgroup
(discovered by Shakeel Butt)
  - Moved a check in __oom_kill_process() to the memory.oom_group
patch, added corresponding comments
  - Added a note about ignoring tasks with oom_score_adj -1000
(proposed by Michal Hocko)
  - Rebase on top of mm tree

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task in the first order,
if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to the tree-wide search,
make related refactorings
  - Make oom_group implicitly propagated down by the tree
  - Fix an issue with task selection in root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and it's
memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed
    the way OOM victims get access to the memory reserves.
    Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
synchronization is based on finding tasks with OOM victim marker,
    rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++
 include/linux/oom.h |  12 ++-
 kernel/cgroup/cgroup.c  |  10 ++
 mm/memcontrol.c | 249 +++-
 mm/oom_kill.c   | 210 -
 7 files changed, 495 insertions(+), 76 deletions(-)

-- 
2.13.6



[v11 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-05 Thread Roman Gushchin
Implement mem_cgroup_scan_tasks() functionality for the root
memory cgroup, so the cgroup-aware OOM killer can use this
function to look for an OOM victim task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
which belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7410636fadf..41d71f665550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup 
*dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
 
-   BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
-   if (ret) {
+   if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
-- 
2.13.6
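
For context, a minimal sketch of how the cgroup-aware OOM killer can use
this on the root memcg (oom_evaluate_task() and root_mem_cgroup as exported
elsewhere in this series; the wrapper name is made up, error handling
elided):

/*
 * Walk only the tasks attached directly to the root memory cgroup;
 * oom_evaluate_task() records the best candidate in oc and returns
 * non-zero to stop the walk early.
 */
static void oom_scan_root_memcg(struct oom_control *oc)
{
	mem_cgroup_scan_tasks(root_mem_cgroup, oom_evaluate_task, oc);
}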



[v11 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c  | 10 ++
 mm/memcontrol.c |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+   /*
+* Enable cgroup-aware OOM killer.
+*/
+   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned 
int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+   } else if (!strcmp(token, "groupoom")) {
+   *root_flags |= CGRP_GROUP_OOM;
+   continue;
}
 
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int 
root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+   if (root_flags & CGRP_GROUP_OOM)
+   cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+   else
+   cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, 
struct kernfs_root *kf_root
 {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+   if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+   seq_puts(seq, ",groupoom");
return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5acb278b11a..fe6155d827c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2866,6 +2866,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return false;
+
if (oc->memcg)
root = oc->memcg;
else
-- 
2.13.6
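
As a usage sketch (not part of the patch): toggling the option at run time
from C via mount(2), assuming cgroup2 is mounted at /sys/fs/cgroup:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Remount cgroupfs with "groupoom" to enable the cgroup-aware
	 * OOM killer. Note that apply_cgroup_root_flags() above clears
	 * options that are not passed, so e.g. "nsdelegate" has to be
	 * repeated in the option string if it was set before.
	 */
	if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT, "groupoom"))
		perror("remount cgroup2 with groupoom");
	return 0;
}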



[v11 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest leaf memory cgroup
and kills the biggest task belonging to it. The following patches
will extend this functionality to consider non-leaf memory cgroups
as well, and also provide an ability to kill all tasks belonging
to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with those of leaf memory cgroups.
Due to the memcg statistics implementation, a special algorithm
is used for estimating its oom_score: we define it as the maximum
oom_score of the belonging tasks.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  17 +
 include/linux/oom.h|  12 +++-
 mm/memcontrol.c| 172 +
 mm/oom_kill.c  |  70 +-
 4 files changed, 251 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct 
*task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include  /* MMF_* */
 #include  /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are in-flight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
/* Used by oom implementation, do not set */
unsigned long totalpages;
-   struct task_struct *chosen;
+   struct task_struct *chosen_task;
+   struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 41d71f665550..191b70735f1f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+   long points = 0;
+   int 
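
In outline, the selection described in the changelog boils down to this
condensed sketch of select_victim_memcg() (simplified from the code in this
patch and its v10 counterpart below; the in-flight-victim case is elided):

	rcu_read_lock();
	for_each_mem_cgroup_tree(iter, root) {
		long score;

		/* Only leaf memcgs (and the root, treated as a leaf)
		 * are scored. */
		if (memcg_has_children(iter) && iter != root_mem_cgroup)
			continue;

		score = oom_evaluate_memcg(iter, oc->nodemask,
					   oc->totalpages);
		if (score > oc->chosen_points) {
			oc->chosen_points = score;
			oc->chosen_memcg = iter;
		}
	}
	if (oc->chosen_memcg)
		css_get(&oc->chosen_memcg->css);
	rcu_read_unlock();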

[v11 6/6] mm, oom, docs: describe the cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f8216912df0..28429e62b0ea 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1043,6 +1044,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   If set, the OOM killer will consider the memory cgroup as an
+   indivisible memory consumer and compare it with other memory
+   consumers by its memory footprint.
+   If such a memory cgroup is selected as an OOM victim, all
+   processes belonging to it or its descendants will be killed.
+
+   This applies to system-wide OOM conditions and to reaching
+   the hard memory limit of the cgroup or any of its ancestors.
+   If the OOM condition happens in a descendant cgroup with its own
+   memory limit, the memory cgroup can't be considered
+   as an OOM victim, and the OOM killer will not kill all belonging
+   tasks.
+
+   Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000,
+   and will never kill such an unkillable task, even if memory.oom_group
+   is set.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1246,6 +1269,34 @@ to be accessed repeatedly by other cgroups, it may make 
sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer.
+It means that it treats cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for a memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are treated as indivisible
+memory consumers.
+
+By default, the OOM killer will kill the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
+the memory controller considers only cgroups belonging to the sub-tree
+of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it's compared
+with other leaf memory cgroups and cgroups with the oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer falls back to the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6
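
As a usage sketch for the documented knob (the cgroup path here is a
made-up example):

#include <fcntl.h>
#include <unistd.h>

/*
 * Mark a workload cgroup as an indivisible OOM victim: if it is
 * selected, all of its tasks (except those with oom_score_adj set
 * to -1000) are killed together.
 */
static int enable_oom_group(const char *path)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* e.g. enable_oom_group("/sys/fs/cgroup/workload/memory.oom_group"); */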



[v11 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Roman Gushchin
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which are not tolerant to such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all belonging tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of a system-wide OOM or
a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If the OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any tasks
outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed because
this has been a long established way to protect a particular process
from seeing an unexpected SIGKILL from the OOM killer. Ignoring this
user defined configuration might lead to data corruptions or other
misbehavior.

The default value is 0.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h | 17 +++
 mm/memcontrol.c| 75 +++---
 mm/oom_kill.c  | 49 +++---
 3 files changed, 127 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /*
+* Treat the sub-tree as an indivisible memory consumer,
+* kill all belonging tasks if the memory cgroup selected
+* as OOM victim.
+*/
+   bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct 
oom_control *oc)
 {
return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 191b70735f1f..d5acb278b11a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2776,19 +2776,51 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
*oc)
 {
-   struct mem_cgroup *iter;
+   struct mem_cgroup *iter, *group = NULL;
+   long group_score = 0;
 
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
 
/*
+* If OOM is memcg-wide, and the memcg has the oom_group flag set,
+* all tasks belonging to the memcg should be killed.
+* So, we mark the memcg as a victim.
+*/
+   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+   oc->chosen_memcg = oc->memcg;
+   css_get(&oc->chosen_memcg->css);
+   return;
+   }
+
+   /*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
+* Non-leaf oom_group cgroups accumulate the scores of descendant
+* leaf memory cgroups.
 */
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
 
+   /*
+* We don't consider non-leaf non-oom_group memory cgroups
+* as OOM victims.
+*/
+   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+   !mem_cgroup_oom_group(iter))
+   continue;
+
+   /*
+* If group is not set or we've run out of the group's sub-tree,
+* we should set group and reset group_score.
+*/
+   if (!group || group
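
The hunk continues as quoted in the v10 review later in this archive;
condensed, the group accumulation it implements looks roughly like the
sketch below (scoring helper as in patch 3; the exact bookkeeping may
differ):

	if (!group || group == root_mem_cgroup ||
	    !mem_cgroup_is_descendant(iter, group)) {
		/* Entered a new oom_group sub-tree (or a plain leaf). */
		group = iter;
		group_score = 0;
	}

	score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
	group_score += score;

	if (group_score > oc->chosen_points) {
		oc->chosen_points = group_score;
		oc->chosen_memcg = group;
	}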

[v11 1/6] mm, oom: refactor the oom_kill_process() function

2017-10-05 Thread Roman Gushchin
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children as
potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
the intention to re-use the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, or to play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 123 +++---
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f642a45b7f14..ccdb7d34cd13 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -845,68 +845,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-   struct task_struct *p = oc->chosen;
-   unsigned int points = oc->chosen_points;
-   struct task_struct *victim = p;
-   struct task_struct *child;
-   struct task_struct *t;
+   struct task_struct *p;
struct mm_struct *mm;
-   unsigned int victim_points = 0;
-   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
 
-   /*
-* If the task is already exiting, don't alarm the sysadmin or kill
-* its children or threads, just give it access to memory reserves
-* so it can die quickly
-*/
-   task_lock(p);
-   if (task_will_free_mem(p)) {
-   mark_oom_victim(p);
-   wake_oom_reaper(p);
-   task_unlock(p);
-   put_task_struct(p);
-   return;
-   }
-   task_unlock(p);
-
-   if (__ratelimit(&oom_rs))
-   dump_header(oc, p);
-
-   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-
-   /*
-* If any of p's children has a different mm and is eligible for kill,
-* the one with the highest oom_badness() score is sacrificed for its
-* parent.  This attempts to lose the minimal amount of work done while
-* still freeing memory.
-*/
-   read_lock(&tasklist_lock);
-   for_each_thread(p, t) {
-   list_for_each_entry(child, &t->children, sibling) {
-   unsigned int child_points;
-
-   if (process_shares_mm(child, p->mm))
-   continue;
-   /*
-* oom_badness() returns 0 if the thread is unkillable
-*/
-   child_points = oom_badness(child,
-   oc->memcg, oc->nodemask, oc->totalpages);
-   if (child_points > victim_points) {
-   put_task_struct(victim);
-   victim = child;
-   victim_points = child_points;
-   get_task_struct(victim);
-   }
-   }
-   }
-   read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -980,6 +924,69 @@ static void oom_kill_process(struct oom_control *oc, const 
char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+   struct task_struct *p = oc->chosen;
+   unsigned int points = oc->chosen_points;
+   struct task_struct *victim = p;
+   struct task_struct *child;
+   struct task_struct *t;
+   unsigned int victim_points = 0;
+   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   /*
+* If the task is already exiting, don't alarm the sysadmin or kill
+* its children or threads, just give it access to memory reserves
+* so it can die quickly
+*/
+   task_lock(p);
+   if (task_will_free_mem(p)) {
+   mark_oom_victim(p);
+   wake_oom_reaper(
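
After the split described in the changelog, the control flow reduces to
roughly this shape (a sketch, not the literal patch):

static void oom_kill_process(struct oom_control *oc, const char *message)
{
	struct task_struct *victim = oc->chosen;

	/* 1. Bail out if the chosen task is already exiting.
	 * 2. Print the ratelimited OOM report.
	 * 3. Possibly sacrifice a child with the highest oom_badness()
	 *    instead of the parent.
	 * 4. Hand the final victim to the reusable half:
	 */
	__oom_kill_process(victim);
}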

Re: [v10 4/6] mm, oom: introduce memory.oom_group

2017-10-05 Thread Roman Gushchin
On Thu, Oct 05, 2017 at 02:06:49PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:46:36, Roman Gushchin wrote:
> > The cgroup-aware OOM killer treats leaf memory cgroups as memory
> > consumption entities and performs the victim selection by comparing
> > them based on their memory footprint. Then it kills the biggest task
> > inside the selected memory cgroup.
> > 
> > But there are workloads, which are not tolerant to a such behavior.
> > Killing a random task may leave the workload in a broken state.
> > 
> > To solve this problem, memory.oom_group knob is introduced.
> > It will define, whether a memory group should be treated as an
> > indivisible memory consumer, compared by total memory consumption
> > with other memory consumers (leaf memory cgroups and other memory
> > cgroups with memory.oom_group set), and whether all belonging tasks
> > should be killed if the cgroup is selected.
> > 
> > If set on memcg A, it means that in case of system-wide OOM or
> > memcg-wide OOM scoped to A or any ancestor cgroup, all tasks,
> > belonging to the sub-tree of A will be killed. If OOM event is
> > scoped to a descendant cgroup (A/B, for example), only tasks in
> > that cgroup can be affected. OOM killer will never touch any tasks
> > outside of the scope of the OOM event.
> > 
> > Also, tasks with oom_score_adj set to -1000 will not be killed.
> 
> I would extend the last sentence with an explanation. What about the
> following:
> "
> Also, tasks with oom_score_adj set to -1000 will not be killed because
> this has been a long established way to protect a particular process
> from seeing an unexpected SIGKILL from the oom killer. Ignoring this
> user defined configuration might lead to data corruptions or other
> misbehavior.
> "

Added, thanks!

> 
> few mostly nit picks below but this looks good other than that. Once the
> fix mentioned in patch 3 is folded I will ack this.
> 
> [...]
> 
> >  static void select_victim_memcg(struct mem_cgroup *root, struct 
> > oom_control *oc)
> >  {
> > -   struct mem_cgroup *iter;
> > +   struct mem_cgroup *iter, *group = NULL;
> > +   long group_score = 0;
> >  
> > oc->chosen_memcg = NULL;
> > oc->chosen_points = 0;
> >  
> > /*
> > +* If OOM is memcg-wide, and the memcg has the oom_group flag set,
> > +* all tasks belonging to the memcg should be killed.
> > +* So, we mark the memcg as a victim.
> > +*/
> > +   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> 
> we have is_memcg_oom() helper which is esier to read and understand than
> the explicit oc->memcg check

It's defined in oom_kill.c and not exported, so I'm not sure.

> 
> > +   oc->chosen_memcg = oc->memcg;
> > +   css_get(&oc->chosen_memcg->css);
> > +   return;
> > +   }
> > +
> > +   /*
> >  * The oom_score is calculated for leaf memory cgroups (including
> >  * the root memcg).
> > +* Non-leaf oom_group cgroups accumulating score of descendant
> > +* leaf memory cgroups.
> >  */
> > rcu_read_lock();
> > for_each_mem_cgroup_tree(iter, root) {
> > long score;
> >  
> > +   /*
> > +* We don't consider non-leaf non-oom_group memory cgroups
> > +* as OOM victims.
> > +*/
> > +   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > +   continue;
> > +
> > +   /*
> > +* If group is not set or we've ran out of the group's sub-tree,
> > +* we should set group and reset group_score.
> > +*/
> > +   if (!group || group == root_mem_cgroup ||
> > +   !mem_cgroup_is_descendant(iter, group)) {
> > +   group = iter;
> > +   group_score = 0;
> > +   }
> > +
> 
> hmm, I thought you would go with a recursive oom_evaluate_memcg
> implementation that would result in a more readable code IMHO. It is
> true that we would traverse oom_group more times. But I do not expect
> we would have very deep memcg hierarchies in the majority of workloads
> and even if we did then this is a cold path which should focus on
> readability more than a performance. Also implementing
> mem_cgroup_iter_skip_subtree shouldn't be all that hard if this ever
> turns out a real problem.

I've tried to go this way, but I didn't like the result. Both these
loops will s

Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
On Thu, Oct 05, 2017 at 01:12:30PM +0200, Michal Hocko wrote:
> On Thu 05-10-17 11:27:07, Roman Gushchin wrote:
> > On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
> [...]
> > > Sorry about the confusion. There are two things. First, should we do a
> > > css_get on the newly selected memcg within the for loop when we still
> > > have a reference to it?
> > 
> > We're holding rcu_read_lock, it should be enough. We're bumping css counter
> > just before releasing rcu lock.
> 
> yes
> 
> > > 
> > > Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> > > return 0 for offlined memcgs. Maybe no need to call
> > > oom_evaluate_memcg() for offlined memcgs.
> > 
> > Sounds like a good optimization, which can be done on top of the current
> > patchset.
> 
> You could achive this by checking whether a memcg has tasks rather than
> explicitly checking for children memcgs as I've suggested already.

Using cgroup_has_tasks() will require additional locking, so I'm not sure
it's worth it.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> >

Hi, David!

> 
> It is, because it can quite clearly be a DoS, and was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.
> 
> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommiting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and its up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to made 
> based for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

Overcommit is real, but configuring the system in a way that makes
system-wide OOM happen often is a strange idea. As we all know, the system
can barely work adequately under global memory shortage: network packets are
dropped, latency is bad, weird kernel issues are revealed periodically, etc.
I do not see why you can't overcommit on deeper layers of the cgroup
hierarchy, avoiding system-wide OOM altogether.

> 
> I know exactly why earlier versions of this patchset iterated that usage 
> up the tree so you would pick from students, pick from this troublemaking 
> student, and then oom kill from his hierarchy.  Roman has made that point 
> himself.  My suggestion was to add userspace influence to it so that 
> enterprise users and users with business goals can actually define that we 
> really do want 80% of memory to be used by this process or this hierarchy, 
> it's in our best interest.

I'll repeat myself: I believe that there is a range of possible policies,
from completely flat (what Johannes suggested a few weeks ago) to very
hierarchical (as in v8), each with its pros and cons.
(Michal did provide a clear example of bad behavior of the hierarchical
approach.)

I assume that v10 is a good middle point, and it's good because it doesn't
prevent further development. Just for example, you can introduce a third state
of the oom_group knob, which will mean "evaluate as a whole, but do not kill
all". And that is what would solve your particular case, right?

> 
> Earlier iterations of this patchset did this, and did it correctly.  
> Userspace influence over the decisionmaking makes it a very powerful 
> combination because you _can_ specify what your goals are or choose to 
> leave the priorities as default so you can compare based solely on usage.  
> It was a beautiful solution to the problem.

I did, but then I agreed with Tejun's point that the proposed semantics would
limit us further. Really, oom_priorities do not guarantee the killing order
(remember the NUMA issues, as well as oom_score_adj), so in practice it can
even be reversed (e.g. a low-priority cgroup killed before a high-priority
one). We shouldn't make users rely on these priorities as more than hints to
the kernel. But the way they are defined doesn't allow changing anything;
it's too rigid.

Thanks!


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-05 Thread Roman Gushchin
On Wed, Oct 04, 2017 at 02:24:26PM -0700, Shakeel Butt wrote:
> >> > +   if (memcg_has_children(iter))
> >> > +   continue;
> >>
> >> && iter != root_mem_cgroup ?
> >
> > Oh, sure. I had a stupid bug in my test script, which prevented me from
> > catching this. Thanks!
> >
> > This should fix the problem.
> > --
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 2e82625bd354..b3848bce4c86 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup 
> > *root, struct oom_control *oc)
> >  * We don't consider non-leaf non-oom_group memory cgroups
> >  * as OOM victims.
> >  */
> > -   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
> > +   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
> > +   !mem_cgroup_oom_group(iter))
> > continue;
> 
> I think you are mixing the 3rd and 4th patch. The root_mem_cgroup
> check should be in 3rd while oom_group stuff should be in 4th.
>

Right. This "patch" should fix them both; sending two patches would
have been confusing. I'll split it before the final landing.

> 
> >>
> >> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
> >> end why not css_tryget_online() here and css_put for the previous
> >> selected one.
> >
> > Hm, why do we need to check this? I do not see, how we can choose
> > an OFFLINE memcg as a victim, tbh. Please, explain the problem.
> >
> 
> Sorry about the confusion. There are two things. First, should we do a
> css_get on the newly selected memcg within the for loop when we still
> have a reference to it?

We're holding rcu_read_lock(), which should be enough. We bump the css
counter just before releasing the rcu lock.

> 
> Second, for the OFFLINE memcg, you are right oom_evaluate_memcg() will
> return 0 for offlined memcgs. Maybe no need to call
> oom_evaluate_memcg() for offlined memcgs.

Sounds like a good optimization, which can be done on top of the current
patchset.

Thank you!
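
Condensed, the refcounting pattern under discussion is (as in
select_victim_memcg() in patch 3):

	rcu_read_lock();
	/* ... walk the memcg tree and pick oc->chosen_memcg ... */
	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
		css_get(&oc->chosen_memcg->css); /* pin before unlocking */
	rcu_read_unlock();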


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
On Wed, Oct 04, 2017 at 01:17:14PM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Roman Gushchin wrote:
> 
> > > > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct 
> > > > *victim)
> > > > struct mm_struct *mm;
> > > > bool can_oom_reap = true;
> > > >  
> > > > +   if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > > > +   victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > > > +   put_task_struct(victim);
> > > > +   return;
> > > > +   }
> > > > +
> > > > p = find_lock_task_mm(victim);
> > > > if (!p) {
> > > > put_task_struct(victim);
> > > 
> > > Is this necessary? The callers of this function use oom_badness() to
> > > find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.
> > 
> > It is. __oom_kill_process() is used to kill all processes belonging
> > to the selected memory cgroup, so we should perform these checks
> > to avoid killing unkillable processes.
> > 
> 
> That's only true after the next patch in the series which uses the 
> oom_kill_memcg_member() callback to kill processes for oom_group, correct?  
> Would it be possible to move this check to that patch so it's more 
> obvious?

Sure, no problems.


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
On Wed, Oct 04, 2017 at 12:48:03PM -0700, Shakeel Butt wrote:
> > +
> > +static void select_victim_memcg(struct mem_cgroup *root, struct 
> > oom_control *oc)
> > +{
> > +   struct mem_cgroup *iter;
> > +
> > +   oc->chosen_memcg = NULL;
> > +   oc->chosen_points = 0;
> > +
> > +   /*
> > +* The oom_score is calculated for leaf memory cgroups (including
> > +* the root memcg).
> > +*/
> > +   rcu_read_lock();
> > +   for_each_mem_cgroup_tree(iter, root) {
> > +   long score;
> > +
> > +   if (memcg_has_children(iter))
> > +   continue;
> 
> && iter != root_mem_cgroup ?

Oh, sure. I had a stupid bug in my test script, which prevented me from
catching this. Thanks!

This should fix the problem.
--
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2e82625bd354..b3848bce4c86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,7 +2807,8 @@ static void select_victim_memcg(struct mem_cgroup *root, 
struct oom_control *oc)
 * We don't consider non-leaf non-oom_group memory cgroups
 * as OOM victims.
 */
-   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
+   if (memcg_has_children(iter) && iter != root_mem_cgroup &&
+   !mem_cgroup_oom_group(iter))
continue;
 
/*
@@ -2820,7 +2821,7 @@ static void select_victim_memcg(struct mem_cgroup *root, 
struct oom_control *oc)
group_score = 0;
}
 
-   if (memcg_has_children(iter))
+   if (memcg_has_children(iter) && iter != root_mem_cgroup)
continue;
 
score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);

--

> 
> > +
> > +   score = oom_evaluate_memcg(iter, oc->nodemask, 
> > oc->totalpages);
> > +
> > +   /*
> > +* Ignore empty and non-eligible memory cgroups.
> > +*/
> > +   if (score == 0)
> > +   continue;
> > +
> > +   /*
> > +* If there are inflight OOM victims, we don't need
> > +* to look further for new victims.
> > +*/
> > +   if (score == -1) {
> > +   oc->chosen_memcg = INFLIGHT_VICTIM;
> > +   mem_cgroup_iter_break(root, iter);
> > +   break;
> > +   }
> > +
> 
> Shouldn't there be a CSS_ONLINE check? Also instead of css_get at the
> end why not css_tryget_online() here and css_put for the previous
> selected one.

Hm, why do we need to check this? I do not see how we can choose
an OFFLINE memcg as a victim, tbh. Please explain the problem.

Thank you!


Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
On Wed, Oct 04, 2017 at 03:27:20PM -0400, Johannes Weiner wrote:
> On Wed, Oct 04, 2017 at 04:46:35PM +0100, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest leaf memory cgroup
> > and kills the biggest task belonging to it. The following patches
> > will extend this functionality to consider non-leaf memory cgroups
> > as well, and also provide an ability to kill all tasks belonging
> > to the victim cgroup.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with leaf memory cgroups.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating it's oom_score: we define it as maximum
> > oom_score of the belonging tasks.
> > 
> > Signed-off-by: Roman Gushchin 
> > Cc: Michal Hocko 
> > Cc: Vladimir Davydov 
> > Cc: Johannes Weiner 
> > Cc: Tetsuo Handa 
> > Cc: David Rientjes 
> > Cc: Andrew Morton 
> > Cc: Tejun Heo 
> > Cc: kernel-t...@fb.com
> > Cc: cgro...@vger.kernel.org
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: linux...@kvack.org
> 
> This looks good to me.
> 
> Acked-by: Johannes Weiner 
> 
> I just have one question:
> 
> > @@ -828,6 +828,12 @@ static void __oom_kill_process(struct task_struct 
> > *victim)
> > struct mm_struct *mm;
> > bool can_oom_reap = true;
> >  
> > +   if (is_global_init(victim) || (victim->flags & PF_KTHREAD) ||
> > +   victim->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +   put_task_struct(victim);
> > +   return;
> > +   }
> > +
> > p = find_lock_task_mm(victim);
> > if (!p) {
> > put_task_struct(victim);
> 
> Is this necessary? The callers of this function use oom_badness() to
> find a victim, and that filters init, kthread, OOM_SCORE_ADJ_MIN.

It is. __oom_kill_process() is used to kill all processes belonging
to the selected memory cgroup, so we should perform these checks
to avoid killing unkillable processes.

Thanks!


[v10 3/6] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
Traditionally, the OOM killer is operating on a process level.
Under oom conditions, it finds a process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

To address these issues, the cgroup-aware OOM killer is introduced.

Under OOM conditions, it looks for the biggest leaf memory cgroup
and kills the biggest task belonging to it. The following patches
will extend this functionality to consider non-leaf memory cgroups
as well, and also provide an ability to kill all tasks belonging
to the victim cgroup.

The root cgroup is treated as a leaf memory cgroup, so its score
is compared with those of leaf memory cgroups.
Due to the memcg statistics implementation, a special algorithm
is used for estimating its oom_score: we define it as the maximum
oom_score of the belonging tasks.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  17 +
 include/linux/oom.h|  12 +++-
 mm/memcontrol.c| 172 +
 mm/oom_kill.c  |  76 +++-
 4 files changed, 257 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 69966c461d1c..75b63b68846e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -342,6 +343,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +486,8 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -744,6 +752,10 @@ static inline bool task_in_mem_cgroup(struct task_struct 
*task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -936,6 +948,11 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4ce39bc..ca78e2d5956e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -9,6 +9,13 @@
 #include  /* MMF_* */
 #include  /* VM_FAULT* */
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -39,7 +46,8 @@ struct oom_control {
 
/* Used by oom implementation, do not set */
unsigned long totalpages;
-   struct task_struct *chosen;
+   struct task_struct *chosen_task;
+   struct mem_cgroup *chosen_memcg;
unsigned long chosen_points;
 };
 
@@ -101,6 +109,8 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern int oom_evaluate_task(struct task_struct *task, void *arg);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b4de17a78dc1..79f30c281185 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2670,6 +2670,178 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
return ret;
 }
 
+static long memcg_oom_badness(struct mem_cgroup *memcg,
+ const nodemask_t *nodemask,
+ unsigned long totalpages)
+{
+   long points = 0;
+   int 

[v10 2/6] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-04 Thread Roman Gushchin
Implement mem_cgroup_scan_tasks() for the root memory cgroup, so that
the cgroup-aware OOM killer can use it to look for an OOM victim
task in the root memory cgroup.

The root memory cgroup is treated as a leaf cgroup, so only tasks
that belong directly to the root cgroup are iterated over.

This patch doesn't introduce any functional change as
mem_cgroup_scan_tasks() is never called for the root memcg.
This is preparatory work for the cgroup-aware OOM killer,
which will use this function to iterate over tasks belonging
to the root memcg.

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/memcontrol.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5f3a62887cf..b4de17a78dc1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -917,7 +917,8 @@ static void invalidate_reclaim_iterators(struct mem_cgroup 
*dead_memcg)
  * value, the function breaks the iteration loop and returns the value.
  * Otherwise, it will iterate over all tasks and return 0.
  *
- * This function must not be called for the root memory cgroup.
+ * If memcg is the root memory cgroup, this function will iterate only
+ * over tasks belonging directly to the root memory cgroup.
  */
 int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
  int (*fn)(struct task_struct *, void *), void *arg)
@@ -925,8 +926,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct mem_cgroup *iter;
int ret = 0;
 
-   BUG_ON(memcg == root_mem_cgroup);
-
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
struct task_struct *task;
@@ -935,7 +934,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
while (!ret && (task = css_task_iter_next(&it)))
ret = fn(task, arg);
css_task_iter_end(&it);
-   if (ret) {
+   if (ret || memcg == root_mem_cgroup) {
mem_cgroup_iter_break(memcg, iter);
break;
}
-- 
2.13.6



[v10 0/6] cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
This patchset makes the OOM killer cgroup-aware.

v10:
  - Separate oom_group introduction into a standalone patch
  - Stop propagating oom_group
  - Make oom_group delegatable
  - Do not try to kill the biggest task first,
if the whole cgroup is going to be killed
  - Stop caching oom_score on struct memcg, optimize victim
memcg selection
  - Drop dmesg printing (for further refining)
  - Small refactorings and comments added here and there
  - Rebase on top of mm tree

v9:
  - Change siblings-to-siblings comparison to a tree-wide search,
with related refactorings
  - Make oom_group implicitly propagate down the tree
  - Fix an issue with task selection in the root cgroup

v8:
  - Do not kill tasks with OOM_SCORE_ADJ -1000
  - Make the whole thing opt-in with cgroup mount option control
  - Drop oom_priority for further discussions
  - Kill the whole cgroup if oom_group is set and its
memory.max is reached
  - Update docs and commit messages

v7:
  - __oom_kill_process() drops reference to the victim task
  - oom_score_adj -1000 is always respected
  - Renamed oom_kill_all to oom_group
  - Dropped oom_prio range, converted from short to int
  - Added a cgroup v2 mount option to disable cgroup-aware OOM killer
  - Docs updated
  - Rebased on top of mmotm

v6:
  - Renamed oom_control.chosen to oom_control.chosen_task
  - Renamed oom_kill_all_tasks to oom_kill_all
  - Per-node NR_SLAB_UNRECLAIMABLE accounting
  - Several minor fixes and cleanups
  - Docs updated

v5:
  - Rebased on top of Michal Hocko's patches, which have changed the
way OOM victims get access to the memory
reserves. Dropped the corresponding part of this patchset
  - Separated the oom_kill_process() splitting into a standalone commit
  - Added debug output (suggested by David Rientjes)
  - Some minor fixes

v4:
  - Reworked per-cgroup oom_score_adj into oom_priority
(based on ideas by David Rientjes)
  - Tasks with oom_score_adj -1000 are never selected if
oom_kill_all_tasks is not set
  - Memcg victim selection code is reworked, and
synchronization is based on finding tasks with the OOM victim marker,
rather than on a global counter
  - Debug output is dropped
  - Refactored TIF_MEMDIE usage

v3:
  - Merged commits 1-4 into 6
  - Separated oom_score_adj logic and debug output into separate commits
  - Fixed swap accounting

v2:
  - Reworked victim selection based on feedback
from Michal Hocko, Vladimir Davydov and Johannes Weiner
  - "Kill all tasks" is now an opt-in option, by default
only one process will be killed
  - Added per-cgroup oom_score_adj
  - Refined oom score calculations, suggested by Vladimir Davydov
  - Converted to a patchset

v1:
  https://lkml.org/lkml/2017/5/18/969


Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org


Roman Gushchin (6):
  mm, oom: refactor the oom_kill_process() function
  mm: implement mem_cgroup_scan_tasks() for the root memory cgroup
  mm, oom: cgroup-aware OOM killer
  mm, oom: introduce memory.oom_group
  mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
  mm, oom, docs: describe the cgroup-aware OOM killer

 Documentation/cgroup-v2.txt |  51 +
 include/linux/cgroup-defs.h |   5 +
 include/linux/memcontrol.h  |  34 ++
 include/linux/oom.h |  12 ++-
 kernel/cgroup/cgroup.c  |  10 ++
 mm/memcontrol.c | 248 +++-
 mm/oom_kill.c   | 209 -
 7 files changed, 491 insertions(+), 78 deletions(-)

-- 
2.13.6



[v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +
 kernel/cgroup/cgroup.c  | 10 ++
 mm/memcontrol.c |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 */
CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+   /*
+* Enable cgroup-aware OOM killer.
+*/
+   CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned 
int *root_flags)
if (!strcmp(token, "nsdelegate")) {
*root_flags |= CGRP_ROOT_NS_DELEGATE;
continue;
+   } else if (!strcmp(token, "groupoom")) {
+   *root_flags |= CGRP_GROUP_OOM;
+   continue;
}
 
pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int 
root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+   if (root_flags & CGRP_GROUP_OOM)
+   cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+   else
+   cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, 
struct kernfs_root *kf_root
 {
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
seq_puts(seq, ",nsdelegate");
+   if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+   seq_puts(seq, ",groupoom");
return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1fcd6cc353d5..2e82625bd354 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2865,6 +2865,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
 
+   if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+   return false;
+
if (oc->memcg)
root = oc->memcg;
else
-- 
2.13.6



[v10 6/6] mm, oom, docs: describe the cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
Document the cgroup-aware OOM killer.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 Documentation/cgroup-v2.txt | 51 +
 1 file changed, 51 insertions(+)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 3f8216912df0..28429e62b0ea 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
5-2-1. Memory Interface Files
5-2-2. Usage Guidelines
5-2-3. Memory Ownership
+   5-2-4. OOM Killer
  5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
@@ -1043,6 +1044,28 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
 
+  memory.oom_group
+
+   A read-write single value file which exists on non-root
+   cgroups.  The default is "0".
+
+   If set, the OOM killer will consider the memory cgroup as an
+   indivisible memory consumer and compare it with other memory
+   consumers by its memory footprint.
+   If such a memory cgroup is selected as an OOM victim, all
+   processes belonging to it or its descendants will be killed.
+
+   This applies to system-wide OOM conditions and to reaching
+   the hard memory limit of the cgroup or any of its ancestors.
+   If the OOM condition happens in a descendant cgroup with its own
+   memory limit, the memory cgroup can't be considered
+   as an OOM victim, and the OOM killer will not kill all belonging
+   tasks.
+
+   Also, the OOM killer respects the /proc/pid/oom_score_adj value -1000,
+   and will never kill such an unkillable task, even if memory.oom_group
+   is set.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1246,6 +1269,34 @@ to be accessed repeatedly by other cgroups, it may make sense to use
 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
 belonging to the affected files to ensure correct memory ownership.
 
+OOM Killer
+~~~~~~~~~~
+
+The cgroup v2 memory controller implements a cgroup-aware OOM killer.
+This means that it treats cgroups as first-class OOM entities.
+
+Under OOM conditions the memory controller tries to make the best
+choice of a victim, looking for the memory cgroup with the largest
+memory footprint, considering leaf cgroups and cgroups with the
+memory.oom_group option set, which are treated as indivisible
+memory consumers.
+
+By default, the OOM killer kills the biggest task in the selected
+memory cgroup. A user can change this behavior by enabling
+the per-cgroup memory.oom_group option. If set, it causes
+the OOM killer to kill all processes attached to the cgroup,
+except processes with oom_score_adj set to -1000.
+
+This affects both system-wide and cgroup-wide OOMs. For a cgroup-wide
+OOM the memory controller considers only cgroups belonging to the
+sub-tree of the OOM'ing cgroup.
+
+The root cgroup is treated as a leaf memory cgroup, so it is compared
+with other leaf memory cgroups and with cgroups that have the
+oom_group option set.
+
+If there are no cgroups with the memory controller enabled,
+the OOM killer falls back to the "traditional" process-based approach.
+
 
 IO
 --
-- 
2.13.6
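
To accompany the documentation above, a hedged usage sketch (editorial
addition; the cgroup path "workload" is hypothetical):

/* set_oom_group.c -- illustrative only: opt a cgroup into whole-group
 * OOM kills by writing "1" to its memory.oom_group file, as described
 * in the text above.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/workload/memory.oom_group", "w");

	if (!f) {
		perror("memory.oom_group");
		return 1;
	}
	fputs("1\n", f);
	return fclose(f) ? 1 : 0;
}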



[v10 4/6] mm, oom: introduce memory.oom_group

2017-10-04 Thread Roman Gushchin
The cgroup-aware OOM killer treats leaf memory cgroups as memory
consumption entities and performs the victim selection by comparing
them based on their memory footprint. Then it kills the biggest task
inside the selected memory cgroup.

But there are workloads which are not tolerant of such behavior.
Killing a random task may leave the workload in a broken state.

To solve this problem, the memory.oom_group knob is introduced.
It defines whether a memory cgroup should be treated as an
indivisible memory consumer, compared by total memory consumption
with other memory consumers (leaf memory cgroups and other memory
cgroups with memory.oom_group set), and whether all of its tasks
should be killed if the cgroup is selected.

If set on memcg A, it means that in case of a system-wide OOM or
a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks
belonging to the sub-tree of A will be killed. If an OOM event is
scoped to a descendant cgroup (A/B, for example), only tasks in
that cgroup can be affected. The OOM killer will never touch any
tasks outside of the scope of the OOM event.

Also, tasks with oom_score_adj set to -1000 will not be killed.

The default value is 0.
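
To illustrate the scoping rules above (an editorial sketch, not part
of the patch):

  root
  └─ A      memory.max set, memory.oom_group=1
     └─ B   memory.max set, memory.oom_group=0

  - A system-wide OOM, or an OOM on A's memory.max: all tasks in
    A and A/B are killed (except oom_score_adj == -1000 tasks).
  - An OOM on A/B's memory.max: the event is scoped to A/B, so
    A's oom_group does not apply; only the biggest task in A/B
    is killed.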

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h | 17 +++
 mm/memcontrol.c| 74 +++---
 mm/oom_kill.c  | 38 +---
 3 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 75b63b68846e..84ac10d7e67d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,13 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /*
+* Treat the sub-tree as an indivisible memory consumer,
+* kill all belonging tasks if the memory cgroup is
+* selected as an OOM victim.
+*/
+   bool oom_group;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -488,6 +495,11 @@ bool mem_cgroup_oom_synchronize(bool wait);
 
 bool mem_cgroup_select_oom_victim(struct oom_control *oc);
 
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return memcg->oom_group;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -953,6 +965,11 @@ static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 {
return false;
 }
+
+static inline bool mem_cgroup_oom_group(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79f30c281185..1fcd6cc353d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2776,19 +2776,50 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
-   struct mem_cgroup *iter;
+   struct mem_cgroup *iter, *group = NULL;
+   long group_score = 0;
 
oc->chosen_memcg = NULL;
oc->chosen_points = 0;
 
/*
+* If OOM is memcg-wide, and the memcg has the oom_group flag set,
+* all tasks belonging to the memcg should be killed.
+* So, we mark the memcg as a victim.
+*/
+   if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
+   oc->chosen_memcg = oc->memcg;
+   css_get(&oc->chosen_memcg->css);
+   return;
+   }
+
+   /*
 * The oom_score is calculated for leaf memory cgroups (including
 * the root memcg).
+* Non-leaf oom_group cgroups accumulate the scores of
+* descendant leaf memory cgroups.
 */
rcu_read_lock();
for_each_mem_cgroup_tree(iter, root) {
long score;
 
+   /*
+* We don't consider non-leaf non-oom_group memory cgroups
+* as OOM victims.
+*/
+   if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter))
+   continue;
+
+   /*
+* If group is not set or we've run out of the group's sub-tree,
+* we should set group and reset group_score.
+*/
+   if (!group || group == root_mem_cgroup ||
+   !mem_cgroup_is_descendant(iter, group)) {
+   group = iter;
+   group_score = 0;
+   }
+
if (memcg_has_children(iter))
continue;
 
@@ -2810,9 +2841,11 @@ static v
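
A worked example of the accumulation logic above (editorial sketch;
for_each_mem_cgroup_tree() walks the hierarchy top-down, and each leaf
score is assumed to be added to the enclosing group's group_score, per
the comment in the hunk):

  root
  ├─ G        oom_group=1, has children
  │  ├─ G/a   leaf
  │  └─ G/b   leaf
  └─ H        leaf

  walk: G    -> non-leaf with oom_group: becomes "group", score 0
        G/a  -> leaf inside G: its score accumulates into G
        G/b  -> leaf inside G: its score accumulates into G
        H    -> outside G's sub-tree: "group" resets to H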

[v10 1/6] mm, oom: refactor the oom_kill_process() function

2017-10-04 Thread Roman Gushchin
The oom_kill_process() function consists of two logical parts:
the first one is responsible for considering the task's children as
potential victims and printing the debug information.
The second half is responsible for sending SIGKILL to all
tasks sharing the mm struct with the given victim.

This commit splits the oom_kill_process() function with
an intention to re-use the second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks
belonging to the victim cgroup. We don't need to print
the debug information for each task, or play
with task selection (considering the task's children),
so we can't use the existing oom_kill_process().

Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Andrew Morton 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 mm/oom_kill.c | 123 +++---
 1 file changed, 65 insertions(+), 58 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e284810b9851..1e7b8a27e6cc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -822,68 +822,12 @@ static bool task_will_free_mem(struct task_struct *task)
return ret;
 }
 
-static void oom_kill_process(struct oom_control *oc, const char *message)
+static void __oom_kill_process(struct task_struct *victim)
 {
-   struct task_struct *p = oc->chosen;
-   unsigned int points = oc->chosen_points;
-   struct task_struct *victim = p;
-   struct task_struct *child;
-   struct task_struct *t;
+   struct task_struct *p;
struct mm_struct *mm;
-   unsigned int victim_points = 0;
-   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
bool can_oom_reap = true;
 
-   /*
-* If the task is already exiting, don't alarm the sysadmin or kill
-* its children or threads, just give it access to memory reserves
-* so it can die quickly
-*/
-   task_lock(p);
-   if (task_will_free_mem(p)) {
-   mark_oom_victim(p);
-   wake_oom_reaper(p);
-   task_unlock(p);
-   put_task_struct(p);
-   return;
-   }
-   task_unlock(p);
-
-   if (__ratelimit(&oom_rs))
-   dump_header(oc, p);
-
-   pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-   message, task_pid_nr(p), p->comm, points);
-
-   /*
-* If any of p's children has a different mm and is eligible for kill,
-* the one with the highest oom_badness() score is sacrificed for its
-* parent.  This attempts to lose the minimal amount of work done while
-* still freeing memory.
-*/
-   read_lock(&tasklist_lock);
-   for_each_thread(p, t) {
-   list_for_each_entry(child, &t->children, sibling) {
-   unsigned int child_points;
-
-   if (process_shares_mm(child, p->mm))
-   continue;
-   /*
-* oom_badness() returns 0 if the thread is unkillable
-*/
-   child_points = oom_badness(child,
-   oc->memcg, oc->nodemask, oc->totalpages);
-   if (child_points > victim_points) {
-   put_task_struct(victim);
-   victim = child;
-   victim_points = child_points;
-   get_task_struct(victim);
-   }
-   }
-   }
-   read_unlock(&tasklist_lock);
-
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
@@ -957,6 +901,69 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 }
 #undef K
 
+static void oom_kill_process(struct oom_control *oc, const char *message)
+{
+   struct task_struct *p = oc->chosen;
+   unsigned int points = oc->chosen_points;
+   struct task_struct *victim = p;
+   struct task_struct *child;
+   struct task_struct *t;
+   unsigned int victim_points = 0;
+   static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   /*
+* If the task is already exiting, don't alarm the sysadmin or kill
+* its children or threads, just give it access to memory reserves
+* so it can die quickly
+*/
+   task_lock(p);
+   if (task_will_free_mem(p)) {
+   mark_oom_victim(p);
+   wake_oom_reaper(

Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

2017-10-04 Thread Roman Gushchin
On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.

Not sure I understand why. Setting memory.oom_group on a child memcg
has absolutely no meaning until memory.max is also set. In case of an
OOM scoped to the parent memcg or above, the parent's value defines
the behavior.

If a user decides to create a separate OOM domain (by setting the hard
memory limit), he/she can also make a decision on how the OOM event
should be handled.

Thanks!


Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

2017-10-03 Thread Roman Gushchin
On Tue, Oct 03, 2017 at 04:22:46PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 15:08:41, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> [...]
> > > I guess we want to inherit the value on the memcg creation but I agree
> > > that enforcing parent setting is weird. I will think about it some more
> > > but I agree that it is saner to only enforce per memcg value.
> > 
> > I'm not against, but we should come up with a good explanation, why we're
> > inheriting it; or not inherit.
> 
> Inheriting sounds like a less surprising behavior. Once you opt in for
> oom_group you can expect that descendants are going to assume the same
> unless they explicitly state otherwise.
> 
> [...]
> > > > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct 
> > > > > > oom_control *oc, const char *message)
> > > > > > __oom_kill_process(victim);
> > > > > >  }
> > > > > >  
> > > > > > +static int oom_kill_memcg_member(struct task_struct *task, void 
> > > > > > *unused)
> > > > > > +{
> > > > > > +   if (!tsk_is_oom_victim(task)) {
> > > > > 
> > > > > How can this happen?
> > > > 
> > > > We do start with killing the largest process, and then iterate over all 
> > > > tasks
> > > > in the cgroup. So, this check is required to avoid killing tasks which 
> > > > are
> > > > already in the termination process.
> > > 
> > > Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?
> > 
> > No, just tsk_is_oom_victim. We are killing the biggest task, and then
> > _all_ tasks. This is a way to skip the biggest task, and not kill it
> > again.
> 
> OK, I have missed that part. Why are we doing that actually? Why don't
> we simply do 
>   /* If oom_group flag is set, kill all belonging tasks */
>   if (mem_cgroup_oom_group(oc->chosen_memcg))
>   mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> NULL);
> 
> we are going to kill all the tasks anyway.

Well, the idea behind it was that killing the biggest process first gives
us better chances to get out of a global memory shortage and guarantees
forward progress. I can drop it, if it is considered excessive.
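
In code terms, the flow being defended looks roughly like this
(editorial condensation of the quoted patch, reusing its helpers;
"biggest_task" is a stand-in for the task picked by the regular
selection logic):

	__oom_kill_process(biggest_task);	/* marks it an OOM victim */
	/* then sweep the rest of the cgroup; oom_kill_memcg_member()
	 * skips tasks tsk_is_oom_victim() already marked, so the
	 * biggest task is not killed twice. */
	if (mem_cgroup_oom_group(oc->chosen_memcg))
		mem_cgroup_scan_tasks(oc->chosen_memcg,
				      oom_kill_memcg_member, NULL);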


Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

2017-10-03 Thread Roman Gushchin
On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> [...]
> > > Wrt. to the implicit inheritance you brought up in a separate email
> > > thread [1]. Let me quote
> > > : after some additional thinking I don't think anymore that implicit
> > > : propagation of oom_group is a good idea.  Let me explain: assume we
> > > : have memcg A with memory.max and memory.oom_group set, and nested
> > > : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> > > : A/B. What is an expected system behavior?
> > > : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> > > : We really shouldn't touch processes which are not belonging to A/B.
> > > : That means we should either kill the biggest process in A/B, either all
> > > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > > : for this decision. It's strange to make the depend on 
> > > A/memory.oom_group, IMO.
> > > : It really makes no sense, and makes oom_group knob really hard to 
> > > describe.
> > > : 
> > > : Also, after some off-list discussion, we've realized that 
> > > memory.oom_knob
> > > : should be delegatable. The workload should have control over it to 
> > > express
> > > : dependency between processes.
> > > 
> > > OK, I have asked about this already but I am not sure the answer was
> > > very explicit. So let me ask again. When exactly a subtree would
> > > disagree with the parent on oom_group? In other words when do we want a
> > > different cleanup based on the OOM root? I am not saying this is wrong
> > > I am just curious about a practical example.
> > 
> > Well, I do not have a practical example right now, but it's against the 
> > logic.
> > Any OOM event has a scope, and group_oom knob is applied for OOM events
> > scoped to the cgroup or any ancestors (including system as a whole).
> > So, applying it implicitly to OOM scoped to descendant cgroups makes no 
> > sense.
> > It's a strange configuration limitation, and I do not see any benefits:
> > it doesn't provide any new functionality or guarantees.
> 
> Well, I guess I agree. I was merely interested about consequences when
> the oom behavior is different depending on which layer it happens. Does
> it make sense to cleanup the whole hierarchy while any subtree would
> kill a single task if the oom happened there?

By setting or not setting the oom_group knob a user expresses
the readiness to handle the OOM themselves, e.g. by looking at cgroup
events, restarting killed tasks, etc.

If a workload is complex and has some sub-parts with their own memory
constraints, it's quite possible that it's ready to restart these parts,
but not to recover from a random process killed by the global OOM. This
is actually a proper replacement for setting oom_score_adj: let's say
there is memcg A, which contains some control stuff in A/C, and several
sub-workloads A/W1, A/W2, etc. In case of a global OOM caused by system
misconfiguration, or, say, a memory leak in the control stuff, it makes
perfect sense to kill A as a whole, so we can set A/memory.oom_group
to 1.

But if there is a memory shortage in one of the workers (A/W1, for
instance), it's quite possible that killing everything is excessive.
So, a user has the freedom to decide the proper way to handle an OOM.
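
Sketching the layout described above (editorial illustration):

  A          memory.oom_group=1  -> a global OOM kills the whole job
  ├─ A/C     control processes
  ├─ A/W1    memory.max set      -> own OOM domain; a local OOM
  └─ A/W2    memory.max set         affects only that worker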

>  
> > Even if we don't have practical examples, we should build something less
> > surprising for a user, and I don't understand why oom_group should be 
> > inherited.
> 
> I guess we want to inherit the value on the memcg creation but I agree
> that enforcing parent setting is weird. I will think about it some more
> but I agree that it is saner to only enforce per memcg value.

I'm not against it, but we should come up with a good explanation of
why we're inheriting it, or why not.

>  
> > > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > > 
> > > > The root cgroup is treated as a leaf memory cgroup, so it's score
> > > > is compared with other leaf and oom_group memory cgroups.
> > > > The oom_group option is not supported for the root cgroup.
> > > > Due to memcg statistics implementation a special algorithm
> > > > is used for estimating root cgroup oom_score: we define it
> > > > as maximum oom_score of the belonging tasks.
> > > 
> > > [1] 
> > > http://lkml.ker

Re: [v9 2/5] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup

2017-10-03 Thread Roman Gushchin
On Tue, Oct 03, 2017 at 12:49:39PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:33, Roman Gushchin wrote:
> > Implement mem_cgroup_scan_tasks() functionality for the root
> > memory cgroup to use this function for looking for a OOM victim
> > task in the root memory cgroup by the cgroup-ware OOM killer.
> > 
> > The root memory cgroup should be treated as a leaf cgroup,
> > so only tasks which are directly belonging to the root cgroup
> > should be iterated over.
> 
> I would only add that this patch doesn't introduce any functionally
> visible change because we never trigger oom killer with the root memcg
> as the root of the hierarchy. So this is just a preparatory work for
> later changes.

Sure, thanks!

> 
> > Signed-off-by: Roman Gushchin 
> > Cc: Michal Hocko 
> > Cc: Vladimir Davydov 
> > Cc: Johannes Weiner 
> > Cc: Tetsuo Handa 
> > Cc: David Rientjes 
> > Cc: Andrew Morton 
> > Cc: Tejun Heo 
> > Cc: kernel-t...@fb.com
> > Cc: cgro...@vger.kernel.org
> > Cc: linux-doc@vger.kernel.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: linux...@kvack.org
> 
> Acked-by: Michal Hocko 


Re: [v9 4/5] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer

2017-10-03 Thread Roman Gushchin
On Tue, Oct 03, 2017 at 01:50:36PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:35, Roman Gushchin wrote:
> > Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> > OOM killer. If not set, the OOM selection is performed in
> > a "traditional" per-process way.
> > 
> > The behavior can be changed dynamically by remounting the cgroupfs.
> 
> I do not have a strong preference about this. I would just be worried
> that it is usually systemd which tries to own the whole hierarchy

I actually like this fact.

It gives us the opportunity to change the default behavior for most
users at the point when we're sure the new behavior is better, while
at the same time keeping full compatibility at the kernel level.
With the growing popularity of memory cgroups, I don't think that
hiding this functionality behind a boot option makes any sense. It's
just not the type of feature that should be hidden.


Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

2017-10-03 Thread Roman Gushchin
On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> > Traditionally, the OOM killer is operating on a process level.
> > Under oom conditions, it finds a process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit well the system with many running
> > containers:
> > 
> > 1) There is no fairness between containers. A small container with
> > few large processes will be chosen over a large one with huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in a case of a system-wide OOM.
> > 
> > To address these issues, the cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it looks for the biggest memory consumer:
> > a leaf memory cgroup or a memory cgroup with the memory.oom_group
> > option set. Then it kills either a task with the biggest memory
> > footprint, either all belonging tasks, if memory.oom_group is set.
> > If a cgroup has memory.oom_group set, all descendant cgroups
> > implicitly inherit the memory.oom_group setting.
> 
> I think it would be better to separate oom_group into its own patch.
> So this patch would just add the cgroup awareness and oom_group will
> build on top of that.

Sure, will do.

> 
> Wrt. to the implicit inheritance you brought up in a separate email
> thread [1]. Let me quote
> : after some additional thinking I don't think anymore that implicit
> : propagation of oom_group is a good idea.  Let me explain: assume we
> : have memcg A with memory.max and memory.oom_group set, and nested
> : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> : A/B. What is an expected system behavior?
> : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> : We really shouldn't touch processes which are not belonging to A/B.
> : That means we should either kill the biggest process in A/B, either all
> : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> : for this decision. It's strange to make the depend on A/memory.oom_group, 
> IMO.
> : It really makes no sense, and makes oom_group knob really hard to describe.
> : 
> : Also, after some off-list discussion, we've realized that memory.oom_knob
> : should be delegatable. The workload should have control over it to express
> : dependency between processes.
> 
> OK, I have asked about this already but I am not sure the answer was
> very explicit. So let me ask again. When exactly a subtree would
> disagree with the parent on oom_group? In other words when do we want a
> different cleanup based on the OOM root? I am not saying this is wrong
> I am just curious about a practical example.

Well, I do not have a practical example right now, but it goes against
the logic. Any OOM event has a scope, and the oom_group knob is applied
to OOM events scoped to the cgroup or any of its ancestors (including
the system as a whole). So, applying it implicitly to OOMs scoped to
descendant cgroups makes no sense. It's a strange configuration
limitation, and I do not see any benefits: it doesn't provide any new
functionality or guarantees.

Even if we don't have practical examples, we should build something
less surprising for the user, and I don't understand why oom_group
should be inherited.

> 
> > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > 
> > The root cgroup is treated as a leaf memory cgroup, so it's score
> > is compared with other leaf and oom_group memory cgroups.
> > The oom_group option is not supported for the root cgroup.
> > Due to memcg statistics implementation a special algorithm
> > is used for estimating root cgroup oom_score: we define it
> > as maximum oom_score of the belonging tasks.
> 
> [1] 
> http://lkml.kernel.org/r/20171002124712.ga17...@castle.dhcp.thefacebook.com
> 
> [...]
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask,
> > + unsigned long totalpages)
> > +{
> > +   long points = 0;
> > +   int nid;
> > +   pg_data_t *pgdat;
> > +
> > +   /*
> > +* We don't have necessary stats for the root memcg,
> > +* so we define it's oom_score as the maximum oom_score
> > +* of the belonging tasks.
> > +*/
> 
> Why not a su
