Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Thu, Jan 29 2015, Tejun Heo wrote:

> Hello,
>
> Since the cgroup writeback patchset[1] has been posted, several people
> have questioned whether the complexity of allowing an inode to be
> dirtied against multiple cgroups is necessary for the purpose of
> writeback, and it is true that a significant amount of complexity (note
> that bdi still needs to be split, so it's still not trivial) can be
> removed if we assume that an inode always belongs to one cgroup for
> the purpose of writeback.
>
> However, as mentioned before, this issue is directly linked to whether
> memcg needs to track the memory ownership per-page.  If there are
> valid use cases where the pages of an inode must be tracked to be
> owned by different cgroups, cgroup writeback must be able to handle
> that situation properly.  If there are no such cases, the cgroup
> writeback support can be simplified, but again we should put memcg on
> the same cadence and enforce per-inode (or per-anon_vma) ownership
> from the beginning.  The conclusion can be either way - per-page or
> per-inode - but both memcg and blkcg must be looking at the same
> picture.  Deviating them is highly likely to lead to long-term issues
> forcing us to look at this again anyway, only with far more baggage.
>
> One thing to note is that the per-page tracking which is currently
> employed by memcg seems to have been born more out of convenience
> than requirements of any actual use cases.  Per-page ownership makes
> sense iff pages of an inode have to be associated with different
> cgroups - IOW, when an inode is accessed by multiple cgroups; however,
> currently, memcg assigns a page to its instantiating memcg and leaves
> it at that till the page is released.  This means that if a page is
> instantiated by one cgroup and then subsequently accessed only by a
> different cgroup, whether the page's charge gets moved to the cgroup
> which is actively using it is purely incidental.  If the page gets
> reclaimed and released at some point, it'll be moved.  If not, it
> won't.
>
> AFAICS, the only case where the current per-page accounting works
> properly is when disjoint sections of an inode are used by different
> cgroups, and the whole thing hinges on whether this use case justifies
> all the added overhead, including the page->mem_cgroup pointer and the
> extra complexity in the writeback layer.  FWIW, I'm doubtful.
> Johannes, Michal, Greg, what do you guys think?
>
> If the above use case - a huge file being actively accessed disjointly
> by multiple cgroups - isn't significant enough, and there aren't other
> use cases that I missed which can benefit from the per-page tracking
> that's currently implemented, it'd be logical to switch to per-inode
> (or per-anon_vma or per-slab) ownership tracking.  For the short term,
> even just adding extra ownership information to those containing
> objects and inheriting it into page->mem_cgroup could work, although
> it'd definitely be beneficial to eventually get rid of
> page->mem_cgroup.
>
> As with per-page, when the ownership terminates is debatable w/
> per-inode tracking.  Also, supporting some form of shared accounting
> across different cgroups may be useful (e.g. a shared library's memory
> being equally split among everyone who accesses it); however, these
> aren't likely to be major, and trying to do something smart may affect
> other use cases adversely, so it'd probably be best to just keep it
> dumb and clear the ownership when the inode loses all pages (a cgroup
> can disown such an inode through FADV_DONTNEED if necessary).
>
> What do you guys think?  If making memcg track ownership at per-inode
> level, even for just the unified hierarchy, is the direction we can
> take, I'll go ahead and simplify the cgroup writeback patchset.
>
> Thanks.

I find simplification appealing.  But I'm not sure it will fly, if for
no other reason than the shared accounting.  I'm ignoring intentional
sharing, used by carefully crafted apps, and just thinking about
incidental sharing (e.g. libc).  Example:

$ mkdir small
$ echo 1M > small/memory.limit_in_bytes
$ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &

$ mkdir big
$ echo 10G > big/memory.limit_in_bytes
$ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &

Assuming big/mlockall_database mlocks all of libc, then it will oom
kill the small memcg because libc is owned by small due to it having
touched libc first.  It'd be hard to figure out what small did wrong to
deserve the oom kill.

FWIW we've been using memcg writeback where inodes have a memcg
writeback owner.  Once multiple memcgs write to an inode, the inode
becomes writeback-shared, which makes it more likely to be written.
Once cleaned, the inode is then again able to be privately owned:
https://lkml.org/lkml/2011/8/17/200
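For reference, the "disown via FADV_DONTNEED" escape hatch Tejun mentions
would look roughly like the sketch below from userspace.  The file path is
made up for illustration, and whether dropping the cache actually transfers
ownership depends on the per-inode tracking design being discussed, not on
this call itself.

#define _XOPEN_SOURCE 600
/*
 * Minimal sketch (not from the thread): drop an inode's page cache with
 * posix_fadvise(POSIX_FADV_DONTNEED) so another cgroup can become the
 * owner on the next touch.  "/shared/libfoo.so" is a hypothetical path.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/shared/libfoo.so", O_RDONLY);
	int err;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* offset 0, len 0 means "the whole file" */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
	close(fd);
	return 0;
}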
Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
On Thu, Jan 08 2015, Johannes Weiner wrote:

> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
>
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
>
> The control files are thus:
>
> - memory.current shows the current consumption of the cgroup and its
>   descendants, in bytes.
>
> - memory.low configures the lower end of the cgroup's expected
>   memory consumption range.  The kernel considers memory below that
>   boundary to be a reserve - the minimum that the workload needs in
>   order to make forward progress - and generally avoids reclaiming
>   it, unless there is an imminent risk of entering an OOM situation.

So this is a try-hard, but no-promises, interface.  No complaints.  But
I assume that an eventual extension is a more rigid memory.min which
specifies a minimum working set under which a container would prefer an
oom kill to thrashing.
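As an illustration only (not part of the patch), configuring such a reserve
from userspace amounts to writing a byte count into the cgroup's memory.low
file.  The sketch below assumes the unified hierarchy is mounted at
/sys/fs/cgroup and that a group named "job" already exists; both the mount
point and the group name are assumptions for the example.

/*
 * Hypothetical example: give the "job" cgroup a 256M reclaim reserve by
 * writing to its memory.low file.  Paths and sizes are made up.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/job/memory.low", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* values are in bytes; 256M reserve for this example */
	fprintf(f, "%llu\n", 256ULL << 20);
	fclose(f);
	return 0;
}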
[PATCH] memcg: remove extra newlines from memcg oom kill log
Commit e61734c55c24 ("cgroup: remove cgroup->name") added two extra
newlines to memcg oom kill log messages.  This makes dmesg hard to read
and parse.  The issue affects 3.15+.  Example:

  Task in /t                              <<< extra #1
   killed as a result of limit of /t
                                          <<< extra #2
  memory: usage 102400kB, limit 102400kB, failcnt 274712

Remove the extra newlines from memcg oom kill messages, so the messages
look like:

  Task in /t killed as a result of limit of /t
  memory: usage 102400kB, limit 102400kB, failcnt 240649

Fixes: e61734c55c24 ("cgroup: remove cgroup->name")
Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 851924fa5170..683b4782019b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1477,9 +1477,9 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 
 	pr_info("Task in ");
 	pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
-	pr_info(" killed as a result of limit of ");
+	pr_cont(" killed as a result of limit of ");
 	pr_cont_cgroup_path(memcg->css.cgroup);
-	pr_info("\n");
+	pr_cont("\n");
 
 	rcu_read_unlock();
-- 
2.2.0.rc0.207.ga3a616c
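Background on why the fix uses pr_cont(): roughly speaking, each pr_info()
call emits a fresh log record with its own prefix (hence the stray line
breaks), while pr_cont()/KERN_CONT appends to the record started by the
previous printk.  A minimal sketch of the pattern, not taken from the
memcg source:

/*
 * Illustrative only: building one log line from several calls.
 * pr_info() starts a new record; pr_cont() continues the current one.
 */
#include <linux/printk.h>

static void print_example_path(void)
{
	pr_info("Task in ");		/* starts the line            */
	pr_cont("/some/cgroup");	/* appended to the same line  */
	pr_cont(" killed\n");		/* finishes the line          */
}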
[PATCH] memcg: add BUILD_BUG_ON() for string tables
Use BUILD_BUG_ON() to compile assert that memcg string tables are in
sync with corresponding enums.

There aren't currently any issues with these tables.  This is just
defensive.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ef91e856c7e4..8d1ca6c55480 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3699,6 +3699,10 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 	struct mem_cgroup *mi;
 	unsigned int i;
 
+	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_stat_names) !=
+		     MEM_CGROUP_STAT_NSTATS);
+	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_events_names) !=
+		     MEM_CGROUP_EVENTS_NSTATS);
 	BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 
 	for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-- 
2.2.0.rc0.207.ga3a616c
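The general pattern being protected here, sketched with made-up names
rather than the real memcg tables: a string table whose length must match
its enum, with BUILD_BUG_ON() turning any future mismatch into a build
failure.

/* Illustrative names only; not the actual memcg definitions. */
enum demo_stat {
	DEMO_STAT_CACHE,
	DEMO_STAT_RSS,
	DEMO_STAT_NSTATS,	/* must stay last */
};

static const char * const demo_stat_names[] = {
	"cache",
	"rss",
};

static void demo_check(void)
{
	/* Fails to compile if someone adds an enum entry but no string. */
	BUILD_BUG_ON(ARRAY_SIZE(demo_stat_names) != DEMO_STAT_NSTATS);
}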
Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd
On Mon, Nov 17 2014, Greg Thelen wrote:

[...]

> Given that bss and brk are nobits (i.e. only ALLOC) sections, does
> file_offset make sense as a load address.  This fails with gold:
>
> $ git checkout v3.18-rc5
> $ make    # with gold
> [...]
> .bss and .brk lack common file offset
> .bss and .brk lack common file offset
> .bss and .brk lack common file offset
> .bss and .brk lack common file offset
>   MKPIGGY arch/x86/boot/compressed/piggy.S
> Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size
> make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1
> make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [bzImage] Error 2

[...]

I just saw http://www.spinics.net/lists/kernel/msg1869328.html which
fixes things for me.  Sorry for the noise.
Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd
On Fri, Oct 31 2014, Junjie Mao wrote:

> When choosing a random address, the current implementation does not
> take into account the reserved space for .bss and .brk sections.  Thus
> the relocated kernel may overlap other components in memory.  Here is
> an example of the overlap from a x86_64 kernel in qemu (the ranges of
> physical addresses are presented):
>
>                 Physical Address
>
>     0x0fe0       --+--------------------+  <-- randomized base
>                   /  |  relocated kernel  |
>       vmlinux.bin    | (from vmlinux.bin) |
>     0x1336d000  (an ELF file) +-----------+--
>                   \  |                    |  \
>     0x1376d870     --+                    |   |
>                      |    relocs table    |   |
>     0x13c1c2a8       +--------------------+   | .bss and .brk
>                      |                    |   |
>     0x13ce6000       +--------------------+   |
>                      |                    |  /
>     0x13f77000       |       initrd       |--
>                      |                    |
>     0x13fef374       +--------------------+
>
> The initrd image will then be overwritten by the memset during early
> initialization:
>
> [    1.655204] Unpacking initramfs...
> [    1.662831] Initramfs unpacking failed: junk in compressed archive
>
> This patch prevents the above situation by requiring a larger space
> when looking for a random kernel base, so that existing logic can
> effectively avoid the overlap.
>
> Fixes: 82fa9637a2 ("x86, kaslr: Select random position from e820 maps")
> Reported-by: Fengguang Wu
> Signed-off-by: Junjie Mao
> [kees: switched to perl to avoid hex translation pain in mawk vs gawk]
> [kees: calculated overlap without relocs table]
> Signed-off-by: Kees Cook
> Cc: sta...@vger.kernel.org
> ---
> This version updates the commit log only.
>
> Kees, please help review the documentation. Thanks!
>
> Best Regards
> Junjie Mao

[...]

> diff --git a/arch/x86/tools/calc_run_size.pl b/arch/x86/tools/calc_run_size.pl
> new file mode 100644
> index ..0b0b124d3ece
> --- /dev/null
> +++ b/arch/x86/tools/calc_run_size.pl
> @@ -0,0 +1,30 @@
> +#!/usr/bin/perl
> +#
> +# Calculate the amount of space needed to run the kernel, including room for
> +# the .bss and .brk sections.
> +#
> +# Usage:
> +# objdump -h a.out | perl calc_run_size.pl
> +use strict;
> +
> +my $mem_size = 0;
> +my $file_offset = 0;
> +
> +my $sections=" *[0-9]+ \.(?:bss|brk) +";
> +while (<>) {
> +	if (/^$sections([0-9a-f]+) +(?:[0-9a-f]+ +){2}([0-9a-f]+)/) {
> +		my $size = hex($1);
> +		my $offset = hex($2);
> +		$mem_size += $size;
> +		if ($file_offset == 0) {
> +			$file_offset = $offset;
> +		} elsif ($file_offset != $offset) {
> +			die ".bss and .brk lack common file offset\n";
> +		}
> +	}
> +}
> +
> +if ($file_offset == 0) {
> +	die "Never found .bss or .brk file offset\n";
> +}
> +printf("%d\n", $mem_size + $file_offset);

Given that bss and brk are nobits (i.e. only ALLOC) sections, does
file_offset make sense as a load address?  This fails with gold:

$ git checkout v3.18-rc5
$ make    # with gold
[...]
.bss and .brk lack common file offset
.bss and .brk lack common file offset
.bss and .brk lack common file offset
.bss and .brk lack common file offset
  MKPIGGY arch/x86/boot/compressed/piggy.S
Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size
make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1
make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
make: *** [bzImage] Error 2

In ld.bfd brk/bss file_offsets match, but they differ with ld.gold:

$ objdump -h vmlinux.ld
[...]
  0 .text         00818bb3  8100      0100      0020      2**12
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
[...]
 26 .bss          000e      81fe8000  01fe8000  013e8000  2**12
                  ALLOC
 27 .brk          00026000  820c8000  020c8000  013e8000  2**0
                  ALLOC

$ objdump -h vmlinux.ld | perl arch/x86/tools/calc_run_size.pl
21946368    # aka 0x14ee000

$ objdump -h vmlinux.gold
[...]
  0 .text         00818bb3  8100      0100      1000      2**12
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
[...]
 26 .bss          000e      81feb000  01feb000  00e9      2**12
                  ALLOC
 27 .brk          00026000  820cb000  020cb000  00f7      2**0
                  ALLOC
Re: [patch] mm: memcontrol: support transparent huge pages under pressure
On Tue, Sep 23 2014, Johannes Weiner wrote:

> On Mon, Sep 22, 2014 at 10:52:50PM -0700, Greg Thelen wrote:
>>
>> On Fri, Sep 19 2014, Johannes Weiner wrote:
>>
>> > In a memcg with even just moderate cache pressure, success rates for
>> > transparent huge page allocations drop to zero, wasting a lot of
>> > effort that the allocator puts into assembling these pages.
>> >
>> > The reason for this is that the memcg reclaim code was never designed
>> > for higher-order charges.  It reclaims in small batches until there is
>> > room for at least one page.  Huge page charges only succeed when
>> > these batches add up over a series of huge faults, which is unlikely
>> > under any significant load involving order-0 allocations in the group.
>> >
>> > Remove that loop on the memcg side in favor of passing the actual
>> > reclaim goal to direct reclaim, which is already set up and optimized
>> > to meet higher-order goals efficiently.
>> >
>> > This brings memcg's THP policy in line with the system policy: if the
>> > allocator painstakingly assembles a hugepage, memcg will at least make
>> > an honest effort to charge it.  As a result, transparent hugepage
>> > allocation rates amid cache activity are drastically improved:
>> >
>> >                              vanilla                 patched
>> > pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
>> > pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
>> > pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
>> > thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
>> > thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)
>> >
>> > [ Note: this may in turn increase memory consumption from internal
>> >   fragmentation, which is an inherent risk of transparent hugepages.
>> >   Some setups may have to adjust the memcg limits accordingly to
>> >   accomodate this - or, if the machine is already packed to capacity,
>> >   disable the transparent huge page feature. ]
>>
>> We're using an earlier version of this patch, so I approve of the
>> general direction.  But I have some feedback.
>>
>> The memsw aspect of this change seems somewhat separate.  Can it be
>> split into a different patch?
>>
>> The memsw aspect of this patch seems to change behavior.  Is this
>> intended?  If so, a mention of it in the commit log would assuage the
>> reader.  I'll explain...  Assume a machine with swap enabled and
>> res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
>> is that memsw.usage represents sum(ram_usage, swap_usage).  So when
>> memsw_is_minimum=true, then both swap_usage=0 and
>> memsw.usage==res.usage.  In this condition, if res usage is at limit
>> then there's no point in swapping because memsw.usage is already
>> maximal.  Prior to this patch I think the kernel did the right thing,
>> but not afterwards.
>>
>> Before this patch:
>>   if res.usage == res.limit, try_charge() indirectly calls
>>   try_to_free_mem_cgroup_pages(noswap=true)
>>
>> After this patch:
>>   if res.usage == res.limit, try_charge() calls
>>   try_to_free_mem_cgroup_pages(may_swap=true)
>>
>> Notice the inverted swap-is-allowed value.
>
> For some reason I had myself convinced that this is dead code due to a
> change in callsites a long time ago, but you are right that currently
> try_charge() relies on it, thanks for pointing it out.
>
> However, memsw is always equal to or bigger than the memory limit - so
> instead of keeping a separate state variable to track when memory
> failure implies memsw failure, couldn't we just charge memsw first?
>
> How about the following?  But yeah, I'd split this into a separate
> patch now.

Looks good to me.  Thanks.

Acked-by: Greg Thelen <gthe...@google.com>

> ---
>  mm/memcontrol.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..7c9a8971d0f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2497,16 +2497,17 @@ retry:
>  		goto done;
>
>  	size = batch * PAGE_SIZE;
> -	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> -		if (!do_swap_account)
> +	if (!do_swap_account ||
> +	    !res_counter_charge(&memcg->memsw, size, &fail_res)) {
> +		if (!res_counter_charge(&memcg->res, size, &fail_res))
>  			goto done_restock;
> -		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> -			goto done_restock;
> -		res_counter_uncharge(&memcg->res, size);
> +		if (do_swap_account)
> +			res_counter_uncharge(&memcg->memsw, size);
> +		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> +	} else {
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
>  		may_swap = false;
> -	} else
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> +	}
>
>  	if (batch > nr_pages) {
>  		batch = nr_pages;
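To spell out the reasoning behind the ordering (my own paraphrase, not text
from the thread): because the memsw limit is always greater than or equal
to the memory limit, a failed memsw charge means swapping cannot help,
while a memory-limit failure after a successful memsw charge means swap
still has headroom.  A self-contained toy model of that flow, with invented
names and numbers, not the kernel code:

/*
 * Toy model of the "charge memsw first" ordering.  Counters, limits, and
 * the charge() helper are made up for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

struct counter { long usage, limit; };

static bool charge(struct counter *c, long n)
{
	if (c->usage + n > c->limit)
		return false;
	c->usage += n;
	return true;
}

int main(void)
{
	struct counter mem   = { .usage = 90, .limit = 100 };
	struct counter memsw = { .usage = 90, .limit = 150 };
	long size = 20;
	bool may_swap = true;

	if (charge(&memsw, size)) {
		if (charge(&mem, size)) {
			puts("charge succeeded");
			return 0;
		}
		memsw.usage -= size;	/* undo the memsw charge */
		/* over the memory limit only: swapping can still help */
	} else {
		/* over memsw: swapping pages out cannot lower memsw usage */
		may_swap = false;
	}
	printf("reclaim %ld pages, may_swap=%d\n", size, may_swap);
	return 0;
}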
Re: [patch] mm: memcontrol: support transparent huge pages under pressure
On Fri, Sep 19 2014, Johannes Weiner wrote:

> In a memcg with even just moderate cache pressure, success rates for
> transparent huge page allocations drop to zero, wasting a lot of
> effort that the allocator puts into assembling these pages.
>
> The reason for this is that the memcg reclaim code was never designed
> for higher-order charges.  It reclaims in small batches until there is
> room for at least one page.  Huge page charges only succeed when
> these batches add up over a series of huge faults, which is unlikely
> under any significant load involving order-0 allocations in the group.
>
> Remove that loop on the memcg side in favor of passing the actual
> reclaim goal to direct reclaim, which is already set up and optimized
> to meet higher-order goals efficiently.
>
> This brings memcg's THP policy in line with the system policy: if the
> allocator painstakingly assembles a hugepage, memcg will at least make
> an honest effort to charge it.  As a result, transparent hugepage
> allocation rates amid cache activity are drastically improved:
>
>                              vanilla                 patched
> pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
> pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
> pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
> thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
> thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)
>
> [ Note: this may in turn increase memory consumption from internal
>   fragmentation, which is an inherent risk of transparent hugepages.
>   Some setups may have to adjust the memcg limits accordingly to
>   accomodate this - or, if the machine is already packed to capacity,
>   disable the transparent huge page feature. ]

We're using an earlier version of this patch, so I approve of the
general direction.  But I have some feedback.

The memsw aspect of this change seems somewhat separate.  Can it be
split into a different patch?

The memsw aspect of this patch seems to change behavior.  Is this
intended?  If so, a mention of it in the commit log would assuage the
reader.  I'll explain...  Assume a machine with swap enabled and
res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
is that memsw.usage represents sum(ram_usage, swap_usage).  So when
memsw_is_minimum=true, then both swap_usage=0 and
memsw.usage==res.usage.  In this condition, if res usage is at limit
then there's no point in swapping because memsw.usage is already
maximal.  Prior to this patch I think the kernel did the right thing,
but not afterwards.

Before this patch:
  if res.usage == res.limit, try_charge() indirectly calls
  try_to_free_mem_cgroup_pages(noswap=true)

After this patch:
  if res.usage == res.limit, try_charge() calls
  try_to_free_mem_cgroup_pages(may_swap=true)

Notice the inverted swap-is-allowed value.

I haven't had time to look at your other outstanding memcg patches.
These comments were made with this patch in isolation.

> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
>  include/linux/swap.h |  6 ++--
>  mm/memcontrol.c      | 86 +++-
>  mm/vmscan.c          |  7 +++--
>  3 files changed, 25 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ea4f926e6b9b..37a585beef5c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> -						  gfp_t gfp_mask, bool noswap);
> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> +						  unsigned long nr_pages,
> +						  gfp_t gfp_mask,
> +						  bool may_swap);
>  extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> 						gfp_t gfp_mask, bool noswap,
> 						struct zone *zone,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9431024e490c..e2def11f1ec1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -315,9 +315,6 @@ struct mem_cgroup {
>  	/* OOM-Killer disable */
>  	int		oom_kill_disable;
>
> -	/* set when res.limit == memsw.limit */
> -	bool		memsw_is_minimum;
> -
>  	/* protect arrays of thresholds */
>  	struct mutex	thresholds_lock;
>
> @@ -481,14 +478,6 @@ enum res_type {
>  #define OOM_CONTROL		(0)
>
>  /*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
Re: [RFC] memory cgroup: weak points of kmem accounting design
On Tue, Sep 16 2014, Vladimir Davydov wrote:

> Hi Suleiman,
>
> On Mon, Sep 15, 2014 at 12:13:33PM -0700, Suleiman Souhlal wrote:
>> On Mon, Sep 15, 2014 at 3:44 AM, Vladimir Davydov
>> <vdavy...@parallels.com> wrote:
>> > Hi,
>> >
>> > I'd like to discuss downsides of the kmem accounting part of the memory
>> > cgroup controller and a possible way to fix them.  I'd really appreciate
>> > it if you could share your thoughts on it.
>> >
>> > The idea lying behind the kmem accounting design is to provide each
>> > memory cgroup with its private copy of every kmem_cache and list_lru
>> > it's going to use.  This is implemented by bundling these structures with
>> > arrays storing per-memcg copies.  The arrays are referenced by css id.
>> > When a process in a cgroup tries to allocate an object from a kmem cache,
>> > we first find out which cgroup the process resides in, then look up the
>> > cache copy corresponding to the cgroup, and finally allocate a new
>> > object from the private cache.  Similarly, on addition/deletion of an
>> > object from a list_lru, we first obtain the kmem cache the object was
>> > allocated from, then look up the memory cgroup which the cache belongs
>> > to, and finally add/remove the object from the private copy of the
>> > list_lru corresponding to the cgroup.
>> >
>> > Though simple it looks from the first glance, it has a number of serious
>> > weaknesses:
>> >
>> > - Providing each memory cgroup with its own kmem cache increases
>> >   external fragmentation.
>>
>> I haven't seen any evidence of this being a problem (but that doesn't
>> mean it doesn't exist).
>
> Actually, it's rather speculative.  For example, if we have say a hundred
> extra objects per cache (fragmented or in per-cpu stocks) of size 256
> bytes, then for one cache the overhead would be 25K, which is
> negligible.  Now if there are a thousand cgroups using the cache, we have
> to pay 25M, which is noticeable.  Anyway, to estimate this exactly, one
> needs to run a typical workload inside a cgroup.
>
>> > - SLAB isn't ready to deal with thousands of caches: its algorithm
>> >   walks over all system caches and shrinks them periodically, which may
>> >   be really costly if we have thousands of active memory cgroups.
>>
>> This could be throttled.
>
> It could be, but then we'd have more objects in per-cpu stocks, which
> means more memory overhead.
>
>> > - Caches may now be created/destroyed frequently and from various
>> >   places: on system cache destruction, on cgroup offline, from a work
>> >   struct scheduled by kmalloc.  Synchronizing them properly is really
>> >   difficult.  I've fixed some places, but it's still desperately buggy.
>>
>> Agreed.
>>
>> > - It's hard to determine when we should destroy a cache that belongs to
>> >   a dead memory cgroup.  The point is both SLAB and SLUB implementations
>> >   always keep some pages in stock for performance reasons, so just
>> >   scheduling cache destruction work from kfree once the last slab page
>> >   is freed isn't enough - it will normally never happen for SLUB and
>> >   may take really long for SLAB.  Of course, we can forbid the SL[AU]B
>> >   algorithm to stock pages in dead caches, but it looks ugly and has a
>> >   negative impact on performance (I did this, but finally decided to
>> >   revert).  Another approach could be scanning dead caches periodically
>> >   or on memory pressure, but that would be ugly too.
>>
>> Not sure about slub, but for SLAB doesn't cache_reap take care of that?
>
> It is, but it takes some time.  If we decide to throttle it, then it'll
> take even longer.  Anyway, SLUB has nothing like that, therefore we'd
> have to handle different algorithms in different ways, which I
> particularly dislike.
>
>> > - The arrays for storing per-memcg copies can get really large,
>> >   especially if we finally decide to leave dead memory cgroups hanging
>> >   until memory pressure reaps objects assigned to them and lets them
>> >   free.  How can we deal with an array of, say, 20K elements?  Simply
>> >   allocating them with kmal^W vmalloc will result in memory wastes.  It
>> >   will be particularly funny if the user wants to provide each cgroup
>> >   with a separate mount point: each super block will have a list_lru
>> >   for every memory cgroup, but only one of them will be really used.
>> >   That said, we need a kind of dynamic reclaimable array.  Radix trees
>> >   would fit, but they are way slower than plain arrays, which is a
>> >   no-go, because we want to look up on each kmalloc, list_lru_add/del,
>> >   which are fast paths.
>>
>> The initial design we had was to have an array indexed by "cache id"
>> in struct memcg, instead of the current array indexed by "css id" in
>> struct kmem_cache.
>> The initial design doesn't have the problem you're describing here, as
>> far as I can tell.
>
> It is indexed by "cache id", not "css id", but it doesn't matter
> actually.  Suppose, when a cgroup is taken offline it still has kmem
> objects accounted to it.  Then we have to keep its cache id along with
> the caches hosting the objects until all the objects are freed.  All
> caches and, what is worse, list_lru's will have to
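For readers unfamiliar with the design under discussion, the lookup being
criticized is conceptually an array index per allocation.  The sketch below
is a simplified illustration with invented names (demo_kmem_cache,
demo_cache_for), not the actual kernel data structures, and it glosses over
the locking and lifetime issues that are exactly the hard part being
debated.

/*
 * Toy illustration of "per-memcg copies referenced by id".  Real kernel
 * structures, RCU rules, and array resizing are far more involved.
 */
#include <stddef.h>

#define DEMO_MAX_MEMCGS 64

struct demo_kmem_cache {
	const char *name;
	/* one private copy per memcg, indexed by its id; NULL if unused */
	struct demo_kmem_cache *memcg_copy[DEMO_MAX_MEMCGS];
};

/* Pick the cache copy to allocate from for the current cgroup. */
static struct demo_kmem_cache *
demo_cache_for(struct demo_kmem_cache *root, int memcg_id)
{
	struct demo_kmem_cache *c;

	if (memcg_id < 0 || memcg_id >= DEMO_MAX_MEMCGS)
		return root;		/* root cgroup: use the shared cache */
	c = root->memcg_copy[memcg_id];
	return c ? c : root;		/* fall back if no copy exists yet */
}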
Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
On Thu, Aug 07 2014, Johannes Weiner wrote:

> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
>> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
>> > Instead of passing the request size to direct reclaim, memcg just
>> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
>> > charge can succeed.  That potentially wastes scan progress when huge
>> > page allocations require multiple invocations, which always have to
>> > restart from the default scan priority.
>> >
>> > Pass the request size as a reclaim target to direct reclaim and leave
>> > it to that code to reach the goal.
>>
>> THP charge then will ask for 512 pages to be (direct) reclaimed.  That
>> is _a lot_ and I would expect long stalls to achieve this target.  I
>> would also expect quick priority drop down and potential over-reclaim
>> for small and moderately sized memcgs (e.g. a memcg with 1G worth of
>> pages would need to drop down below DEF_PRIORITY-2 to have a chance to
>> scan that many pages).  All that done for a charge which can fall back
>> to a single page charge.
>>
>> The current code is quite hostile to THP when we are close to the limit
>> but solving this by introducing long stalls instead doesn't sound like a
>> proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
> system time is reduced with the high limit.
>
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.
>
> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

I think that high order non-THP page allocations also benefit from this.
Such allocations don't have a small page fallback.

This may be in flux, but linux-next shows me that:
* mem_cgroup_reclaim() frees at least SWAP_CLUSTER_MAX (32) pages.
* try_charge() calls mem_cgroup_reclaim() indefinitely for costly (3) or
  smaller orders assuming that something is reclaimed on each iteration.
* try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
  larger-than-costly orders.

So for larger-than-costly allocations, try_charge() should be able to
reclaim 160 (5*32) pages, which satisfies an order:7 allocation.  But
for order:8+ allocations, try_charge() and mem_cgroup_reclaim() are too
eager to give up without something like this.  So I think this patch is
a step in the right direction.

Coincidentally, we've recently been experimenting with something like
this.  Though we didn't modify the interface between
mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
looped within mem_cgroup_reclaim() until nr_pages of margin were found.
But I have no objection to the proposed plumbing of nr_pages all the way
into try_to_free_mem_cgroup_pages().
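To make the arithmetic above concrete (my own illustration, using the
constants named in the message): five retries of 32 pages gives a 160-page
budget, which covers an order:7 request (128 pages) but not order:8 (256
pages).

/* Illustration of the retry math described above; not kernel code. */
#include <stdio.h>

#define SWAP_CLUSTER_MAX		32	/* pages per reclaim pass   */
#define MEM_CGROUP_RECLAIM_RETRIES	5	/* passes for costly orders */

int main(void)
{
	int budget = MEM_CGROUP_RECLAIM_RETRIES * SWAP_CLUSTER_MAX; /* 160 */

	printf("reclaim budget: %d pages\n", budget);
	printf("order:7 needs %d pages -> %s\n", 1 << 7,
	       budget >= (1 << 7) ? "covered" : "not covered");
	printf("order:8 needs %d pages -> %s\n", 1 << 8,
	       budget >= (1 << 8) ? "covered" : "not covered");
	return 0;
}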
[PATCH] dm bufio: fully initialize shrinker
1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to struct shrinker assuming that all shrinkers were zero filled. The dm bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data in flags. So far the only defined flags bit is SHRINKER_NUMA_AWARE. But there are proposed patches which add other bits to shrinker.flags (e.g. memcg awareness). Rather than simply initializing the shrinker, this patch uses kzalloc() when allocating the dm_bufio_client to ensure that the embedded shrinker and any other similar structures are zeroed. This fixes theoretical over aggressive shrinking of dm bufio objects. If the uninitialized dm_bufio_client.shrinker.flags contains SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for each numa node rather than just once. This has been broken since 3.12. Signed-off-by: Greg Thelen --- drivers/md/dm-bufio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index 4e84095833db..d724459860d9 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1541,7 +1541,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign BUG_ON(block_size < 1 << SECTOR_SHIFT || (block_size & (block_size - 1))); - c = kmalloc(sizeof(*c), GFP_KERNEL); + c = kzalloc(sizeof(*c), GFP_KERNEL); if (!c) { r = -ENOMEM; goto bad_client; -- 2.0.0.526.g5318336 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups
On Wed, Jul 9, 2014 at 9:36 AM, Vladimir Davydov wrote: > Hi Tim, > > On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote: >> How is this different from RLIMIT_AS? You specifically mentioned it >> earlier but you don't explain how this is different. > > The main difference is that RLIMIT_AS is per process while this > controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for > a group of unrelated or cooperating through shmem processes. > > Also RLIMIT_AS accounts for total VM usage (including file mappings), > while this only charges private writable and shared mappings, whose > faulted-in pages always occupy mem+swap and therefore cannot be just > synced and dropped like file pages. In other words, this controller > works exactly as the global overcommit control. > >> From my perspective, this is pointless. There's plenty of perfectly >> correct software that mmaps files without concern for VSIZE, because >> they never fault most of those pages in. > > But there's also software that correctly handles ENOMEM returned by > mmap. For example, mongodb keeps growing its buffers until mmap fails. > Therefore, if there's no overcommit control, it will be OOM-killed > sooner or later, which may be pretty annoying. And we did have customers > complaining about that. Is mongodb's buffer growth causing the oom kills? If yes, I wonder if apps, like mongodb, that want ENOMEM should (1) use MAP_POPULATE and (2) we change vm_map_pgoff() to propagate mm_populate() ENOMEM failures back to mmap()? >> From my observations it is not generally possible to predict an >> average VSIZE limit that would satisfy your concerns *and* not kill >> lots of valid apps. > > Yes, it's difficult. Actually, we can only guess. Nevertheless, we > predict and set the VSIZE limit system-wide by default. > >> It sounds like what you want is to limit or even disable swap usage. > > I want to avoid OOM kill if it's possible to return ENOMEM. OOM can be > painful. It can kill lots of innocent processes. Of course, the user can > protect some processes by setting oom_score_adj, but this is difficult > and requires time and expertise, so an average user won't do that. > >> Given your example, your hypothetical user would probably be better of >> getting an OOM kill early so she can fix her job spec to request more >> memory. > > In my example the user won't get OOM kill *early*... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
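A userspace sketch of that suggestion is below. Note that it assumes the proposed kernel change, i.e. that mmap() with MAP_POPULATE would return ENOMEM when the populate step cannot back the mapping; without that change the call may still succeed and the process would fault the pages in (and possibly be OOM-killed) later.

#include <sys/mman.h>
#include <errno.h>
#include <stddef.h>

/* Grow a buffer only while the kernel can actually back it with memory. */
static void *grow_buffer(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED) {
		if (errno == ENOMEM)
			return NULL;	/* stop growing; keep the buffers we have */
		return NULL;		/* other failure */
	}
	return p;
}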
[PATCH] memcg: remove lookup_cgroup_page() prototype
6b208e3f6e35 ("mm: memcg: remove unused node/section info from pc->flags") deleted the lookup_cgroup_page() function but left a prototype for it. Kill the vestigial prototype. Signed-off-by: Greg Thelen --- include/linux/page_cgroup.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 777a524716db..0ff470de3c12 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -42,7 +42,6 @@ static inline void __init page_cgroup_init(void) #endif struct page_cgroup *lookup_page_cgroup(struct page *page); -struct page *lookup_cgroup_page(struct page_cgroup *pc); #define TESTPCGFLAG(uname, lname) \ static inline int PageCgroup##uname(struct page_cgroup *pc)\ -- 2.0.0.526.g5318336 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
On Tue, Jun 10 2014, Johannes Weiner wrote: > On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote: >> >> On Fri, Jun 06 2014, Michal Hocko wrote: >> >> > Some users (e.g. Google) would like to have stronger semantic than low >> > limit offers currently. The fallback mode is not desirable and they >> > prefer hitting OOM killer rather than ignoring low limit for protected >> > groups. There are other possible usecases which can benefit from hard >> > guarantees. I can imagine workloads where setting low_limit to the same >> > value as hard_limit to prevent from any reclaim at all makes a lot of >> > sense because reclaim is much more disrupting than restart of the load. >> > >> > This patch adds a new per memcg memory.reclaim_strategy knob which >> > tells what to do in a situation when memory reclaim cannot do any >> > progress because all groups in the reclaimed hierarchy are within their >> > low_limit. There are two options available: >> >- low_limit_best_effort - the current mode when reclaim falls >> > back to the even reclaim of all groups in the reclaimed >> > hierarchy >> >- low_limit_guarantee - groups within low_limit are never >> > reclaimed and OOM killer is triggered instead. OOM message >> > will mention the fact that the OOM was triggered due to >> > low_limit reclaim protection. >> >> To (a) be consistent with existing hard and soft limits APIs and (b) >> allow use of both best effort and guarantee memory limits, I wonder if >> it's best to offer three per memcg limits, rather than two limits (hard, >> low_limit) and a related reclaim_strategy knob. The three limits I'm >> thinking about are: >> >> 1) hard_limit (aka the existing limit_in_bytes cgroupfs file). No >>change needed here. This is an upper bound on a memcg hierarchy's >>memory consumption (assuming use_hierarchy=1). > > This creates internal pressure. Outside reclaim is not affected by > it, but internal charges can not exceed this limit. This is set to > hard limit the maximum memory consumption of a group (max). > >> 2) best_effort_limit (aka desired working set). This allow an >>application or administrator to provide a hint to the kernel about >>desired working set size. Before oom'ing the kernel is allowed to >>reclaim below this limit. I think the current soft_limit_in_bytes >>claims to provide this. If we prefer to deprecate >>soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a >>hopefully better named) API seems reasonable. > > This controls how external pressure applies to the group. > > But it's conceivable that we'd like to have the equivalent of such a > soft limit for *internal* pressure. Set below the hard limit, this > internal soft limit would have charges trigger direct reclaim in the > memcg but allow them to continue to the hard limit. This would create > a situation wherein the allocating tasks are not killed, but throttled > under reclaim, which gives the administrator a window to detect the > situation with vmpressure and possibly intervene. Because as it > stands, once the current hard limit is hit things can go down pretty > fast and the window for reacting to vmpressure readings is often too > small. This would offer a more gradual deterioration. It would be > set to the upper end of the working set size range (high). > > I think for many users such an internal soft limit would actually be > preferred over the current hard limit, as they'd rather have some > reclaim throttling than an OOM kill when the group reaches its upper > bound. 
The current hard limit would be reserved for more advanced or > paid cases, where the admin would rather see a memcg get OOM killed > than exceed a certain size. > > Then, as you proposed, we'd have the soft limit for external pressure, > where the kernel only reclaims groups within that limit in order to > avoid OOM kills. It would be set to the estimated lower end of the > working set size range (low). > >> 3) low_limit_guarantee which is a lower bound of memory usage. A memcg >> would prefer to be oom killed rather than operate below this >> threshold. Default value is zero to preserve compatibility with >> existing apps. > > And this would be the external pressure hard limit, which would be set > to the absolute minimum requirement of the group (min). > > Either because it would be hopelessly thrashing without it, or because > this guaranteed memory is actually paid for. Again, I would expect > many users to not even set this minimum guarantee but solely use the > external soft limit (low) instead. > >> Logically hard_limit >= best_effort_limit >= low_limit_guarantee. > > max >= high >= low >= min > > I think we should be able to express all desired usecases with these > four limits, including the advanced configurations, while making it easy > for many users to set up groups without being a) dead certain about their > memory consumption or b) prepared for frequent OOM kills, while still > allowing them to properly utilize their machines. > > What do you think? Sounds good.
Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
On Fri, Jun 06 2014, Michal Hocko wrote: > Some users (e.g. Google) would like to have stronger semantic than low > limit offers currently. The fallback mode is not desirable and they > prefer hitting OOM killer rather than ignoring low limit for protected > groups. There are other possible usecases which can benefit from hard > guarantees. I can imagine workloads where setting low_limit to the same > value as hard_limit to prevent from any reclaim at all makes a lot of > sense because reclaim is much more disrupting than restart of the load. > > This patch adds a new per memcg memory.reclaim_strategy knob which > tells what to do in a situation when memory reclaim cannot do any > progress because all groups in the reclaimed hierarchy are within their > low_limit. There are two options available: > - low_limit_best_effort - the current mode when reclaim falls > back to the even reclaim of all groups in the reclaimed > hierarchy > - low_limit_guarantee - groups within low_limit are never > reclaimed and OOM killer is triggered instead. OOM message > will mention the fact that the OOM was triggered due to > low_limit reclaim protection. To (a) be consistent with existing hard and soft limits APIs and (b) allow use of both best effort and guarantee memory limits, I wonder if it's best to offer three per memcg limits, rather than two limits (hard, low_limit) and a related reclaim_strategy knob. The three limits I'm thinking about are: 1) hard_limit (aka the existing limit_in_bytes cgroupfs file). No change needed here. This is an upper bound on a memcg hierarchy's memory consumption (assuming use_hierarchy=1). 2) best_effort_limit (aka desired working set). This allow an application or administrator to provide a hint to the kernel about desired working set size. Before oom'ing the kernel is allowed to reclaim below this limit. I think the current soft_limit_in_bytes claims to provide this. If we prefer to deprecate soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a hopefully better named) API seems reasonable. 3) low_limit_guarantee which is a lower bound of memory usage. A memcg would prefer to be oom killed rather than operate below this threshold. Default value is zero to preserve compatibility with existing apps. Logically hard_limit >= best_effort_limit >= low_limit_guarantee. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/4] memcg: Low-limit reclaim
On Wed, May 28 2014, Johannes Weiner wrote: > On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote: >> On Wed 28-05-14 09:49:05, Johannes Weiner wrote: >> > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote: >> > > Hi Andrew, Johannes, >> > > >> > > On Mon 28-04-14 14:26:41, Michal Hocko wrote: >> > > > This patchset introduces such low limit that is functionally similar >> > > > to a minimum guarantee. Memcgs which are under their lowlimit are not >> > > > considered eligible for the reclaim (both global and hardlimit) unless >> > > > all groups under the reclaimed hierarchy are below the low limit when >> > > > all of them are considered eligible. >> > > > >> > > > The previous version of the patchset posted as a RFC >> > > > (http://marc.info/?l=linux-mm=138677140628677=2) suggested a >> > > > hard guarantee without any fallback. More discussions led me to >> > > > reconsidering the default behavior and come up a more relaxed one. The >> > > > hard requirement can be added later based on a use case which really >> > > > requires. It would be controlled by memory.reclaim_flags knob which >> > > > would specify whether to OOM or fallback (default) when all groups are >> > > > bellow low limit. >> > > >> > > It seems that we are not in a full agreement about the default behavior >> > > yet. Johannes seems to be more for hard guarantee while I would like to >> > > see the weaker approach first and move to the stronger model later. >> > > Johannes, is this absolutely no-go for you? Do you think it is seriously >> > > handicapping the semantic of the new knob? >> > >> > Well we certainly can't start OOMing where we previously didn't, >> > that's called a regression and automatically limits our options. >> > >> > Any unexpected OOMs will be much more acceptable from a new feature >> > than from configuration that previously "worked" and then stopped. >> >> Yes and we are not talking about regressions, are we? >> >> > > My main motivation for the weaker model is that it is hard to see all >> > > the corner case right now and once we hit them I would like to see a >> > > graceful fallback rather than fatal action like OOM killer. Besides that >> > > the usaceses I am mostly interested in are OK with fallback when the >> > > alternative would be OOM killer. I also feel that introducing a knob >> > > with a weaker semantic which can be made stronger later is a sensible >> > > way to go. >> > >> > We can't make it stronger, but we can make it weaker. >> >> Why cannot we make it stronger by a knob/configuration option? > > Why can't we make it weaker by a knob? Why should we design the > default for unforeseeable cornercases rather than make the default > make sense for existing cases and give cornercases a fallback once > they show up? My 2c... The following works for my use cases: 1) introduce memory.low_limit_in_bytes (default=0 thus no default change from older kernels) 2) interested users will set low_limit_in_bytes to non-zero value. Memory protected by low limit should be as migratable/reclaimable as mlock memory. If a zone full of mlock memory causes oom kills, then so should the low limit. If we find corner cases where low_limit_in_bytes is too strict, then we could discuss a new knob to relax it. But I think we should start with a strict low-limit. If the oom killer gets tied in knots due to low limit, then I'd like to explore fixing the oom killer before relaxing low limit. Disclaimer: new use cases will certainly appear with various requirements. 
But an oom-killing low_limit_in_bytes seems like a generic opt-in feature, so I think it's worthwhile. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
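A minimal sketch of the strict semantics argued for above; the names are invented for illustration and do not match the patchset's actual helpers.

struct memcg_sketch {
	unsigned long long usage_in_bytes;
	unsigned long long low_limit_in_bytes;	/* default 0: no protection */
};

/*
 * Strict interpretation: a memcg at or below its low limit is simply
 * never eligible for reclaim (there is no best-effort fallback pass),
 * so the only remaining option under sustained pressure is the OOM
 * killer.
 */
static int memcg_eligible_for_reclaim(const struct memcg_sketch *memcg)
{
	return memcg->usage_in_bytes > memcg->low_limit_in_bytes;
}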
Re: [PATCH] memcg: deprecate memory.force_empty knob
On Tue, May 13 2014, Michal Hocko wrote: > force_empty has been introduced primarily to drop memory before it gets > reparented on the group removal. This alone doesn't sound fully > justified because reparented pages which are not in use can be reclaimed > also later when there is a memory pressure on the parent level. > > Mark the knob CFTYPE_INSANE which tells the cgroup core that it > shouldn't create the knob with the experimental sane_behavior. Other > users will get informed about the deprecation and asked to tell us more > because I do not expect most users will use sane_behavior cgroups mode > very soon. > Anyway I expect that most users will be simply cgroup remove handlers > which do that since ever without having any good reason for it. > > If somebody really cares because reparented pages, which would be > dropped otherwise, push out more important ones then we should fix the > reparenting code and put pages to the tail. I should mention a case where I've needed to use memory.force_empty: to synchronously flush stats from child to parent. Without force_empty memory.stat is temporarily inconsistent until async css_offline reparents charges. Here is an example on v3.14 showing that parent/memory.stat contents are in-flux immediately after rmdir of parent/child. $ cat /test #!/bin/bash # Create parent and child. Add some non-reclaimable anon rss to child, # then move running task to parent. mkdir p p/c (echo $BASHPID > p/c/cgroup.procs && exec sleep 1d) & pid=$! sleep 1 echo $pid > p/cgroup.procs grep 'rss ' {p,p/c}/memory.stat if [[ $1 == force ]]; then echo 1 > p/c/memory.force_empty fi rmdir p/c echo 'For a small time the p/c memory has not been reparented to p.' grep 'rss ' {p,p/c}/memory.stat sleep 1 echo 'After waiting all memory has been reparented' grep 'rss ' {p,p/c}/memory.stat kill $pid rmdir p -- First, demonstrate that just rmdir, without memory.force_empty, temporarily hides reparented child memory stats. $ /test p/memory.stat:rss 0 p/memory.stat:total_rss 69632 p/c/memory.stat:rss 69632 p/c/memory.stat:total_rss 69632 For a small time the p/c memory has not been reparented to p. p/memory.stat:rss 0 p/memory.stat:total_rss 0 grep: p/c/memory.stat: No such file or directory After waiting all memory has been reparented p/memory.stat:rss 69632 p/memory.stat:total_rss 69632 grep: p/c/memory.stat: No such file or directory /test: Terminated ( echo $BASHPID > p/c/cgroup.procs && exec sleep 1d ) -- Demonstrate that using memory.force_empty before rmdir, behaves more sensibly. Stats for reparented child memory are not hidden. $ /test force p/memory.stat:rss 0 p/memory.stat:total_rss 69632 p/c/memory.stat:rss 69632 p/c/memory.stat:total_rss 69632 For a small time the p/c memory has not been reparented to p. p/memory.stat:rss 69632 p/memory.stat:total_rss 69632 grep: p/c/memory.stat: No such file or directory After waiting all memory has been reparented p/memory.stat:rss 69632 p/memory.stat:total_rss 69632 grep: p/c/memory.stat: No such file or directory /test: Terminated ( echo $BASHPID > p/c/cgroup.procs && exec sleep 1d ) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 2/6] mm, compaction: return failed migration target pages back to freelist
On Wed, May 07 2014, Andrew Morton wrote: > On Tue, 6 May 2014 19:22:43 -0700 (PDT) David Rientjes > wrote: > >> Memory compaction works by having a "freeing scanner" scan from one end of a >> zone which isolates pages as migration targets while another "migrating >> scanner" >> scans from the other end of the same zone which isolates pages for migration. >> >> When page migration fails for an isolated page, the target page is returned >> to >> the system rather than the freelist built by the freeing scanner. This may >> require the freeing scanner to continue scanning memory after suitable >> migration >> targets have already been returned to the system needlessly. >> >> This patch returns destination pages to the freeing scanner freelist when >> page >> migration fails. This prevents unnecessary work done by the freeing scanner >> but >> also encourages memory to be as compacted as possible at the end of the zone. >> >> Reported-by: Greg Thelen > > What did Greg actually report? IOW, what if any observable problem is > being fixed here? I detected the problem at runtime seeing that ext4 metadata pages (esp the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were constantly visited by compaction calls of migrate_pages(). These pages had a non-zero b_count which caused fallback_migrate_page() -> try_to_release_page() -> try_to_free_buffers() to fail. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
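For reference, a simplified sketch of the behavioural change being described; the types are illustrative stand-ins, not the real compaction code.

#include <linux/list.h>

struct page_sketch {
	struct list_head lru;
};

struct compact_control_sketch {
	struct list_head freepages;	/* targets isolated by the freeing scanner */
	unsigned long nr_freepages;
};

static void return_failed_migration_target(struct compact_control_sketch *cc,
					   struct page_sketch *page)
{
	/* new behaviour: keep the page available as a migration target */
	list_add(&page->lru, &cc->freepages);
	cc->nr_freepages++;
	/*
	 * old behaviour released it to the buddy allocator, forcing the
	 * freeing scanner to isolate a replacement later
	 */
}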
Re: [PATCH v2 0/4] memcg: Low-limit reclaim
On Mon, Apr 28 2014, Roman Gushchin wrote: > 28.04.2014, 16:27, "Michal Hocko" : >> The series is based on top of the current mmotm tree. Once the series >> gets accepted I will post a patch which will mark the soft limit as >> deprecated with a note that it will be eventually dropped. Let me know >> if you would prefer to have such a patch as part of the series. >> >> Thoughts? > > > Looks good to me. > > The only question is: are there any ideas how the hierarchy support > will be used in this case in practice? > Will someone set low limit for non-leaf cgroups? Why? > > Thanks, > Roman I imagine that a hosting service may want to give X MB to a top level memcg (/a) with sub-jobs (/a/b, /a/c) which may(not) have their own low-limits. Examples: case_1) only set low limit on /a. /a/b and /a/c may overcommit /a's memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes). case_2) low limits on all memcg. But not overcommitting low_limits (b.low_limit_in_bytes + c.low_limit_in_bytes <= a.low_limit_in_bytes). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v2.1] mm: get rid of __GFP_KMEMCG
On Tue, Apr 01 2014, Vladimir Davydov wrote: > Currently to allocate a page that should be charged to kmemcg (e.g. > threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page > allocated is then to be freed by free_memcg_kmem_pages. Apart from > looking asymmetrical, this also requires intrusion to the general > allocation path. So let's introduce separate functions that will > alloc/free pages charged to kmemcg. > > The new functions are called alloc_kmem_pages and free_kmem_pages. They > should be used when the caller actually would like to use kmalloc, but > has to fall back to the page allocator for the allocation is large. They > only differ from alloc_pages and free_pages in that besides allocating > or freeing pages they also charge them to the kmem resource counter of > the current memory cgroup. > > Signed-off-by: Vladimir Davydov One comment nit below, otherwise looks good to me. Acked-by: Greg Thelen > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Glauber Costa > Cc: Christoph Lameter > Cc: Pekka Enberg > --- > Changes in v2.1: > - add missing kmalloc_order forward declaration; lacking it caused >compilation breakage with CONFIG_TRACING=n > > include/linux/gfp.h | 10 --- > include/linux/memcontrol.h |2 +- > include/linux/slab.h| 11 +--- > include/linux/thread_info.h |2 -- > include/trace/events/gfpflags.h |1 - > kernel/fork.c |6 ++--- > mm/page_alloc.c | 56 > --- > mm/slab_common.c| 12 + > mm/slub.c |6 ++--- > 9 files changed, 61 insertions(+), 45 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 39b81dc7d01a..d382db71e300 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -31,7 +31,6 @@ struct vm_area_struct; > #define ___GFP_HARDWALL 0x2u > #define ___GFP_THISNODE 0x4u > #define ___GFP_RECLAIMABLE 0x8u > -#define ___GFP_KMEMCG0x10u > #define ___GFP_NOTRACK 0x20u > #define ___GFP_NO_KSWAPD 0x40u > #define ___GFP_OTHER_NODE0x80u > @@ -91,7 +90,6 @@ struct vm_area_struct; > > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of > other node */ > -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from > a memcg-accounted resource */ > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to > dirty page */ > > /* > @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int > order, > #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ > alloc_pages_vma(gfp_mask, 0, vma, addr, node) > > +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); > +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, > + unsigned int order); > + > extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); > extern unsigned long get_zeroed_page(gfp_t gfp_mask); > > @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int > order); > extern void free_hot_cold_page(struct page *page, int cold); > extern void free_hot_cold_page_list(struct list_head *list, int cold); > > -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); > -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); > +extern void __free_kmem_pages(struct page *page, unsigned int order); > +extern void free_kmem_pages(unsigned long addr, unsigned int order); > > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > 
index 29068dd26c3d..13acdb5259f5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup > **memcg, int order) >* res_counter_charge_nofail, but we hope those allocations are rare, >* and won't be worth the trouble. >*/ Just a few lines higher in first memcg_kmem_newpage_charge() comment, there is a leftover reference to GFP_KMEMCG which should be removed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v2.1] mm: get rid of __GFP_KMEMCG
On Tue, Apr 01 2014, Vladimir Davydov vdavy...@parallels.com wrote: Currently to allocate a page that should be charged to kmemcg (e.g. threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page allocated is then to be freed by free_memcg_kmem_pages. Apart from looking asymmetrical, this also requires intrusion to the general allocation path. So let's introduce separate functions that will alloc/free pages charged to kmemcg. The new functions are called alloc_kmem_pages and free_kmem_pages. They should be used when the caller actually would like to use kmalloc, but has to fall back to the page allocator for the allocation is large. They only differ from alloc_pages and free_pages in that besides allocating or freeing pages they also charge them to the kmem resource counter of the current memory cgroup. Signed-off-by: Vladimir Davydov vdavy...@parallels.com One comment nit below, otherwise looks good to me. Acked-by: Greg Thelen gthe...@google.com Cc: Johannes Weiner han...@cmpxchg.org Cc: Michal Hocko mho...@suse.cz Cc: Glauber Costa glom...@gmail.com Cc: Christoph Lameter c...@linux-foundation.org Cc: Pekka Enberg penb...@kernel.org --- Changes in v2.1: - add missing kmalloc_order forward declaration; lacking it caused compilation breakage with CONFIG_TRACING=n include/linux/gfp.h | 10 --- include/linux/memcontrol.h |2 +- include/linux/slab.h| 11 +--- include/linux/thread_info.h |2 -- include/trace/events/gfpflags.h |1 - kernel/fork.c |6 ++--- mm/page_alloc.c | 56 --- mm/slab_common.c| 12 + mm/slub.c |6 ++--- 9 files changed, 61 insertions(+), 45 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 39b81dc7d01a..d382db71e300 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -31,7 +31,6 @@ struct vm_area_struct; #define ___GFP_HARDWALL 0x2u #define ___GFP_THISNODE 0x4u #define ___GFP_RECLAIMABLE 0x8u -#define ___GFP_KMEMCG0x10u #define ___GFP_NOTRACK 0x20u #define ___GFP_NO_KSWAPD 0x40u #define ___GFP_OTHER_NODE0x80u @@ -91,7 +90,6 @@ struct vm_area_struct; #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ /* @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ alloc_pages_vma(gfp_mask, 0, vma, addr, node) +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, + unsigned int order); + extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); extern unsigned long get_zeroed_page(gfp_t gfp_mask); @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int order); extern void free_hot_cold_page(struct page *page, int cold); extern void free_hot_cold_page_list(struct list_head *list, int cold); -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); +extern void __free_kmem_pages(struct page *page, unsigned int order); +extern void free_kmem_pages(unsigned long addr, unsigned int order); #define __free_page(page) __free_pages((page), 0) #define free_page(addr) free_pages((addr), 0) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 
29068dd26c3d..13acdb5259f5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order) * res_counter_charge_nofail, but we hope those allocations are rare, * and won't be worth the trouble. */ Just a few lines higher in first memcg_kmem_newpage_charge() comment, there is a leftover reference to GFP_KMEMCG which should be removed. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
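As a rough illustration of the new interface (not a hunk from the patch): a caller like the thread_info allocation in kernel/fork.c would switch from the __GFP_KMEMCG convention to the paired helpers. The gfp mask and order macros below come from the existing kernel code; the function bodies are only a sketch.

    /* Sketch only: allocate kmem-accounted pages and free them with the
     * matching helper, instead of passing __GFP_KMEMCG to alloc_pages(). */
    static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, int node)
    {
            struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
                                                      THREAD_SIZE_ORDER);

            return page ? page_address(page) : NULL;
    }

    static void free_thread_info(struct thread_info *ti)
    {
            /* uncharges the kmem counter and releases the pages */
            free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
    }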
Re: [PATCH] ipc,shm: increase default size for shmmax
On Tue, Apr 01 2014, Kamezawa Hiroyuki wrote: >> On Tue, Apr 01 2014, Davidlohr Bueso wrote: >> >>> On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote: >>>>>>> Ah-hah, that's interesting info. >>>>>>> >>>>>>> Let's make the default 64GB? >>>>>> >>>>>> 64GB is infinity at that time, but it no longer near infinity today. I >>>>>> like >>>>>> very large or total memory proportional number. >>>>> >>>>> So I still like 0 for unlimited. Nice, clean and much easier to look at >>>>> than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really >>>>> don't see any disadvantages, as opposed to some other arbitrary value. >>>>> Furthermore it wouldn't break userspace: any existing sysctl would >>>>> continue to work, and if not set, the user never has to worry about this >>>>> tunable again. >>>>> >>>>> Please let me know if you all agree with this... >>>> >>>> Surething. Why not. :) >>> >>> *sigh* actually, the plot thickens a bit with SHMALL (total size of shm >>> segments system wide, in pages). Currently by default: >>> >>> #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16)) >>> >>> This deals with physical memory, at least admins are recommended to set >>> it to some large percentage of ram / pagesize. So I think that if we >>> loose control over the default value, users can potentially DoS the >>> system, or at least cause excessive swapping if not manually set, but >>> then again the same goes for anon mem... so do we care? >> > (2014/04/02 10:08), Greg Thelen wrote: >> >> At least when there's an egregious anon leak the oom killer has the >> power to free the memory by killing until the memory is unreferenced. >> This isn't true for shm or tmpfs. So shm is more effective than anon at >> crushing a machine. > > Hm..sysctl.kernel.shm_rmid_forced won't work with oom-killer ? > > http://www.openwall.com/lists/kernel-hardening/2011/07/26/7 > > I like to handle this kind of issue under memcg but hmm..tmpfs's limit is half > of memory at default. Ah, yes. I forgot about shm_rmid_forced. Thanks. It would give the oom killer ability to cleanup shm (as it does with anon) when shm_rmid_forced=1. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
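For a sense of scale, a worked example of the SHMALL default quoted above, assuming the traditional values (SHMMAX = 33554432 bytes, SHMMNI = 4096) and 4 KiB pages:

    SHMALL = SHMMAX / getpagesize() * (SHMMNI / 16)
           = 33554432 / 4096 * (4096 / 16)
           = 8192 * 256
           = 2097152 pages, i.e. roughly 8 GiB of shm system wide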
Re: [PATCH -mm v2 1/2] sl[au]b: charge slabs to kmemcg explicitly
On Tue, Apr 01 2014, Vladimir Davydov wrote: > We have only a few places where we actually want to charge kmem so > instead of intruding into the general page allocation path with > __GFP_KMEMCG it's better to explictly charge kmem there. All kmem > charges will be easier to follow that way. > > This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG > from memcg caches' allocflags. Instead it makes slab allocation path > call memcg_charge_kmem directly getting memcg to charge from the cache's > memcg params. > > This also eliminates any possibility of misaccounting an allocation > going from one memcg's cache to another memcg, because now we always > charge slabs against the memcg the cache belongs to. That's why this > patch removes the big comment to memcg_kmem_get_cache. > > Signed-off-by: Vladimir Davydov Acked-by: Greg Thelen -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
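A minimal sketch of the charging step described in the changelog, assuming the memcg_charge_kmem() interface it names takes the owning memcg, the gfp mask and a size in bytes; the wrapper name and the is_root_cache() check are illustrative, not quotes from the patch.

    /* charge a new slab of 2^order pages to the cache's own memcg */
    static int memcg_charge_slab(struct kmem_cache *s, gfp_t gfp, int order)
    {
            if (is_root_cache(s))
                    return 0;       /* root caches stay unaccounted */
            return memcg_charge_kmem(s->memcg_params->memcg, gfp,
                                     PAGE_SIZE << order);
    }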
Re: [PATCH] ipc,shm: increase default size for shmmax
On Tue, Apr 01 2014, Davidlohr Bueso wrote: > On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote: >> >> > Ah-hah, that's interesting info. >> >> > >> >> > Let's make the default 64GB? >> >> >> >> 64GB is infinity at that time, but it no longer near infinity today. I >> >> like >> >> very large or total memory proportional number. >> > >> > So I still like 0 for unlimited. Nice, clean and much easier to look at >> > than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really >> > don't see any disadvantages, as opposed to some other arbitrary value. >> > Furthermore it wouldn't break userspace: any existing sysctl would >> > continue to work, and if not set, the user never has to worry about this >> > tunable again. >> > >> > Please let me know if you all agree with this... >> >> Surething. Why not. :) > > *sigh* actually, the plot thickens a bit with SHMALL (total size of shm > segments system wide, in pages). Currently by default: > > #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16)) > > This deals with physical memory, at least admins are recommended to set > it to some large percentage of ram / pagesize. So I think that if we > loose control over the default value, users can potentially DoS the > system, or at least cause excessive swapping if not manually set, but > then again the same goes for anon mem... so do we care? At least when there's an egregious anon leak the oom killer has the power to free the memory by killing until the memory is unreferenced. This isn't true for shm or tmpfs. So shm is more effective than anon at crushing a machine. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v2 2/2] mm: get rid of __GFP_KMEMCG
On Tue, Apr 01 2014, Vladimir Davydov wrote: > Currently to allocate a page that should be charged to kmemcg (e.g. > threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page > allocated is then to be freed by free_memcg_kmem_pages. Apart from > looking asymmetrical, this also requires intrusion to the general > allocation path. So let's introduce separate functions that will > alloc/free pages charged to kmemcg. > > The new functions are called alloc_kmem_pages and free_kmem_pages. They > should be used when the caller actually would like to use kmalloc, but > has to fall back to the page allocator for the allocation is large. They > only differ from alloc_pages and free_pages in that besides allocating > or freeing pages they also charge them to the kmem resource counter of > the current memory cgroup. > > Signed-off-by: Vladimir Davydov > --- > include/linux/gfp.h | 10 --- > include/linux/memcontrol.h |2 +- > include/linux/slab.h| 11 > include/linux/thread_info.h |2 -- > include/trace/events/gfpflags.h |1 - > kernel/fork.c |6 ++--- > mm/page_alloc.c | 56 > --- > mm/slab_common.c| 12 + > mm/slub.c |6 ++--- > 9 files changed, 60 insertions(+), 46 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 39b81dc7d01a..d382db71e300 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -31,7 +31,6 @@ struct vm_area_struct; > #define ___GFP_HARDWALL 0x2u > #define ___GFP_THISNODE 0x4u > #define ___GFP_RECLAIMABLE 0x8u > -#define ___GFP_KMEMCG0x10u > #define ___GFP_NOTRACK 0x20u > #define ___GFP_NO_KSWAPD 0x40u > #define ___GFP_OTHER_NODE0x80u > @@ -91,7 +90,6 @@ struct vm_area_struct; > > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of > other node */ > -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from > a memcg-accounted resource */ > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to > dirty page */ > > /* > @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int > order, > #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ > alloc_pages_vma(gfp_mask, 0, vma, addr, node) > > +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); > +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, > + unsigned int order); > + > extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); > extern unsigned long get_zeroed_page(gfp_t gfp_mask); > > @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int > order); > extern void free_hot_cold_page(struct page *page, int cold); > extern void free_hot_cold_page_list(struct list_head *list, int cold); > > -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); > -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); > +extern void __free_kmem_pages(struct page *page, unsigned int order); > +extern void free_kmem_pages(unsigned long addr, unsigned int order); > > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 29068dd26c3d..13acdb5259f5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup > **memcg, int order) >* res_counter_charge_nofail, but we hope those allocations are rare, >* and won't be worth the trouble. 
>*/ > - if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL)) > + if (gfp & __GFP_NOFAIL) > return true; > if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) > return true; > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 3dd389aa91c7..6d6959292e00 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -358,17 +358,6 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s, > #include <linux/slub_def.h> > #endif > > -static __always_inline void * > -kmalloc_order(size_t size, gfp_t flags, unsigned int order) > -{ > - void *ret; > - > - flags |= (__GFP_COMP | __GFP_KMEMCG); > - ret = (void *) __get_free_pages(flags, order); > - kmemleak_alloc(ret, size, 1, flags); > - return ret; > -} > - Removing this from the header file breaks builds without CONFIG_TRACING. Example: % make allnoconfig && make -j4 mm/ [...] include/linux/slab.h: In function ‘kmalloc_order_trace’: include/linux/slab.h:367:2: error: implicit declaration of function ‘kmalloc_order’
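The v2.1 repost addresses this by keeping a forward declaration of kmalloc_order() in the header; a sketch of the shape of that fix (the exact placement within slab.h is illustrative):

    /* the definition moves out of line; this declaration keeps the
     * !CONFIG_TRACING wrapper below compiling */
    extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order);

    #ifdef CONFIG_TRACING
    extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order);
    #else
    static __always_inline void *
    kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
    {
            return kmalloc_order(size, flags, order);
    }
    #endif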
Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov wrote: > Hi Greg, > > On 03/27/2014 08:31 AM, Greg Thelen wrote: >> On Wed, Mar 26 2014, Vladimir Davydov wrote: >> >>> We don't track any random page allocation, so we shouldn't track kmalloc >>> that falls back to the page allocator. >> This seems like a change which will leads to confusing (and arguably >> improper) kernel behavior. I prefer the behavior prior to this patch. >> >> Before this change both of the following allocations are charged to >> memcg (assuming kmem accounting is enabled): >> a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL) >> b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL) >> >> After this change only 'a' is charged; 'b' goes directly to page >> allocator which no longer does accounting. > > Why do we need to charge 'b' in the first place? Can the userspace > trigger such allocations massively? If there can only be one or two such > allocations from a cgroup, is there any point in charging them? Of the top of my head I don't know of any >8KIB kmalloc()s so I can't say if they're directly triggerable by user space en masse. But we recently ran into some order:3 allocations in networking. The networking allocations used a non-generic kmem_cache (rather than kmalloc which started this discussion). For details, see ed98df3361f0 ("net: use __GFP_NORETRY for high order allocations"). I can't say if such allocations exist in device drivers, but given the networking example, it's conceivable that they may (or will) exist. With slab this isn't a problem because sla has kmalloc kmem_caches for all supported allocation sizes. However, slub shows this issue for any kmalloc() allocations larger than 8KIB (at least on x86_64). It seems like a strange directly to take kmem accounting to say that kmalloc allocations are kmem limited, but only if they are either less than a threshold size or done with slab. Simply increasing the size of a data structure doesn't seem like it should automatically cause the memory to become exempt from kmem limits. > In fact, do we actually need to charge every random kmem allocation? I > guess not. For instance, filesystems often allocate data shared among > all the FS users. It's wrong to charge such allocations to a particular > memcg, IMO. That said the next step is going to be adding a per kmem > cache flag specifying if allocations from this cache should be charged > so that accounting will work only for those caches that are marked so > explicitly. It's a question of what direction to approach kmem slab accounting from: either opt-out (as the code currently is), or opt-in (with per kmem_cache flags as you suggest). I agree that some structures end up being shared (e.g. filesystem block bit map structures). In an opt-out system these are charged to a memcg initially and remain charged there until the memcg is deleted at which point the shared objects are reparented to a shared location. While this isn't perfect, it's unclear if it's better or worse than analyzing each class of allocation and deciding if they should be opt'd-in. One could (though I'm not) make the case that even dentries are easily shareable between containers and thus shouldn't be accounted to a single memcg. But given user space's ability to DoS a machine with dentires, they should be accounted. > There is one more argument for removing kmalloc_large accounting - we > don't have an easy way to track such allocations, which prevents us from > reparenting kmemcg charges on css offline. 
Of course, we could link > kmalloc_large pages in some sort of per-memcg list which would allow us > to find them on css offline, but I don't think such a complication is > justified. I assume that reparenting of such non kmem_cache allocations (e.g. large kmalloc) is difficult because such pages refer to the memcg, which we're trying to delete and the memcg has no index of such pages. If such zombie memcg are undesirable, then an alternative to indexing the pages is to define a kmem context object which such large pages point to. The kmem context would be reparented without needing to adjust the individual large pages. But there are plenty of options. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
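A hedged sketch of the "kmem context object" idea mentioned at the end of that mail. None of these names exist in the kernel; the point is only that a large allocation would reference an intermediate object, so reparenting updates one pointer instead of hunting down every page.

    /* hypothetical indirection between a charged page and its memcg */
    struct kmem_context {
            struct mem_cgroup *memcg;       /* current owner */
            unsigned long nr_bytes;         /* total charge made through this context */
    };

    static void kmem_context_reparent(struct kmem_context *ctx,
                                      struct mem_cgroup *parent)
    {
            /* hypothetical: transfer ctx->nr_bytes from ctx->memcg's kmem
             * counter to parent's, then switch the owner; pages pointing
             * at ctx never need to be found or touched. */
            ctx->memcg = parent;
    }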
Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
On Wed, Mar 26 2014, Vladimir Davydov wrote: > We don't track any random page allocation, so we shouldn't track kmalloc > that falls back to the page allocator. This seems like a change which will leads to confusing (and arguably improper) kernel behavior. I prefer the behavior prior to this patch. Before this change both of the following allocations are charged to memcg (assuming kmem accounting is enabled): a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL) b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL) After this change only 'a' is charged; 'b' goes directly to page allocator which no longer does accounting. > Signed-off-by: Vladimir Davydov > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Glauber Costa > Cc: Christoph Lameter > Cc: Pekka Enberg > --- > include/linux/slab.h |2 +- > mm/memcontrol.c | 27 +-- > mm/slub.c|4 ++-- > 3 files changed, 4 insertions(+), 29 deletions(-) > > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 3dd389aa91c7..8a928ff71d93 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -363,7 +363,7 @@ kmalloc_order(size_t size, gfp_t flags, unsigned int > order) > { > void *ret; > > - flags |= (__GFP_COMP | __GFP_KMEMCG); > + flags |= __GFP_COMP; > ret = (void *) __get_free_pages(flags, order); > kmemleak_alloc(ret, size, 1, flags); > return ret; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b4b6aef562fa..81a162d01d4d 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3528,35 +3528,10 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct > mem_cgroup **_memcg, int order) > > *_memcg = NULL; > > - /* > - * Disabling accounting is only relevant for some specific memcg > - * internal allocations. Therefore we would initially not have such > - * check here, since direct calls to the page allocator that are marked > - * with GFP_KMEMCG only happen outside memcg core. We are mostly > - * concerned with cache allocations, and by having this test at > - * memcg_kmem_get_cache, we are already able to relay the allocation to > - * the root cache and bypass the memcg cache altogether. > - * > - * There is one exception, though: the SLUB allocator does not create > - * large order caches, but rather service large kmallocs directly from > - * the page allocator. Therefore, the following sequence when backed by > - * the SLUB allocator: > - * > - * memcg_stop_kmem_account(); > - * kmalloc() > - * memcg_resume_kmem_account(); > - * > - * would effectively ignore the fact that we should skip accounting, > - * since it will drive us directly to this function without passing > - * through the cache selector memcg_kmem_get_cache. Such large > - * allocations are extremely rare but can happen, for instance, for the > - * cache arrays. We bring this test here. 
> - */ > - if (!current->mm || current->memcg_kmem_skip_account) > + if (!current->mm) > return true; > > memcg = get_mem_cgroup_from_mm(current->mm); > - > if (!memcg_can_account_kmem(memcg)) { > css_put(>css); > return true; > diff --git a/mm/slub.c b/mm/slub.c > index 5e234f1f8853..c2e58a787443 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -3325,7 +3325,7 @@ static void *kmalloc_large_node(size_t size, gfp_t > flags, int node) > struct page *page; > void *ptr = NULL; > > - flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG; > + flags |= __GFP_COMP | __GFP_NOTRACK; > page = alloc_pages_node(node, flags, get_order(size)); > if (page) > ptr = page_address(page); > @@ -3395,7 +3395,7 @@ void kfree(const void *x) > if (unlikely(!PageSlab(page))) { > BUG_ON(!PageCompound(page)); > kfree_hook(x); > - __free_memcg_kmem_pages(page, compound_order(page)); > + __free_pages(page, compound_order(page)); > return; > } > slab_free(page->slab_cache, page, object, _RET_IP_); > -- > 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
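To make the objection concrete, the two calls from the example above behave differently once this patch is applied to a SLUB kernel (the exact KMALLOC_MAX_CACHE_SIZE value depends on configuration):

    void *a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL);
    void *b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL);
    /*
     * 'a' is served by a kmalloc cache and is still charged to the
     * caller's memcg; 'b' takes the kmalloc_large() fallback, which
     * now uses plain __get_free_pages() and is no longer charged.
     */
    kfree(a);
    kfree(b);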
Re: [RFC 0/4] memcg: Low-limit reclaim
On Mon, Feb 03 2014, Michal Hocko wrote: > On Thu 30-01-14 16:28:27, Greg Thelen wrote: >> On Thu, Jan 30 2014, Michal Hocko wrote: >> >> > On Wed 29-01-14 11:08:46, Greg Thelen wrote: >> > [...] >> >> The series looks useful. We (Google) have been using something similar. >> >> In practice such a low_limit (or memory guarantee), doesn't nest very >> >> well. >> >> >> >> Example: >> >> - parent_memcg: limit 500, low_limit 500, usage 500 >> >> 1 privately charged non-reclaimable page (e.g. mlock, slab) >> >> - child_memcg: limit 500, low_limit 500, usage 499 >> > >> > I am not sure this is a good example. Your setup basically say that no >> > single page should be reclaimed. I can imagine this might be useful in >> > some cases and I would like to allow it but it sounds too extreme (e.g. >> > a load which would start trashing heavily once the reclaim starts and it >> > makes more sense to start it again rather than crowl - think about some >> > mathematical simulation which might diverge). >> >> Pages will still be reclaimed the usage_in_bytes is exceeds >> limit_in_bytes. I see the low_limit as a way to tell the kernel: don't >> reclaim my memory due to external pressure, but internal pressure is >> different. > > That sounds strange and very confusing to me. What if the internal > pressure comes from children memcgs? Lowlimit is intended for protecting > a group from reclaim and it shouldn't matter whether the reclaim is a > result of the internal or external pressure. > >> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up >> >> page cache it will lead to an oom kill instead of reclaiming. >> > >> > Does it make any sense to protect all of such memory although it is >> > easily reclaimable? >> >> I think protection makes sense in this case. If I know my workload >> needs 500 to operate well, then I reserve 500 using low_limit. My app >> doesn't want to run with less than its reservation. >> >> >> One could argue that this is working as intended because child_memcg >> >> was promised 500 but can only get 499. So child_memcg is oom killed >> >> rather than being forced to operate below its promised low limit. >> >> >> >> This has led to various internal workarounds like: >> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg); >> >> only charge memory to cgroup leafs. This gets tricky when dealing >> >> with reparented memory inherited to parent from child during cgroup >> >> deletion. >> > >> > Do those need any protection at all? >> >> Interior tree nodes don't need protection from their children. But >> children and interior nodes need protection from siblings and parents. > > Why? They contains only reparented pages in the above case. Those would > be #1 candidate for reclaim in most cases, no? I think we're on the same page. My example interior node has reclaimed pages and is a #1 candidate for reclaim induced from charges against parent_memcg, but not a candidate for reclaim due to global memory pressure induced by a sibling of parent_memcg. >> >> - don't set low_limit on non leafs (e.g. do not set low limit on >> >> parent_memcg). This constrains the cgroup layout a bit. Some >> >> customers want to purchase $MEM and setup their workload with a few >> >> child cgroups. A system daemon hands out $MEM by setting low_limit >> >> for top-level containers (e.g. parent_memcg). Thereafter such >> >> customers are able to partition their workload with sub memcg below >> >> child_memcg. 
Example: >> >> parent_memcg >> >> \ >> >> child_memcg >> >> / \ >> >> server backup >> > >> > I think that the low_limit makes sense where you actually want to >> > protect something from reclaim. And backup sounds like a bad fit for >> > that. >> >> The backup job would presumably have a small low_limit, but it may still >> have a minimum working set required to make useful forward progress. >> >> Example: >> parent_memcg >> \ >> child_memcg limit 500, low_limit 500, usage 500 >> / \ >> | backup limit 10, low_limit 10, usage 10 >> | >> server limit 490, low_limit 490, usage 490 >> >> One could argue that problems appear when server.low_limit+backup.lower_limit=child_memcg.limit. So the safer >> configuration is leave some padding: >> server.low_limit + backup.low_limit + padding = child_memcg.limit >> but this just defers the problem. As memory is reparented into parent, then padding must grow. > Which all sounds like a drawback of internal vs. external pressure semantic which you have mentioned above. Huh? I probably confused matters with the internal vs external talk above. Forgetting about that, I'm happy with the following configuration assuming low_limit_fallback (ll_fallback) is eventually available. parent_memcg \ child_memcg limit 500, low_limit 500, usage 500, ll_fallback 0 / \ | backup limit 10, low_limit 10, usage 10, ll_fallback 1 | server limit 490, low_limit 490, usage 490, ll_fallback 1
Re: [RFC 0/4] memcg: Low-limit reclaim
On Thu, Jan 30 2014, Michal Hocko wrote: > On Wed 29-01-14 11:08:46, Greg Thelen wrote: > [...] >> The series looks useful. We (Google) have been using something similar. >> In practice such a low_limit (or memory guarantee), doesn't nest very >> well. >> >> Example: >> - parent_memcg: limit 500, low_limit 500, usage 500 >> 1 privately charged non-reclaimable page (e.g. mlock, slab) >> - child_memcg: limit 500, low_limit 500, usage 499 > > I am not sure this is a good example. Your setup basically say that no > single page should be reclaimed. I can imagine this might be useful in > some cases and I would like to allow it but it sounds too extreme (e.g. > a load which would start trashing heavily once the reclaim starts and it > makes more sense to start it again rather than crowl - think about some > mathematical simulation which might diverge). Pages will still be reclaimed the usage_in_bytes is exceeds limit_in_bytes. I see the low_limit as a way to tell the kernel: don't reclaim my memory due to external pressure, but internal pressure is different. >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up >> page cache it will lead to an oom kill instead of reclaiming. > > Does it make any sense to protect all of such memory although it is > easily reclaimable? I think protection makes sense in this case. If I know my workload needs 500 to operate well, then I reserve 500 using low_limit. My app doesn't want to run with less than its reservation. >> One could argue that this is working as intended because child_memcg >> was promised 500 but can only get 499. So child_memcg is oom killed >> rather than being forced to operate below its promised low limit. >> >> This has led to various internal workarounds like: >> - don't charge any memory to interior tree nodes (e.g. parent_memcg); >> only charge memory to cgroup leafs. This gets tricky when dealing >> with reparented memory inherited to parent from child during cgroup >> deletion. > > Do those need any protection at all? Interior tree nodes don't need protection from their children. But children and interior nodes need protection from siblings and parents. >> - don't set low_limit on non leafs (e.g. do not set low limit on >> parent_memcg). This constrains the cgroup layout a bit. Some >> customers want to purchase $MEM and setup their workload with a few >> child cgroups. A system daemon hands out $MEM by setting low_limit >> for top-level containers (e.g. parent_memcg). Thereafter such >> customers are able to partition their workload with sub memcg below >> child_memcg. Example: >> parent_memcg >> \ >> child_memcg >> / \ >> server backup > > I think that the low_limit makes sense where you actually want to > protect something from reclaim. And backup sounds like a bad fit for > that. The backup job would presumably have a small low_limit, but it may still have a minimum working set required to make useful forward progress. Example: parent_memcg \ child_memcg limit 500, low_limit 500, usage 500 / \ | backup limit 10, low_limit 10, usage 10 | server limit 490, low_limit 490, usage 490 One could argue that problems appear when server.low_limit+backup.lower_limit=child_memcg.limit. So the safer configuration is leave some padding: server.low_limit + backup.low_limit + padding = child_memcg.limit but this just defers the problem. As memory is reparented into parent, then padding must grow. >> Thereafter customers often want some weak isolation between server and >> backup. 
To avoid undesired oom kills the server/backup isolation is >> provided with a softer memory guarantee (e.g. soft_limit). The soft >> limit acts like the low_limit until priority becomes desperate. > > Johannes was already suggesting that the low_limit should allow for a > weaker semantic as well. I am not very much inclined to that but I can > leave with a knob which would say oom_on_lowlimit (on by default but > allowed to be set to 0). We would fallback to the full reclaim if > no groups turn out to be reclaimable. I like the strong semantic of your low_limit at least at level:1 cgroups (direct children of root). But I have also encountered situations where a strict guarantee is too strict and a mere preference is desirable. Perhaps the best plan is to continue with the proposed strict low_limit and eventually provide an additional mechanism which provides weaker guarantees (e.g. soft_limit or something else if soft_limit cannot be altered). These two would offer good support for a variety of use cases. I'm thinking of something along the lines of a mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg, struct mem_cgroup *root, int priority) helper that walks from the given memcg up toward the reclaim root and returns false whenever a group is still within its low limit.
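A minimal sketch of such an eligibility check, assuming the res_counter_low_limit_excess() helper from this series; the soft-limit fallback and the priority threshold are assumptions about how a weaker "until priority becomes desperate" guarantee could be wired in, not part of the posted patches.

    static bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
                                            struct mem_cgroup *root,
                                            int priority)
    {
            do {
                    if (memcg == root)
                            break;
                    /* hard guarantee: never reclaim a group under its low limit */
                    if (!res_counter_low_limit_excess(&memcg->res))
                            return false;
                    /* assumed weaker guarantee: honor the soft limit too,
                     * until reclaim priority becomes desperate */
                    if (priority > DEF_PRIORITY - 2 &&
                        !res_counter_soft_limit_excess(&memcg->res))
                            return false;
            } while ((memcg = parent_mem_cgroup(memcg)));

            return true;
    }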
Re: [RFC 0/4] memcg: Low-limit reclaim
On Wed, Dec 11 2013, Michal Hocko wrote: > Hi, > previous discussions have shown that soft limits cannot be reformed > (http://lwn.net/Articles/555249/). This series introduces an alternative > approach to protecting memory allocated to processes executing within > a memory cgroup controller. It is based on a new tunable that was > discussed with Johannes and Tejun held during the last kernel summit. > > This patchset introduces such low limit that is functionally similar to a > minimum guarantee. Memcgs which are under their lowlimit are not considered > eligible for the reclaim (both global and hardlimit). The default value of > the limit is 0 so all groups are eligible by default and an interested > party has to explicitly set the limit. > > The primary use case is to protect an amount of memory allocated to a > workload without it being reclaimed by an unrelated activity. In some > cases this requirement can be fulfilled by mlock but it is not suitable > for many loads and generally requires application awareness. Such > application awareness can be complex. It effectively forbids the > use of memory overcommit as the application must explicitly manage > memory residency. > With low limits, such workloads can be placed in a memcg with a low > limit that protects the estimated working set. > > Another use case might be unreclaimable groups. Some loads might be so > sensitive to reclaim that it is better to kill and start it again (or > since checkpoint) rather than trash. This would be trivial with low > limit set to unlimited and the OOM killer will handle the situation as > required (e.g. kill and restart). > > The hierarchical behavior of the lowlimit is described in the first > patch. It is followed by a direct reclaim fix which is necessary to > handle situation when a no group is eligible because all groups are > below low limit. This is not a big deal for hardlimit reclaim because > we simply retry the reclaim few times and then trigger memcg OOM killer > path. It would blow up in the global case when we would loop without > doing any progress or trigger OOM killer. I would consider configuration > leading to this state invalid but we should handle that gracefully. > > The third patch finally allows setting the lowlimit. > > The last patch tries expedites OOM if it is clear that no group is > eligible for reclaim. It basically breaks out of loops in the direct > reclaim and lets kswapd sleep because it wouldn't do any progress anyway. > > Thoughts? > > Short log says: > Michal Hocko (4): > memcg, mm: introduce lowlimit reclaim > mm, memcg: allow OOM if no memcg is eligible during direct reclaim > memcg: Allow setting low_limit > mm, memcg: expedite OOM if no memcg is reclaimable > > And a diffstat > include/linux/memcontrol.h | 14 +++ > include/linux/res_counter.h | 40 ++ > kernel/res_counter.c| 2 ++ > mm/memcontrol.c | 60 > - > mm/vmscan.c | 59 +--- > 5 files changed, 170 insertions(+), 5 deletions(-) The series looks useful. We (Google) have been using something similar. In practice such a low_limit (or memory guarantee), doesn't nest very well. Example: - parent_memcg: limit 500, low_limit 500, usage 500 1 privately charged non-reclaimable page (e.g. mlock, slab) - child_memcg: limit 500, low_limit 500, usage 499 If a streaming file cache workload (e.g. sha1sum) starts gobbling up page cache it will lead to an oom kill instead of reclaiming. One could argue that this is working as intended because child_memcg was promised 500 but can only get 499. 
So child_memcg is oom killed rather than being forced to operate below its promised low limit. This has led to various internal workarounds like: - don't charge any memory to interior tree nodes (e.g. parent_memcg); only charge memory to cgroup leaves. This gets tricky when dealing with memory that is reparented from a child to its parent during cgroup deletion. - don't set low_limit on non-leaf memcgs (e.g. do not set a low limit on parent_memcg). This constrains the cgroup layout a bit. Some customers want to purchase $MEM and set up their workload with a few child cgroups. A system daemon hands out $MEM by setting low_limit for top-level containers (e.g. parent_memcg). Thereafter such customers are able to partition their workload with sub memcgs below child_memcg. Example hierarchy: parent_memcg -> child_memcg -> {server, backup}. Customers often also want some weak isolation between server and backup. To avoid undesired oom kills the server/backup isolation is provided with a softer memory guarantee (e.g. soft_limit). The soft limit acts like the low_limit until reclaim priority becomes desperate. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a
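For illustration, the kind of layout described above can be sketched with a few commands; the memory.low_limit_in_bytes file name is only an assumption about how this RFC exposes the knob, and the sizes are arbitrary:

    $ cd /sys/fs/cgroup/memory
    $ mkdir -p parent_memcg/child_memcg
    $ echo 500M > parent_memcg/memory.limit_in_bytes
    $ echo 500M > parent_memcg/memory.low_limit_in_bytes
    $ echo 500M > parent_memcg/child_memcg/memory.limit_in_bytes
    $ echo 500M > parent_memcg/child_memcg/memory.low_limit_in_bytes
    # One page charged directly to parent_memcg (mlock, slab, reparented
    # memory) now leaves child_memcg unable to reach its promised amount,
    # so a streaming file cache workload in child_memcg oom kills instead
    # of reclaiming.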
Re: [PATCH] ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races
On Tue, Dec 17 2013, Rafael Aquini wrote: > After the locking semantics for the SysV IPC API got improved, a couple of > IPC_RMID race windows were opened because we ended up dropping the > 'kern_ipc_perm.deleted' check performed way down in ipc_lock(). > The spotted races got sorted out by re-introducing the old test within > the racy critical sections. > > This patch introduces ipc_valid_object() to consolidate the way we cope with > IPC_RMID races by using the same abstraction across the API implementation. > > Signed-off-by: Rafael Aquini <aqu...@redhat.com> Acked-by: Greg Thelen <gthe...@google.com> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
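For readers who have not seen the patch itself, the consolidated check is tiny; a minimal sketch of the helper and of the call pattern it replaces, assuming it simply wraps the existing 'kern_ipc_perm.deleted' test and is called with the ipc object lock held:

    /* sketch: false once IPC_RMID has marked the object for deletion */
    static inline bool ipc_valid_object(struct kern_ipc_perm *perm)
    {
            return !perm->deleted;
    }

    /* typical caller, with ipc_lock_object(&shp->shm_perm) already taken */
    if (!ipc_valid_object(&shp->shm_perm)) {
            err = -EIDRM;
            goto out_unlock;
    }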
[PATCH] ipc,shm: fix shm_file deletion races
When IPC_RMID races with other shm operations there's potential for use-after-free of the shm object's associated file (shm_file). Here's the race before this patch: TASK 1 TASK 2 -- -- shm_rmid() ipc_lock_object() shmctl() shp = shm_obtain_object_check() shm_destroy() shm_unlock() fput(shp->shm_file) ipc_lock_object() shmem_lock(shp->shm_file) <OOPS> The oops is caused because shm_destroy() calls fput() after dropping the ipc_lock. fput() clears the file's f_inode, f_path.dentry, and f_path.mnt, which causes various NULL pointer references in task 2. I reliably see the oops in task 2 if with shmlock, shmu This patch fixes the races by: 1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock(). 2) modify at-risk operations to check shm_file while holding ipc_object_lock(). Example workloads, which each trigger oops... Workload 1: while true; do id=$(shmget 1 4096) shm_rmid $id & shmlock $id & wait done The oops stack shows accessing NULL f_inode due to racing fput: _raw_spin_lock shmem_lock SyS_shmctl Workload 2: while true; do id=$(shmget 1 4096) shmat $id 4096 & shm_rmid $id & wait done The oops stack is similar to workload 1 due to NULL f_inode: touch_atime shmem_mmap shm_mmap mmap_region do_mmap_pgoff do_shmat SyS_shmat Workload 3: while true; do id=$(shmget 1 4096) shmlock $id shm_rmid $id & shmunlock $id & wait done The oops stack shows the second fput tripping on a NULL f_inode. The first fput() completed via shm_destroy(), but a racing thread did a get_file() and queued this fput(): locks_remove_flock __fput fput task_work_run do_notify_resume int_signal Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat") Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl") Signed-off-by: Greg Thelen <gthe...@google.com> Cc: <sta...@vger.kernel.org> # 3.10.17+ 3.11.6+ --- ipc/shm.c | 28 +++- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/ipc/shm.c b/ipc/shm.c index d69739610fd4..0bdf21c6814e 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -208,15 +208,18 @@ static void shm_open(struct vm_area_struct *vma) */ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp) { + struct file *shm_file; + + shm_file = shp->shm_file; + shp->shm_file = NULL; ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT; shm_rmid(ns, shp); shm_unlock(shp); - if (!is_file_hugepages(shp->shm_file)) - shmem_lock(shp->shm_file, 0, shp->mlock_user); + if (!is_file_hugepages(shm_file)) + shmem_lock(shm_file, 0, shp->mlock_user); else if (shp->mlock_user) - user_shm_unlock(file_inode(shp->shm_file)->i_size, - shp->mlock_user); - fput (shp->shm_file); + user_shm_unlock(file_inode(shm_file)->i_size, shp->mlock_user); + fput(shm_file); ipc_rcu_putref(shp, shm_rcu_free); } @@ -983,6 +986,13 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) } shm_file = shp->shm_file; + + /* check if shm_destroy() is tearing down shp */ + if (shm_file == NULL) { + err = -EIDRM; + goto out_unlock0; + } + if (is_file_hugepages(shm_file)) goto out_unlock0; @@ -1101,6 +1111,14 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr, goto out_unlock; ipc_lock_object(&shp->shm_perm); + + /* check if shm_destroy() is tearing down shp */ + if (shp->shm_file == NULL) { + ipc_unlock_object(&shp->shm_perm); + err = -EIDRM; + goto out_unlock; + } + path = shp->shm_file->f_path; path_get(&path); shp->shm_nattch++; -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
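The shmget/shmlock/shm_rmid/shmunlock commands in the workloads above are not standard utilities; they read as thin wrappers around the corresponding syscalls. A minimal sketch of such a wrapper for the SHM_LOCK case (hypothetical helper, error handling trimmed):

    /* shmlock.c - lock the SysV shm segment whose id is given as argv[1] */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(int argc, char **argv)
    {
            int id = atoi(argv[1]);

            /* SHM_LOCK ignores the shmid_ds argument, so NULL is fine */
            if (shmctl(id, SHM_LOCK, NULL) != 0) {
                    perror("shmctl(SHM_LOCK)");
                    return 1;
            }
            return 0;
    }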
Re: [PATCH v2 1/3] percpu: add test module for various percpu operations
On Mon, Nov 04 2013, Andrew Morton wrote: > On Sun, 27 Oct 2013 10:30:15 -0700 Greg Thelen wrote: > >> Tests various percpu operations. > > Could you please take a look at the 32-bit build (this is i386): > > lib/percpu_test.c: In function 'percpu_test_init': > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type I was using gcc 4.6 which apparently adds LL suffix as needed. Though there were some other code problems with 32 bit beyond missing suffixes. Fixed version below tested with both gcc 4.4 and gcc 4.6 on 32 and 64 bit x86. ---8<--- >From a95bb1ce42b4492644fa10c7c80fd9bbd7bf23b9 Mon Sep 17 00:00:00 2001 In-Reply-To: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org> References: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org> From: Greg Thelen Date: Sun, 27 Oct 2013 10:30:15 -0700 Subject: [PATCH v2] percpu: add test module for various percpu operations Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. 
Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- Changelog since v1: - use %lld/x which allows for less casting - fix 32 bit build by casting large constants lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 094f3152ec2b..1891eb271adf 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate "Per cpu operations test" + depends on m && DEBUG_KERNEL + help + Enable this option to build test module which validates per-cpu + operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool "Perform an atomic64_t self-test at boot" help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb98adf..bb016e116ba4 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index ..0b5d14dadd1a --- /dev/
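The 32-bit failure mode is easy to see outside the kernel; a small illustration (not taken from the patch) of why an unsuffixed 64-bit constant triggers that warning with an older gcc such as 4.4, together with the two workarounds the v2 changelog mentions:

    /* constants.c - try: gcc -m32 -c constants.c */
    unsigned long long a = 0xffffffff00000000;     /* older gcc: "integer constant is too large for 'long' type" */
    unsigned long long b = 0xffffffff00000000ULL;  /* explicit suffix: no warning                               */
    long c = (long)0xffffffff00000000ULL;          /* suffixed then cast, roughly what "casting large constants" means */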
[PATCH v2 3/3] memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages 1048576 == 0x10,0000 == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_add(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0xffff,ffff */ count is now 0x1,0000,0000 instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works once "percpu: fix this_cpu_sub() subtrahend casting for unsigneds" is applied to fix this_cpu_sub(). Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aa8185c..b7ace0f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_sub(from->stat->count[idx], nr_pages); __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); } -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
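The arithmetic above can be reproduced entirely in userspace; a minimal standalone sketch (not kernel code) of the same unsigned subtrahend problem on a 64 bit machine:

    #include <stdio.h>

    int main(void)
    {
            long count = 1;            /* stands in for stat->count[idx] */
            unsigned int nr_pages = 1;

            count += -nr_pages;        /* -nr_pages is unsigned int 0xffffffff */
            printf("add negation: 0x%lx\n", count);   /* prints 0x100000000, not 0 */

            count = 1;
            count -= nr_pages;         /* subtract instead of adding the negation */
            printf("subtract:     0x%lx\n", count);   /* prints 0x0 */
            return 0;
    }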
[PATCH v2 1/3] percpu: add test module for various percpu operations
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 06344d9..9fdb452 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate "Per cpu operations test" + depends on m && DEBUG_KERNEL + help + Enable this option to build test module which validates per-cpu + operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool "Perform an atomic64_t self-test at boot" help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb..bb016e1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index 000..fcca49e --- /dev/null +++ b/lib/percpu_test.c @@ -0,0 +1,138 @@ +#include + +/* validate @native and @pcp counter values match @expected */ +#define CHECK(native, pcp, expected)\ + do {\ + WARN((native) != (expected),\ +"raw %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)(native), (long)(native),\ +(long)(expected), (long)(expected)); \ + WARN(__this_cpu_read(pcp) != (expected),\ +"pcp %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \ +(long)(expected), (long)(expected)); \ + } while (0) + +static DEFINE_PER_CPU(long, long_counter); +static DEFINE_PER_CPU(unsigned long, ulong_counter); + +static int __init percpu_test_init(void) +{ + /* +* volatile prevents compiler from optimizing it uses, otherwise the +* +ul_one and -ul_one below would replace with inc/dec instructions. 
+*/ + volatile unsigned int ui_one = 1; + long l = 0; + unsigned long ul = 0; + + pr_info("percpu test start\n"); + + preempt_disable(); + + l += -1; + __this_cpu_add(long_counter, -1); + CHECK(l, long_counter, -1); + + l += 1; + __this_cpu_add(long_counter, 1); + CHECK(l, long_counter, 0); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += 1UL; + __this_cpu_add(ulong_counter, 1UL); + CHECK(ul, ulong_counter, 1); + + ul += -1UL; + __this_cpu_add(ulong_counter, -1UL); + CHECK(ul, ulong_counter, 0); + + ul += -(unsigned long)1; + __this_cpu_add(ulong_counter, -(unsigned long)1); + CHECK(ul, ulong_counter, -1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= 1; + __this_cpu_dec(ulong_counter); + CHECK(ul, ulong_counter, 0x); + CHECK(ul, ulong_counter, -1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 0x1); + + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + __this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + CHECK(l, long_counter, 0x); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += ui_one; + __this_cpu_add(ulong_counter, ui_one); + CHECK(ul, ulong_counter, 1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= ui_one; + __this_cpu_sub(ulong_counter, ui_one); + CHECK(ul, ulong_counter, -1); + CHECK(ul, ulong_counter, 0x); + + ul = 3; + __this_cpu_w
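For completeness, one way to exercise the module once the option is enabled; the commands are illustrative and the exact make invocation depends on the local tree and .config:

    $ make lib/percpu_test.ko        # with CONFIG_PERCPU_TEST=m already set
    $ sudo insmod lib/percpu_test.ko
    $ dmesg | tail                   # WARN backtraces here mean a CHECK() failed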
[PATCH v2 2/3] percpu: fix this_cpu_sub() subtrahend casting for unsigneds
this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0xffffffff Also apply the same cast to: __this_cpu_sub() __this_cpu_sub_return() this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0xffffffffffffffff); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0xffffffffffffffff); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h| 8 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) && \ - ((val) == 1 || (val) == -1)) ?
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
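The effect of the added cast can be mimicked in plain C; a small userspace sketch (not the kernel macros themselves, gcc typeof extension assumed) of the old and new expansions for an unsigned int adjustment against a long counter on a 64 bit machine:

    #include <stdio.h>

    int main(void)
    {
            long pcp = 0;
            unsigned int val = 1;

            /* old expansion: this_cpu_add(pcp, -(val)) */
            pcp = pcp + -val;                /* adds 0xffffffff */
            printf("old: 0x%lx\n", pcp);     /* 0x100000000 */

            /* new expansion: this_cpu_add(pcp, -(typeof(pcp))(val)) */
            pcp = 0;
            pcp = pcp + -(typeof(pcp))val;   /* val widened to long before negation */
            printf("new: 0x%lx\n", pcp);     /* 0xffffffffffffffff, i.e. -1 */
            return 0;
    }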
[PATCH v2 0/3] fix unsigned pcp adjustments
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic values because the negated nr_pages is not sign extended (counter is long, nr_pages is unsigned int). The memcg fix is __this_cpu_sub(counter, nr_pages). But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was implemented as __this_cpu_add(counter, -nr_pages) which suffers the same problem. Example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to: long_counter = 0 + 0xffffffff v3.12-rc6 shows that only new memcg code is affected by this problem - the new mem_cgroup_move_account_page_stat() is the only place where an unsigned adjustment is used. All other callers (e.g. shrink_dcache_sb) already use a signed adjustment, so no problems before v3.12. Though I did not audit the stable kernel trees, so there could be something hiding in there. Patch 1 creates a test module for percpu operations which demonstrates the __this_cpu_sub() problems. This patch is independent and can be discarded if there is no interest. Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments. Patch 3 uses __this_cpu_sub() in memcg. An alternative smaller solution is for memcg to use: __this_cpu_add(counter, -(int)nr_pages) admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments. But I felt like fixing the core services to prevent this in the future. Changes from V1: - more accurate patch titles, patch logs, and test module description now referring to per cpu operations rather than per cpu counters. - move small test code update from patch 2 to patch 1 (where the test is introduced). Greg Thelen (3): percpu: add test module for various percpu operations percpu: fix this_cpu_sub() subtrahend casting for unsigneds memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting arch/x86/include/asm/percpu.h | 3 +- include/linux/percpu.h| 8 +-- lib/Kconfig.debug | 9 +++ lib/Makefile | 2 + lib/percpu_test.c | 138 ++ mm/memcontrol.c | 2 +- 6 files changed, 156 insertions(+), 6 deletions(-) create mode 100644 lib/percpu_test.c -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
On Sun, Oct 27 2013, Greg Thelen wrote: > this_cpu_sub() is implemented as negation and addition. > > This patch casts the adjustment to the counter type before negation to > sign extend the adjustment. This helps in cases where the counter > type is wider than an unsigned adjustment. An alternative to this > patch is to declare such operations unsupported, but it seemed useful > to avoid surprises. > > This patch specifically helps the following example: > unsigned int delta = 1 > preempt_disable() > this_cpu_write(long_counter, 0) > this_cpu_sub(long_counter, delta) > preempt_enable() > > Before this change long_counter on a 64 bit machine ends with value > 0x, rather than 0x. This is because > this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), > which is basically: > long_counter = 0 + 0x > > Also apply the same cast to: > __this_cpu_sub() > this_cpu_sub_return() > and __this_cpu_sub_return() > > All percpu_test.ko passes, especially the following cases which > previously failed: > > l -= ui_one; > __this_cpu_sub(long_counter, ui_one); > CHECK(l, long_counter, -1); > > l -= ui_one; > this_cpu_sub(long_counter, ui_one); > CHECK(l, long_counter, -1); > CHECK(l, long_counter, 0x); > > ul -= ui_one; > __this_cpu_sub(ulong_counter, ui_one); > CHECK(ul, ulong_counter, -1); > CHECK(ul, ulong_counter, 0x); > > ul = this_cpu_sub_return(ulong_counter, ui_one); > CHECK(ul, ulong_counter, 2); > > ul = __this_cpu_sub_return(ulong_counter, ui_one); > CHECK(ul, ulong_counter, 1); > > Signed-off-by: Greg Thelen > --- > arch/x86/include/asm/percpu.h | 3 ++- > include/linux/percpu.h| 8 > lib/percpu_test.c | 2 +- > 3 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h > index 0da5200..b3e18f8 100644 > --- a/arch/x86/include/asm/percpu.h > +++ b/arch/x86/include/asm/percpu.h > @@ -128,7 +128,8 @@ do { > \ > do { \ > typedef typeof(var) pao_T__;\ > const int pao_ID__ = (__builtin_constant_p(val) && \ > - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ > + ((val) == 1 || (val) == -1)) ?\ > + (int)(val) : 0; \ > if (0) {\ > pao_T__ pao_tmp__; \ > pao_tmp__ = (val); \ > diff --git a/include/linux/percpu.h b/include/linux/percpu.h > index cc88172..c74088a 100644 > --- a/include/linux/percpu.h > +++ b/include/linux/percpu.h > @@ -332,7 +332,7 @@ do { > \ > #endif > > #ifndef this_cpu_sub > -# define this_cpu_sub(pcp, val) this_cpu_add((pcp), -(val)) > +# define this_cpu_sub(pcp, val) this_cpu_add((pcp), > -(typeof(pcp))(val)) > #endif > > #ifndef this_cpu_inc > @@ -418,7 +418,7 @@ do { > \ > # define this_cpu_add_return(pcp, val) > __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) > #endif > > -#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(val)) > +#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, > -(typeof(pcp))(val)) > #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) > #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) > > @@ -586,7 +586,7 @@ do { > \ > #endif > > #ifndef __this_cpu_sub > -# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(val)) > +# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), > -(typeof(pcp))(val)) > #endif > > #ifndef __this_cpu_inc > @@ -668,7 +668,7 @@ do { > \ > __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) > #endif > > -#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, > -(val)) > +#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, > -(typeof(pcp))(val)) > #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) > #define __this_cpu_d
Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
On Sun, Oct 27 2013, Tejun Heo wrote: > On Sun, Oct 27, 2013 at 05:04:29AM -0700, Andrew Morton wrote: >> On Sun, 27 Oct 2013 07:22:55 -0400 Tejun Heo wrote: >> >> > We probably want to cc stable for this and the next one. How should >> > these be routed? I can take these through percpu tree or mm works >> > too. Either way, it'd be best to route them together. >> >> Yes, all three look like -stable material to me. I'll grab them later >> in the week if you haven't ;) > > Tried to apply to percpu but the third one is a fix for a patch which > was added to -mm during v3.12-rc1, so these are yours. :) I don't object to stable for the first two non-memcg patches, but it's probably unnecessary. I should have made it more clear, but an audit of v3.12-rc6 shows that only new memcg code is affected - the new mem_cgroup_move_account_page_stat() is the only place where an unsigned adjustment is used. All other callers (e.g. shrink_dcache_sb) already use a signed adjustment, so no problems before v3.12. Though I did not audit the stable kernel trees, so there could be something hiding in there. >> The names of the first two patches distress me. They rather clearly >> assert that the code affects percpu_counter.[ch], but that is not the case. >> Massaging is needed to fix that up. > Yeah, something like the following would be better > > percpu: add test module for various percpu operations > percpu: fix this_cpu_sub() subtrahend casting for unsigneds > memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend > casting No objection to renaming. Let me know if you want these reposted with updated titles. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] memcg: use __this_cpu_sub to decrement stats
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages 1048576 == 0x10,0000 == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_add(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0xffff,ffff */ count is now 0x1,0000,0000 instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works with the "percpu counter: cast this_cpu_sub() adjustment" patch which fixes this_cpu_sub(). Signed-off-by: Greg Thelen --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aa8185c..b7ace0f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_sub(from->stat->count[idx], nr_pages); + __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); } -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/3] fix unsigned pcp adjustments
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic values because the negated nr_pages is not sign extended (counter is long, nr_pages is unsigned int). The memcg fix is __this_cpu_sub(counter, nr_pages). But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was implemented as __this_cpu_add(counter, -nr_pages) which suffers the same problem. Example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to: long_counter = 0 + 0xffffffff Patch 1 creates a test module for percpu counters operations which demonstrates the __this_cpu_sub() problems. This patch is independent and can be discarded if there is no interest. Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments. Patch 3 uses __this_cpu_sub() in memcg. An alternative smaller solution is for memcg to use: __this_cpu_add(counter, -(int)nr_pages) admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments. But I felt like fixing the core services to prevent this in the future. Greg Thelen (3): percpu counter: test module percpu counter: cast this_cpu_sub() adjustment memcg: use __this_cpu_sub to decrement stats arch/x86/include/asm/percpu.h | 3 +- include/linux/percpu.h| 8 +-- lib/Kconfig.debug | 9 +++ lib/Makefile | 2 + lib/percpu_test.c | 138 ++ mm/memcontrol.c | 2 +- 6 files changed, 156 insertions(+), 6 deletions(-) create mode 100644 lib/percpu_test.c -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0xffffffff Also apply the same cast to: __this_cpu_sub() this_cpu_sub_return() and __this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0xffffffffffffffff); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0xffffffffffffffff); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h| 8 lib/percpu_test.c | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) && \ - ((val) == 1 || (val) == -1)) ?
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) diff --git a/lib/percpu_test.c b/lib/percpu_test.c index 1ebeb44..8ab4231 100644 --- a/lib/percpu_test.c +++ b/lib/percpu_test.c @@ -118,7 +118,7 @@ static int __init percpu_test_init(void) CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); - CHECK(ul, ulong_counter, 0); + CHECK(ul, ulong_counter, 1); preempt_enable(); -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More major
[PATCH 1/3] percpu counter: test module
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen --- lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 06344d9..cee589d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate "Per cpu counter test" + depends on m && DEBUG_KERNEL + help + Enable this option to build test modules with validates per-cpu + counter operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool "Perform an atomic64_t self-test at boot" help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb..bb016e1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index 000..1ebeb44 --- /dev/null +++ b/lib/percpu_test.c @@ -0,0 +1,138 @@ +#include + +/* validate @native and @pcp counter values match @expected */ +#define CHECK(native, pcp, expected)\ + do {\ + WARN((native) != (expected),\ +"raw %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)(native), (long)(native),\ +(long)(expected), (long)(expected)); \ + WARN(__this_cpu_read(pcp) != (expected),\ +"pcp %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \ +(long)(expected), (long)(expected)); \ + } while (0) + +static DEFINE_PER_CPU(long, long_counter); +static DEFINE_PER_CPU(unsigned long, ulong_counter); + +static int __init percpu_test_init(void) +{ + /* +* volatile prevents compiler from optimizing it uses, otherwise the +* +ul_one and -ul_one below would replace with inc/dec instructions. 
+*/ + volatile unsigned int ui_one = 1; + long l = 0; + unsigned long ul = 0; + + pr_info("percpu test start\n"); + + preempt_disable(); + + l += -1; + __this_cpu_add(long_counter, -1); + CHECK(l, long_counter, -1); + + l += 1; + __this_cpu_add(long_counter, 1); + CHECK(l, long_counter, 0); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += 1UL; + __this_cpu_add(ulong_counter, 1UL); + CHECK(ul, ulong_counter, 1); + + ul += -1UL; + __this_cpu_add(ulong_counter, -1UL); + CHECK(ul, ulong_counter, 0); + + ul += -(unsigned long)1; + __this_cpu_add(ulong_counter, -(unsigned long)1); + CHECK(ul, ulong_counter, -1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= 1; + __this_cpu_dec(ulong_counter); + CHECK(ul, ulong_counter, 0x); + CHECK(ul, ulong_counter, -1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 0x1); + + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + __this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + CHECK(l, long_counter, 0x); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += ui_one; + __this_cpu_add(ulong_counter, ui_one); + CHECK(ul, ulong_counter, 1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= ui_one; + __this_cpu_sub(ulong_counter, ui_one); + CHECK(ul, ulong_counter, -1); + CHECK(ul, ulong_counter, 0x); + + ul = 3; + __this_cpu_write(ulong_counter, 3)
[PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0x, rather than 0x. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0x Also apply the same cast to: __this_cpu_sub() this_cpu_sub_return() and __this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0x); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0x); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen gthe...@google.com --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h| 8 lib/percpu_test.c | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) \ - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) diff --git a/lib/percpu_test.c b/lib/percpu_test.c index 1ebeb44..8ab4231 100644 --- a/lib/percpu_test.c +++ b/lib/percpu_test.c @@ -118,7 +118,7 @@ static int __init percpu_test_init(void) CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); - CHECK(ul, ulong_counter, 0); + CHECK(ul, ulong_counter, 1); preempt_enable(); -- 1.8.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo
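The promotion rule behind this patch is easy to demonstrate outside the kernel. Below is a minimal userspace sketch (illustrative only, not part of the patch; the file and variable names are made up) of what the old and new macro expansions effectively compute for a 64-bit long counter:

/*
 * Minimal userspace sketch (not kernel code) of the expansion difference
 * on an LP64 (64-bit long) machine.
 * Build: gcc -o sub_demo sub_demo.c && ./sub_demo
 */
#include <stdio.h>

int main(void)
{
	long counter = 0;
	unsigned int delta = 1;

	/*
	 * Old expansion: negate the unsigned adjustment first.  -delta is
	 * still an unsigned int (0xffffffff), which is zero extended when
	 * added to the 64-bit counter.
	 */
	long buggy = counter + -delta;

	/*
	 * Patched expansion: cast to the counter's type before negating,
	 * so the adjustment is sign extended.
	 */
	long fixed = counter + -(long)delta;

	printf("buggy: 0x%lx\n", (unsigned long)buggy);	/* 0xffffffff */
	printf("fixed: 0x%lx\n", (unsigned long)fixed);	/* 0xffffffffffffffff */
	return 0;
}

Zero extension of the negated unsigned int is what leaves the pre-patch counter at a large positive value instead of -1.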
[PATCH 1/3] percpu counter: test module
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen gthe...@google.com --- lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 06344d9..cee589d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate Per cpu counter test + depends on m DEBUG_KERNEL + help + Enable this option to build test modules with validates per-cpu + counter operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool Perform an atomic64_t self-test at boot help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb..bb016e1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index 000..1ebeb44 --- /dev/null +++ b/lib/percpu_test.c @@ -0,0 +1,138 @@ +#include linux/module.h + +/* validate @native and @pcp counter values match @expected */ +#define CHECK(native, pcp, expected)\ + do {\ + WARN((native) != (expected),\ +raw %ld (0x%lx) != expected %ld (0x%lx), \ +(long)(native), (long)(native),\ +(long)(expected), (long)(expected)); \ + WARN(__this_cpu_read(pcp) != (expected),\ +pcp %ld (0x%lx) != expected %ld (0x%lx), \ +(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \ +(long)(expected), (long)(expected)); \ + } while (0) + +static DEFINE_PER_CPU(long, long_counter); +static DEFINE_PER_CPU(unsigned long, ulong_counter); + +static int __init percpu_test_init(void) +{ + /* +* volatile prevents compiler from optimizing it uses, otherwise the +* +ul_one and -ul_one below would replace with inc/dec instructions. 
+*/ + volatile unsigned int ui_one = 1; + long l = 0; + unsigned long ul = 0; + + pr_info(percpu test start\n); + + preempt_disable(); + + l += -1; + __this_cpu_add(long_counter, -1); + CHECK(l, long_counter, -1); + + l += 1; + __this_cpu_add(long_counter, 1); + CHECK(l, long_counter, 0); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += 1UL; + __this_cpu_add(ulong_counter, 1UL); + CHECK(ul, ulong_counter, 1); + + ul += -1UL; + __this_cpu_add(ulong_counter, -1UL); + CHECK(ul, ulong_counter, 0); + + ul += -(unsigned long)1; + __this_cpu_add(ulong_counter, -(unsigned long)1); + CHECK(ul, ulong_counter, -1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= 1; + __this_cpu_dec(ulong_counter); + CHECK(ul, ulong_counter, 0x); + CHECK(ul, ulong_counter, -1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 0x1); + + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + __this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + CHECK(l, long_counter, 0x); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += ui_one; + __this_cpu_add(ulong_counter, ui_one); + CHECK(ul, ulong_counter, 1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= ui_one; + __this_cpu_sub(ulong_counter, ui_one); + CHECK(ul, ulong_counter, -1); + CHECK(ul, ulong_counter, 0x); + + ul = 3; + __this_cpu_write(ulong_counter, 3); + + ul = this_cpu_sub_return
[PATCH 0/3] fix unsigned pcp adjustments
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic values because the negated nr_pages is not sign extended (counter is long, nr_pages is unsigned int). The memcg fix is __this_cpu_sub(counter, nr_pages). But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was implemented as __this_cpu_add(counter, -nr_pages), which suffers from the same problem.

Example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to:
  long_counter = 0 + 0xffffffff

Patch 1 creates a test module for percpu counter operations which demonstrates the __this_cpu_sub() problems. This patch is independent and can be discarded if there is no interest.

Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments.

Patch 3 uses __this_cpu_sub() in memcg.

An alternative smaller solution is for memcg to use __this_cpu_add(counter, -(int)nr_pages), admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments. But I felt like fixing the core services to prevent this in the future.

Greg Thelen (3):
  percpu counter: test module
  percpu counter: cast this_cpu_sub() adjustment
  memcg: use __this_cpu_sub to decrement stats

 arch/x86/include/asm/percpu.h |   3 +-
 include/linux/percpu.h        |   8 +--
 lib/Kconfig.debug             |   9 +++
 lib/Makefile                  |   2 +
 lib/percpu_test.c             | 138 ++
 mm/memcontrol.c               |   2 +-
 6 files changed, 156 insertions(+), 6 deletions(-)
 create mode 100644 lib/percpu_test.c

--
1.8.4.1
[PATCH 3/3] memcg: use __this_cpu_sub to decrement stats
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion.

An example showing error after memory.force_empty:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 > tasks
  $ echo 1 > x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000           == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters soon disappear with the entire memcg. But the memory.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg.

The problem is due to memcg use of __this_cpu_add(.., -nr_pages), which is subtly wrong because it applies the negated unsigned int nr_pages (either -1 or -512 for THP) to a signed long percpu counter without sign extension. When nr_pages=1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to:
  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its negation. This only works with the "percpu counter: cast this_cpu_sub() adjustment" patch which fixes this_cpu_sub().

Signed-off-by: Greg Thelen gthe...@google.com
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8185c..b7ace0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
 {
 	/* Update stat data for mem_cgroup */
 	preempt_disable();
-	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_sub(from->stat->count[idx], nr_pages);
 	__this_cpu_add(to->stat->count[idx], nr_pages);
 	preempt_enable();
 }
--
1.8.4.1
Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
On Sun, Oct 27 2013, Tejun Heo wrote: On Sun, Oct 27, 2013 at 05:04:29AM -0700, Andrew Morton wrote: On Sun, 27 Oct 2013 07:22:55 -0400 Tejun Heo t...@kernel.org wrote: We probably want to cc stable for this and the next one. How should these be routed? I can take these through percpu tree or mm works too. Either way, it'd be best to route them together. Yes, all three look like -stable material to me. I'll grab them later in the week if you haven't ;) Tried to apply to percpu but the third one is a fix for a patch which was added to -mm during v3.12-rc1, so these are yours. :) I don't object to stable for the first two non-memcg patches, but it's probably unnecessary. I should have made it more clear, but an audit of v3.12-rc6 shows that only new memcg code is affected - the new mem_cgroup_move_account_page_stat() is the only place where an unsigned adjustment is used. All other callers (e.g. shrink_dcache_sb) already use a signed adjustment, so no problems before v3.12. Though I did not audit the stable kernel trees, so there could be something hiding in there. The names of the first two patches distress me. They rather clearly assert that the code affects percpu_counter.[ch], but that is not the case. Massaging is needed to fix that up. Yeah, something like the following would be better percpu: add test module for various percpu operations percpu: fix this_cpu_sub() subtrahend casting for unsigneds memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting No objection to renaming. Let me know if you want these reposed with updated titles. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
On Sun, Oct 27 2013, Greg Thelen wrote: this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0x, rather than 0x. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0x Also apply the same cast to: __this_cpu_sub() this_cpu_sub_return() and __this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0x); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0x); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen gthe...@google.com --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h| 8 lib/percpu_test.c | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) \ - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val) this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val) this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) diff --git a/lib/percpu_test.c b/lib/percpu_test.c index 1ebeb44..8ab4231 100644 --- a/lib/percpu_test.c +++ b/lib/percpu_test.c @@ -118,7 +118,7 @@ static int __init percpu_test_init(void) CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); - CHECK(ul, ulong_counter, 0); + CHECK(ul, ulong_counter, 1); preempt_enable(); Oops. This update to percpu_test.c
[PATCH v2 0/3] fix unsigned pcp adjustments
As of v3.11-9444-g3ea67d0 memcg: add per cgroup writeback pages accounting memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic values because the negated nr_pages is not sign extended (counter is long, nr_pages is unsigned int). The memcg fix is __this_cpu_sub(counter, nr_pages). But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was implemented as __this_cpu_add(counter, -nr_pages) which suffers the same problem. Example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0x, rather than 0x. This is because this_cpu_sub(pcp, delta) boils down to: long_counter = 0 + 0x v3.12-rc6 shows that only new memcg code is affected by this problem - the new mem_cgroup_move_account_page_stat() is the only place where an unsigned adjustment is used. All other callers (e.g. shrink_dcache_sb) already use a signed adjustment, so no problems before v3.12. Though I did not audit the stable kernel trees, so there could be something hiding in there. Patch 1 creates a test module for percpu operations which demonstrates the __this_cpu_sub() problems. This patch is independent can be discarded if there is no interest. Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments. Patch 3 uses __this_cpu_sub() in memcg. An alternative smaller solution is for memcg to use: __this_cpu_add(counter, -(int)nr_pages) admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments. But I felt like fixing the core services to prevent this in the future. Changes from V1: - more accurate patch titles, patch logs, and test module description now referring to per cpu operations rather than per cpu counters. - move small test code update from patch 2 to patch 1 (where the test is introduced). Greg Thelen (3): percpu: add test module for various percpu operations percpu: fix this_cpu_sub() subtrahend casting for unsigneds memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting arch/x86/include/asm/percpu.h | 3 +- include/linux/percpu.h| 8 +-- lib/Kconfig.debug | 9 +++ lib/Makefile | 2 + lib/percpu_test.c | 138 ++ mm/memcontrol.c | 2 +- 6 files changed, 156 insertions(+), 6 deletions(-) create mode 100644 lib/percpu_test.c -- 1.8.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 2/3] percpu: fix this_cpu_sub() subtrahend casting for unsigneds
this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0x, rather than 0x. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0x Also apply the same cast to: __this_cpu_sub() __this_cpu_sub_return() this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0x); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0x); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen gthe...@google.com Acked-by: Tejun Heo t...@kernel.org --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h| 8 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) \ - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) -- 1.8.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/3] percpu: add test module for various percpu operations
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen gthe...@google.com Acked-by: Tejun Heo t...@kernel.org --- lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 06344d9..9fdb452 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate Per cpu operations test + depends on m DEBUG_KERNEL + help + Enable this option to build test module which validates per-cpu + operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool Perform an atomic64_t self-test at boot help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb..bb016e1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index 000..fcca49e --- /dev/null +++ b/lib/percpu_test.c @@ -0,0 +1,138 @@ +#include linux/module.h + +/* validate @native and @pcp counter values match @expected */ +#define CHECK(native, pcp, expected)\ + do {\ + WARN((native) != (expected),\ +raw %ld (0x%lx) != expected %ld (0x%lx), \ +(long)(native), (long)(native),\ +(long)(expected), (long)(expected)); \ + WARN(__this_cpu_read(pcp) != (expected),\ +pcp %ld (0x%lx) != expected %ld (0x%lx), \ +(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \ +(long)(expected), (long)(expected)); \ + } while (0) + +static DEFINE_PER_CPU(long, long_counter); +static DEFINE_PER_CPU(unsigned long, ulong_counter); + +static int __init percpu_test_init(void) +{ + /* +* volatile prevents compiler from optimizing it uses, otherwise the +* +ul_one and -ul_one below would replace with inc/dec instructions. 
+*/ + volatile unsigned int ui_one = 1; + long l = 0; + unsigned long ul = 0; + + pr_info(percpu test start\n); + + preempt_disable(); + + l += -1; + __this_cpu_add(long_counter, -1); + CHECK(l, long_counter, -1); + + l += 1; + __this_cpu_add(long_counter, 1); + CHECK(l, long_counter, 0); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += 1UL; + __this_cpu_add(ulong_counter, 1UL); + CHECK(ul, ulong_counter, 1); + + ul += -1UL; + __this_cpu_add(ulong_counter, -1UL); + CHECK(ul, ulong_counter, 0); + + ul += -(unsigned long)1; + __this_cpu_add(ulong_counter, -(unsigned long)1); + CHECK(ul, ulong_counter, -1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= 1; + __this_cpu_dec(ulong_counter); + CHECK(ul, ulong_counter, 0x); + CHECK(ul, ulong_counter, -1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 0x1); + + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + __this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + CHECK(l, long_counter, 0x); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += ui_one; + __this_cpu_add(ulong_counter, ui_one); + CHECK(ul, ulong_counter, 1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= ui_one; + __this_cpu_sub(ulong_counter, ui_one); + CHECK(ul, ulong_counter, -1); + CHECK(ul, ulong_counter, 0x); + + ul = 3; + __this_cpu_write(ulong_counter, 3
[PATCH v2 3/3] memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting
As of v3.11-9444-g3ea67d0 memcg: add per cgroup writeback pages accounting memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID x/tasks exec mmap_writer /data/tmp/file 1M) [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 tasks $ echo 1 x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,,, == 0x100,, pages 1048576 == 0x10, == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0x. On 64 bit machines stat-count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0x, */ count is now 0x1,, instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works once percpu: fix this_cpu_sub() subtrahend casting for unsigneds is applied to fix this_cpu_sub(). Signed-off-by: Greg Thelen gthe...@google.com Acked-by: Tejun Heo t...@kernel.org --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aa8185c..b7ace0f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - __this_cpu_add(from-stat-count[idx], -nr_pages); + __this_cpu_sub(from-stat-count[idx], nr_pages); __this_cpu_add(to-stat-count[idx], nr_pages); preempt_enable(); } -- 1.8.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RIP: mem_cgroup_move_account+0xf4/0x290
gt; [ 7691.500464] [] ? insert_kthread_work+0x40/0x40 >> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 >> 00 00 41 8b 4c 24 18 >> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 >> 01 00 00 44 89 c9 >> f7 d9 89 cf 65 48 01 7e >> [ 7691.528638] RIP [] mem_cgroup_move_account+0xf4/0x290 >> >> Add the required __this_cpu_read(). > > Sorry for my mistake and thanks for the fix up, it looks good to me. > > Reviewed-by: Sha Zhengju > > > Thanks, > Sha >> >> Signed-off-by: Johannes Weiner >> --- >> mm/memcontrol.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 4097a78..a4864b6 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct >> mem_cgroup *from, >> { >> /* Update stat data for mem_cgroup */ >> preempt_disable(); >> -WARN_ON_ONCE(from->stat->count[idx] < nr_pages); >> +WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages); >> __this_cpu_add(from->stat->count[idx], -nr_pages); >> __this_cpu_add(to->stat->count[idx], nr_pages); >> preempt_enable(); > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org;> em...@kvack.org I was just polishing up this following patch which I think is better because it should avoid spurious warnings. ---8<--- >From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Fri, 25 Oct 2013 21:59:57 -0700 Subject: [PATCH] memcg: remove incorrect underflow check When a memcg is deleted mem_cgroup_reparent_charges() moves charged memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" there's bad pointer read. The goal was to check for counter underflow. The counter is a per cpu counter and there are two problems with the code: (1) per cpu access function isn't used, instead a naked pointer is used which easily causes panic. (2) the check doesn't sum all cpus Test: $ cd /sys/fs/cgroup/memory $ mkdir x $ echo 3 > /proc/sys/vm/drop_caches $ (echo $BASHPID >> x/tasks && exec cat) & [1] 7154 $ grep ^mapped x/memory.stat mapped_file 53248 $ echo 7154 > tasks $ rmdir x The fix is to remove the check. It's currently dangerous and isn't worth fixing it to use something expensive, such as percpu_counter_sum(), for each reparented page. __this_cpu_read() isn't enough to fix this because there's no guarantees of the current cpus count. The only guarantees is that the sum of all per-cpu counter is >= nr_pages. Signed-off-by: Greg Thelen --- mm/memcontrol.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 34d3ca9..aa8185c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,6 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - WARN_ON_ONCE(from->stat->count[idx] < nr_pages); __this_cpu_add(from->stat->count[idx], -nr_pages); __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
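The key point in preferring removal over wrapping the check in __this_cpu_read() is that no single CPU's slot of the per-cpu stat array obeys a useful invariant; only the sum over all CPUs does. A simplified standalone model (assumed structure, not kernel code) shows how one slot can legitimately go negative while the aggregate stays correct:

/*
 * Standalone sketch (simplified model, not kernel code) of a per-cpu
 * counter: charges and uncharges may land on different CPUs, so any
 * single slot can go negative even though the summed value never does.
 */
#include <stdio.h>

#define NR_CPUS 4

static long stat_count[NR_CPUS];	/* models from->stat->count[idx] */

static void charge(int cpu, long nr_pages)
{
	stat_count[cpu] += nr_pages;
}

static void uncharge(int cpu, long nr_pages)
{
	stat_count[cpu] -= nr_pages;
}

static long total(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += stat_count[cpu];
	return sum;
}

int main(void)
{
	charge(0, 10);		/* pages charged while running on CPU 0 */
	uncharge(3, 10);	/* later uncharged while running on CPU 3 */

	/*
	 * CPU 3's slot is now -10, so a warning keyed to a single slot
	 * would fire spuriously; only the aggregate (0 here) is meaningful.
	 */
	printf("cpu3 slot = %ld, total = %ld\n", stat_count[3], total());
	return 0;
}

Summing every slot for each reparented page would restore the check, but as the changelog argues that is too expensive to be worthwhile.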
[PATCH 2/2 v4] memcg: support hierarchical memory.numa_stats
From: Ying Han From: Ying Han The memory.numa_stat file was not hierarchical. Memory charged to the children was not shown in parent's numa_stat. This change adds the "hierarchical_" stats to the existing stats. The new hierarchical stats include the sum of all children's values in addition to the value of the memcg. Tested: Create cgroup a, a/b and run workload under b. The values of b are included in the "hierarchical_*" under a. $ cd /sys/fs/cgroup $ echo 1 > memory.use_hierarchy $ mkdir a a/b Run workload in a/b: $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) & The hierarchical_ fields in parent (a) show use of workload in a/b: $ cat a/memory.numa_stat total=0 N0=0 N1=0 N2=0 N3=0 file=0 N0=0 N1=0 N2=0 N3=0 anon=0 N0=0 N1=0 N2=0 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 $ cat a/b/memory.numa_stat total=908 N0=552 N1=317 N2=39 N3=0 file=850 N0=549 N1=301 N2=0 N3=0 anon=58 N0=3 N1=16 N2=39 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 Signed-off-by: Ying Han Signed-off-by: Greg Thelen --- Changelog since v3: - push 'iter' local variable usage closer to its usage - documentation fixup Documentation/cgroups/memory.txt | 10 +++--- mm/memcontrol.c | 17 + 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 8af4ad1..e2bc132 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -573,15 +573,19 @@ an memcg since the pages are allowed to be allocated from any physical node. One of the use cases is evaluating application performance by combining this information with the application's CPU allocation. -We export "total", "file", "anon" and "unevictable" pages per-node for -each memcg. The ouput format of memory.numa_stat is: +Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable" +per-node page counts including "hierarchical_" which sums up all +hierarchical children's values in addition to the memcg's own value. + +The ouput format of memory.numa_stat is: total= N0= N1= ... file= N0= N1= ... anon= N0= N1= ... unevictable= N0= N1= ... +hierarchical_= N0= N1= ... -And we have total = file + anon + unevictable. +The "total" count is sum of file + anon + unevictable. 6. 
Hierarchy support diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5806eea..d02176d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5206,6 +5206,23 @@ static int memcg_numa_stat_show(struct cgroup_subsys_state *css, seq_putc(m, '\n'); } + for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { + struct mem_cgroup *iter; + + nr = 0; + for_each_mem_cgroup_tree(iter, memcg) + nr += mem_cgroup_nr_lru_pages(iter, stat->lru_mask); + seq_printf(m, "hierarchical_%s=%lu", stat->name, nr); + for_each_node_state(nid, N_MEMORY) { + nr = 0; + for_each_mem_cgroup_tree(iter, memcg) + nr += mem_cgroup_node_nr_lru_pages( + iter, nid, stat->lru_mask); + seq_printf(m, " N%d=%lu", nid, nr); + } + seq_putc(m, '\n'); + } + return 0; } #endif /* CONFIG_NUMA */ -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
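The hierarchical_* values reported above are simply the memcg's own per-node counts plus those of every descendant, gathered by for_each_mem_cgroup_tree(). A toy standalone sketch of the same aggregation (hypothetical types, single-child chain for brevity, not the kernel's mem_cgroup):

/*
 * Toy sketch (hypothetical types) of how a hierarchical_* value is the
 * sum over a memcg and all of its descendants.
 */
#include <stdio.h>

#define MAX_NODES 4

struct toy_memcg {
	unsigned long pages[MAX_NODES];	/* own per-node counts */
	struct toy_memcg *child;	/* single child for brevity */
};

/* walk the subtree rooted at @memcg, like for_each_mem_cgroup_tree() */
static unsigned long hierarchical_node_pages(struct toy_memcg *memcg, int nid)
{
	unsigned long nr = 0;

	for (; memcg; memcg = memcg->child)
		nr += memcg->pages[nid];
	return nr;
}

int main(void)
{
	struct toy_memcg b = { .pages = { 552, 317, 39, 0 } };		/* a/b */
	struct toy_memcg a = { .pages = { 0, 0, 0, 0 }, .child = &b };	/* a */

	for (int nid = 0; nid < MAX_NODES; nid++)
		printf(" N%d=%lu", nid, hierarchical_node_pages(&a, nid));
	printf("\n");
	return 0;
}

With only a/b populated, the printed per-node counts match the hierarchical_total row shown for cgroup a in the test output above.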
[PATCH 1/2 v4] memcg: refactor mem_control_numa_stat_show()
Refactor mem_control_numa_stat_show() to use a new stats structure for smaller and simpler code. This consolidates nearly identical code. text data bssdec hex filename 8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before 8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after Signed-off-by: Greg Thelen Signed-off-by: Ying Han --- Changelog since v3: - Use ARRAY_SIZE(stats) rather than array terminator. - rebased to latest linus/master (d8efd82) to incorporate 182446d08 "cgroup: pass around cgroup_subsys_state instead of cgroup in file methods". mm/memcontrol.c | 58 +++-- 1 file changed, 23 insertions(+), 35 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d5ff3ce..5806eea 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5179,45 +5179,33 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css, static int memcg_numa_stat_show(struct cgroup_subsys_state *css, struct cftype *cft, struct seq_file *m) { + struct numa_stat { + const char *name; + unsigned int lru_mask; + }; + + static const struct numa_stat stats[] = { + { "total", LRU_ALL }, + { "file", LRU_ALL_FILE }, + { "anon", LRU_ALL_ANON }, + { "unevictable", BIT(LRU_UNEVICTABLE) }, + }; + const struct numa_stat *stat; int nid; - unsigned long total_nr, file_nr, anon_nr, unevictable_nr; - unsigned long node_nr; + unsigned long nr; struct mem_cgroup *memcg = mem_cgroup_from_css(css); - total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL); - seq_printf(m, "total=%lu", total_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL); - seq_printf(m, " N%d=%lu", nid, node_nr); - } - seq_putc(m, '\n'); - - file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE); - seq_printf(m, "file=%lu", file_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_FILE); - seq_printf(m, " N%d=%lu", nid, node_nr); - } - seq_putc(m, '\n'); - - anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON); - seq_printf(m, "anon=%lu", anon_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_ANON); - seq_printf(m, " N%d=%lu", nid, node_nr); - } - seq_putc(m, '\n'); - - unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE)); - seq_printf(m, "unevictable=%lu", unevictable_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - BIT(LRU_UNEVICTABLE)); - seq_printf(m, " N%d=%lu", nid, node_nr); + for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { + nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask); + seq_printf(m, "%s=%lu", stat->name, nr); + for_each_node_state(nid, N_MEMORY) { + nr = mem_cgroup_node_nr_lru_pages(memcg, nid, + stat->lru_mask); + seq_printf(m, " N%d=%lu", nid, nr); + } + seq_putc(m, '\n'); } - seq_putc(m, '\n'); + return 0; } #endif /* CONFIG_NUMA */ -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
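One of the v4 changes is bounding the loop with ARRAY_SIZE(stats) instead of a { NULL, 0 } terminator entry (still present in the v3 version later in this listing), which drops the sentinel and ties the loop bound to the table itself. A standalone sketch of the bounded iteration (placeholder masks, not the kernel's LRU_* values):

/*
 * Standalone sketch (assumed example, not kernel code) of the
 * ARRAY_SIZE()-bounded table iteration used by the v4 refactor.
 */
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct numa_stat {
	const char *name;
	unsigned int lru_mask;
};

static const struct numa_stat stats[] = {
	{ "total",       0x0f },	/* placeholder masks */
	{ "file",        0x03 },
	{ "anon",        0x0c },
	{ "unevictable", 0x10 },
	/* no terminator entry needed */
};

int main(void)
{
	const struct numa_stat *stat;

	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++)
		printf("%s mask=0x%x\n", stat->name, stat->lru_mask);
	return 0;
}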
[PATCH 2/2 v3] memcg: support hierarchical memory.numa_stats
From: Ying Han The memory.numa_stat file was not hierarchical. Memory charged to the children was not shown in parent's numa_stat. This change adds the "hierarchical_" stats to the existing stats. The new hierarchical stats include the sum of all children's values in addition to the value of the memcg. Tested: Create cgroup a, a/b and run workload under b. The values of b are included in the "hierarchical_*" under a. $ cd /sys/fs/cgroup $ echo 1 > memory.use_hierarchy $ mkdir a a/b Run workload in a/b: $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) & The hierarchical_ fields in parent (a) show use of workload in a/b: $ cat a/memory.numa_stat total=0 N0=0 N1=0 N2=0 N3=0 file=0 N0=0 N1=0 N2=0 N3=0 anon=0 N0=0 N1=0 N2=0 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=61 N0=0 N1=41 N2=20 N3=0 hierarchical_file=14 N0=0 N1=0 N2=14 N3=0 hierarchical_anon=47 N0=0 N1=41 N2=6 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 The workload memory usage: $ cat a/b/memory.numa_stat total=73 N0=0 N1=41 N2=32 N3=0 file=14 N0=0 N1=0 N2=14 N3=0 anon=59 N0=0 N1=41 N2=18 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=73 N0=0 N1=41 N2=32 N3=0 hierarchical_file=14 N0=0 N1=0 N2=14 N3=0 hierarchical_anon=59 N0=0 N1=41 N2=18 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 Signed-off-by: Ying Han Signed-off-by: Greg Thelen --- Changelog since v2: - reworded Documentation/cgroup/memory.txt - updated commit description Documentation/cgroups/memory.txt | 10 +++--- mm/memcontrol.c | 16 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 2a33306..d6d6479 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -571,15 +571,19 @@ an memcg since the pages are allowed to be allocated from any physical node. One of the use cases is evaluating application performance by combining this information with the application's CPU allocation. -We export "total", "file", "anon" and "unevictable" pages per-node for -each memcg. The ouput format of memory.numa_stat is: +Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable" +per-node page counts including "hierarchical_" which sums of all +hierarchical children's values in addition to the memcg's own value. + +The ouput format of memory.numa_stat is: total= N0= N1= ... file= N0= N1= ... anon= N0= N1= ... unevictable= N0= N1= ... +hierarchical_= N0= N1= ... -And we have total = file + anon + unevictable. +The "total" count is sum of file + anon + unevictable. 6. 
Hierarchy support diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4d2b037..0e5be30 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5394,6 +5394,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, int nid; unsigned long nr; struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + struct mem_cgroup *iter; for (stat = stats; stat->name; stat++) { nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask); @@ -5406,6 +5407,21 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, seq_putc(m, '\n'); } + for (stat = stats; stat->name; stat++) { + nr = 0; + for_each_mem_cgroup_tree(iter, memcg) + nr += mem_cgroup_nr_lru_pages(iter, stat->lru_mask); + seq_printf(m, "hierarchical_%s=%lu", stat->name, nr); + for_each_node_state(nid, N_MEMORY) { + nr = 0; + for_each_mem_cgroup_tree(iter, memcg) + nr += mem_cgroup_node_nr_lru_pages( + iter, nid, stat->lru_mask); + seq_printf(m, " N%d=%lu", nid, nr); + } + seq_putc(m, '\n'); + } + return 0; } #endif /* CONFIG_NUMA */ -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2 v3] memcg: refactor mem_control_numa_stat_show()
Refactor mem_control_numa_stat_show() to use a new stats structure for smaller and simpler code. This consolidates nearly identical code. text data bssdec hex filename 8,055,980 1,675,648 1,896,448 11,628,076 b16e2c vmlinux.before 8,055,276 1,675,648 1,896,448 11,627,372 b16b6c vmlinux.after Signed-off-by: Greg Thelen Signed-off-by: Ying Han --- Changelog since v2: - rebased to v3.11 - updated commit description mm/memcontrol.c | 57 +++-- 1 file changed, 23 insertions(+), 34 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0878ff7..4d2b037 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5378,45 +5378,34 @@ static int mem_cgroup_move_charge_write(struct cgroup *cgrp, static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, struct seq_file *m) { + struct numa_stat { + const char *name; + unsigned int lru_mask; + }; + + static const struct numa_stat stats[] = { + { "total", LRU_ALL }, + { "file", LRU_ALL_FILE }, + { "anon", LRU_ALL_ANON }, + { "unevictable", BIT(LRU_UNEVICTABLE) }, + { NULL, 0 } /* terminator */ + }; + const struct numa_stat *stat; int nid; - unsigned long total_nr, file_nr, anon_nr, unevictable_nr; - unsigned long node_nr; + unsigned long nr; struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL); - seq_printf(m, "total=%lu", total_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL); - seq_printf(m, " N%d=%lu", nid, node_nr); - } - seq_putc(m, '\n'); - - file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE); - seq_printf(m, "file=%lu", file_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_FILE); - seq_printf(m, " N%d=%lu", nid, node_nr); - } - seq_putc(m, '\n'); - - anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON); - seq_printf(m, "anon=%lu", anon_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_ANON); - seq_printf(m, " N%d=%lu", nid, node_nr); + for (stat = stats; stat->name; stat++) { + nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask); + seq_printf(m, "%s=%lu", stat->name, nr); + for_each_node_state(nid, N_MEMORY) { + nr = mem_cgroup_node_nr_lru_pages(memcg, nid, + stat->lru_mask); + seq_printf(m, " N%d=%lu", nid, nr); + } + seq_putc(m, '\n'); } - seq_putc(m, '\n'); - unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE)); - seq_printf(m, "unevictable=%lu", unevictable_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - BIT(LRU_UNEVICTABLE)); - seq_printf(m, " N%d=%lu", nid, node_nr); - } - seq_putc(m, '\n'); return 0; } #endif /* CONFIG_NUMA */ -- 1.8.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2 v3] memcg: refactor mem_control_numa_stat_show()
Refactor mem_control_numa_stat_show() to use a new stats structure for smaller and simpler code. This consolidates nearly identical code. text data bssdec hex filename 8,055,980 1,675,648 1,896,448 11,628,076 b16e2c vmlinux.before 8,055,276 1,675,648 1,896,448 11,627,372 b16b6c vmlinux.after Signed-off-by: Greg Thelen gthe...@google.com Signed-off-by: Ying Han ying...@google.com --- Changelog since v2: - rebased to v3.11 - updated commit description mm/memcontrol.c | 57 +++-- 1 file changed, 23 insertions(+), 34 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0878ff7..4d2b037 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5378,45 +5378,34 @@ static int mem_cgroup_move_charge_write(struct cgroup *cgrp, static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, struct seq_file *m) { + struct numa_stat { + const char *name; + unsigned int lru_mask; + }; + + static const struct numa_stat stats[] = { + { total, LRU_ALL }, + { file, LRU_ALL_FILE }, + { anon, LRU_ALL_ANON }, + { unevictable, BIT(LRU_UNEVICTABLE) }, + { NULL, 0 } /* terminator */ + }; + const struct numa_stat *stat; int nid; - unsigned long total_nr, file_nr, anon_nr, unevictable_nr; - unsigned long node_nr; + unsigned long nr; struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); - total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL); - seq_printf(m, total=%lu, total_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL); - seq_printf(m, N%d=%lu, nid, node_nr); - } - seq_putc(m, '\n'); - - file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE); - seq_printf(m, file=%lu, file_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_FILE); - seq_printf(m, N%d=%lu, nid, node_nr); - } - seq_putc(m, '\n'); - - anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON); - seq_printf(m, anon=%lu, anon_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - LRU_ALL_ANON); - seq_printf(m, N%d=%lu, nid, node_nr); + for (stat = stats; stat-name; stat++) { + nr = mem_cgroup_nr_lru_pages(memcg, stat-lru_mask); + seq_printf(m, %s=%lu, stat-name, nr); + for_each_node_state(nid, N_MEMORY) { + nr = mem_cgroup_node_nr_lru_pages(memcg, nid, + stat-lru_mask); + seq_printf(m, N%d=%lu, nid, nr); + } + seq_putc(m, '\n'); } - seq_putc(m, '\n'); - unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE)); - seq_printf(m, unevictable=%lu, unevictable_nr); - for_each_node_state(nid, N_MEMORY) { - node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, - BIT(LRU_UNEVICTABLE)); - seq_printf(m, N%d=%lu, nid, node_nr); - } - seq_putc(m, '\n'); return 0; } #endif /* CONFIG_NUMA */ -- 1.8.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2 v3] memcg: support hierarchical memory.numa_stats
From: Ying Han ying...@google.com The memory.numa_stat file was not hierarchical. Memory charged to the children was not shown in parent's numa_stat. This change adds the hierarchical_ stats to the existing stats. The new hierarchical stats include the sum of all children's values in addition to the value of the memcg. Tested: Create cgroup a, a/b and run workload under b. The values of b are included in the hierarchical_* under a. $ cd /sys/fs/cgroup $ echo 1 > memory.use_hierarchy $ mkdir a a/b Run workload in a/b: $ (echo $BASHPID > a/b/cgroup.procs && cat /some/file && bash) The hierarchical_ fields in parent (a) show use of workload in a/b: $ cat a/memory.numa_stat total=0 N0=0 N1=0 N2=0 N3=0 file=0 N0=0 N1=0 N2=0 N3=0 anon=0 N0=0 N1=0 N2=0 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=61 N0=0 N1=41 N2=20 N3=0 hierarchical_file=14 N0=0 N1=0 N2=14 N3=0 hierarchical_anon=47 N0=0 N1=41 N2=6 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 The workload memory usage: $ cat a/b/memory.numa_stat total=73 N0=0 N1=41 N2=32 N3=0 file=14 N0=0 N1=0 N2=14 N3=0 anon=59 N0=0 N1=41 N2=18 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=73 N0=0 N1=41 N2=32 N3=0 hierarchical_file=14 N0=0 N1=0 N2=14 N3=0 hierarchical_anon=59 N0=0 N1=41 N2=18 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 Signed-off-by: Ying Han ying...@google.com Signed-off-by: Greg Thelen gthe...@google.com --- Changelog since v2: - reworded Documentation/cgroup/memory.txt - updated commit description Documentation/cgroups/memory.txt | 10 +++--- mm/memcontrol.c | 16 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 2a33306..d6d6479 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -571,15 +571,19 @@ an memcg since the pages are allowed to be allocated from any physical node. One of the use cases is evaluating application performance by combining this information with the application's CPU allocation. -We export "total", "file", "anon" and "unevictable" pages per-node for -each memcg. The ouput format of memory.numa_stat is: +Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable" +per-node page counts including "hierarchical_<counter>" which sums of all +hierarchical children's values in addition to the memcg's own value. + +The ouput format of memory.numa_stat is: total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... +hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ... -And we have total = file + anon + unevictable. +The total count is sum of file + anon + unevictable. 6.
Hierarchy support diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4d2b037..0e5be30 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5394,6 +5394,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, int nid; unsigned long nr; struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + struct mem_cgroup *iter; for (stat = stats; stat->name; stat++) { nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask); @@ -5406,6 +5407,21 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft, seq_putc(m, '\n'); } + for (stat = stats; stat->name; stat++) { + nr = 0; + for_each_mem_cgroup_tree(iter, memcg) + nr += mem_cgroup_nr_lru_pages(iter, stat->lru_mask); + seq_printf(m, "hierarchical_%s=%lu", stat->name, nr); + for_each_node_state(nid, N_MEMORY) { + nr = 0; + for_each_mem_cgroup_tree(iter, memcg) + nr += mem_cgroup_node_nr_lru_pages( + iter, nid, stat->lru_mask); + seq_printf(m, " N%d=%lu", nid, nr); + } + seq_putc(m, '\n'); + } + return 0; } #endif /* CONFIG_NUMA */ -- 1.8.4
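The hierarchical_* values are simply a memcg's own counts plus those of every descendant, which is what for_each_mem_cgroup_tree() walks in the patch above. A hedged userspace sketch of that accumulation follows; the two-node tree and the page counts are made up, standing in for cgroups a and a/b from the test output.

#include <stdio.h>

struct node {
	const char *name;
	unsigned long pages;
	struct node *child[2];
};

/* hierarchical count: the node's own pages plus all descendants' pages */
static unsigned long subtree_pages(const struct node *n)
{
	unsigned long sum = n->pages;
	int i;

	for (i = 0; i < 2; i++)
		if (n->child[i])
			sum += subtree_pages(n->child[i]);
	return sum;
}

int main(void)
{
	struct node b = { "a/b", 73, { NULL, NULL } };
	struct node a = { "a", 0, { &b, NULL } };

	/* a's own total is 0, but its hierarchical total includes a/b */
	printf("%s: total=%lu hierarchical_total=%lu\n",
	       a.name, a.pages, subtree_pages(&a));
	return 0;
}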
[PATCH] memcg: fix multiple large threshold notifications
A memory cgroup with (1) multiple threshold notifications and (2) at least one threshold >=2G was not reliable. Specifically the notifications would either not fire or would not fire in the proper order. The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit thresholds in sorted order. mem_cgroup_usage_register_event() sorts them with compare_thresholds(), which returns the difference of two 64 bit thresholds as an int. If the difference is positive but has bit[31] set, then sort() treats the difference as negative and breaks sort order. This fix compares the two arbitrary 64 bit thresholds returning the classic -1, 0, 1 result. The test below sets two notifications (at 0x1000 and 0x81001000): cd /sys/fs/cgroup/memory mkdir x for x in 4096 2164264960; do cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" & done echo $$ > x/cgroup.procs anon_leaker 500M v3.11-rc7 fails to signal the 4096 event listener: Leaking... Done leaking pages. Patched v3.11-rc7 properly notifies: Leaking... 4096 listener:2013:8:31:14:13:36 Done leaking pages. The fixed bug is old. It appears to date back to the introduction of memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg: implement memory thresholds" Signed-off-by: Greg Thelen --- mm/memcontrol.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0878ff7..aa44621 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5616,7 +5616,13 @@ static int compare_thresholds(const void *a, const void *b) const struct mem_cgroup_threshold *_a = a; const struct mem_cgroup_threshold *_b = b; - return _a->threshold - _b->threshold; + if (_a->threshold > _b->threshold) + return 1; + + if (_a->threshold < _b->threshold) + return -1; + + return 0; } static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) -- 1.8.4
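The failure mode is easy to reproduce outside the kernel. The sketch below is plain userspace C, not kernel code: it compares the old "return a - b" style against the explicit three-way comparison the patch switches to. With the two thresholds from the test above, the 64-bit difference has bit 31 set, so narrowing it to int (implementation-defined, but a plain truncation on common ABIs) flips the sign and the larger threshold sorts as the smaller one.

#include <stdio.h>
#include <stdint.h>

/* mirrors the old compare_thresholds(): 64-bit difference narrowed to int */
static int broken_cmp(uint64_t a, uint64_t b)
{
	return (int)(a - b);
}

/* mirrors the fixed version: explicit -1 / 0 / 1 result */
static int fixed_cmp(uint64_t a, uint64_t b)
{
	if (a > b)
		return 1;
	if (a < b)
		return -1;
	return 0;
}

int main(void)
{
	uint64_t small = 0x1000;	/* 4096 */
	uint64_t big = 0x81001000;	/* 2164264960 */

	/* broken_cmp() wrongly reports big < small; fixed_cmp() does not */
	printf("broken: %d  fixed: %d\n",
	       broken_cmp(big, small), fixed_cmp(big, small));
	return 0;
}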
Re: [PATCHv10 1/4] debugfs: add get/set for atomic types
On Wed, May 08 2013, Seth Jennings wrote: > debugfs currently lack the ability to create attributes > that set/get atomic_t values. > > This patch adds support for this through a new > debugfs_create_atomic_t() function. > > Signed-off-by: Seth Jennings > Acked-by: Greg Kroah-Hartman > Acked-by: Mel Gorman > --- > fs/debugfs/file.c | 42 ++ > include/linux/debugfs.h | 2 ++ > 2 files changed, 44 insertions(+) > > diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c > index c5ca6ae..fa26d5b 100644 > --- a/fs/debugfs/file.c > +++ b/fs/debugfs/file.c > @@ -21,6 +21,7 @@ > #include <linux/debugfs.h> > #include <linux/io.h> > #include <linux/slab.h> > +#include <linux/atomic.h> > > static ssize_t default_read_file(struct file *file, char __user *buf, > size_t count, loff_t *ppos) > @@ -403,6 +404,47 @@ struct dentry *debugfs_create_size_t(const char *name, > umode_t mode, > } > EXPORT_SYMBOL_GPL(debugfs_create_size_t); > > +static int debugfs_atomic_t_set(void *data, u64 val) > +{ > + atomic_set((atomic_t *)data, val); > + return 0; > +} > +static int debugfs_atomic_t_get(void *data, u64 *val) > +{ > + *val = atomic_read((atomic_t *)data); > + return 0; > +} > +DEFINE_SIMPLE_ATTRIBUTE(fops_atomic_t, debugfs_atomic_t_get, > + debugfs_atomic_t_set, "%llu\n"); > +DEFINE_SIMPLE_ATTRIBUTE(fops_atomic_t_ro, debugfs_atomic_t_get, NULL, > "%llu\n"); > +DEFINE_SIMPLE_ATTRIBUTE(fops_atomic_t_wo, NULL, debugfs_atomic_t_set, > "%llu\n"); > + > +/** > + * debugfs_create_atomic_t - create a debugfs file that is used to read and > + * write an atomic_t value > + * @name: a pointer to a string containing the name of the file to create. > + * @mode: the permission that the file should have > + * @parent: a pointer to the parent dentry for this file. This should be a > + * directory dentry if set. If this parameter is %NULL, then the > + * file will be created in the root of the debugfs filesystem. > + * @value: a pointer to the variable that the file should read to and write > + * from. > + */ > +struct dentry *debugfs_create_atomic_t(const char *name, umode_t mode, > + struct dentry *parent, atomic_t *value) > +{ > + /* if there are no write bits set, make read only */ > + if (!(mode & S_IWUGO)) > + return debugfs_create_file(name, mode, parent, value, > + &fops_atomic_t_ro); > + /* if there are no read bits set, make write only */ > + if (!(mode & S_IRUGO)) > + return debugfs_create_file(name, mode, parent, value, > + &fops_atomic_t_wo); > + > + return debugfs_create_file(name, mode, parent, value, &fops_atomic_t); > +} > +EXPORT_SYMBOL_GPL(debugfs_create_atomic_t); > > static ssize_t read_file_bool(struct file *file, char __user *user_buf, > size_t count, loff_t *ppos) > diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h > index 63f2465..d68b4ea 100644 > --- a/include/linux/debugfs.h > +++ b/include/linux/debugfs.h > @@ -79,6 +79,8 @@ struct dentry *debugfs_create_x64(const char *name, umode_t > mode, > struct dentry *parent, u64 *value); > struct dentry *debugfs_create_size_t(const char *name, umode_t mode, > struct dentry *parent, size_t *value); > +struct dentry *debugfs_create_atomic_t(const char *name, umode_t mode, > + struct dentry *parent, atomic_t *value); > struct dentry *debugfs_create_bool(const char *name, umode_t mode, > struct dentry *parent, u32 *value); Looking at v3.9 I see a conflicting definition of debugfs_create_atomic_t() in lib/fault-inject.c.
A kernel with this patch and CONFIG_FAULT_INJECTION=y and CONFIG_FAULT_INJECTION_DEBUG_FS=y will not build: lib/fault-inject.c:196:23: error: static declaration of 'debugfs_create_atomic_t' follows non-static declaration include/linux/debugfs.h:87:16: note: previous declaration of 'debugfs_create_atomic_t' was here
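The conflict itself is a generic C issue: once a header publishes a non-static prototype, a translation unit that includes it can no longer define a private static function with the same name. The sketch below uses made-up names (this is not the actual lib/fault-inject.c code) to show the error and one way out, renaming or dropping the private helper.

#include <stdio.h>

/* pretend this prototype comes from a shared header, like debugfs.h does */
int create_knob(const char *name);

/*
 * Defining "static int create_knob(...)" in a file that includes the header
 * fails with: "error: static declaration of 'create_knob' follows
 * non-static declaration".  Renaming the private helper resolves it:
 */
static int local_create_knob(const char *name)
{
	printf("creating %s\n", name);
	return 0;
}

/* the public symbol now matches the header's non-static declaration */
int create_knob(const char *name)
{
	return local_create_knob(name);
}

int main(void)
{
	return create_knob("demo");
}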
Re: [PATCH] vfs: dcache: cond_resched in shrink_dentry_list
On Wed, Apr 10 2013, Andrew Morton wrote: > On Tue, 09 Apr 2013 17:37:20 -0700 Greg Thelen wrote: > >> > Call cond_resched() in shrink_dcache_parent() to maintain >> > interactivity. >> > >> > Before this patch: >> > >> > void shrink_dcache_parent(struct dentry * parent) >> > { >> > while ((found = select_parent(parent, &dispose)) != 0) >> > shrink_dentry_list(&dispose); >> > } >> > >> > select_parent() populates the dispose list with dentries which >> > shrink_dentry_list() then deletes. select_parent() carefully uses >> > need_resched() to avoid doing too much work at once. But neither >> > shrink_dcache_parent() nor its called functions call cond_resched(). >> > So once need_resched() is set select_parent() will return single >> > dentry dispose list which is then deleted by shrink_dentry_list(). >> > This is inefficient when there are a lot of dentry to process. This >> > can cause softlockup and hurts interactivity on non preemptable >> > kernels. >> > >> > This change adds cond_resched() in shrink_dcache_parent(). The >> > benefit of this is that need_resched() is quickly cleared so that >> > future calls to select_parent() are able to efficiently return a big >> > batch of dentry. >> > >> > These additional cond_resched() do not seem to impact performance, at >> > least for the workload below. >> > >> > Here is a program which can cause soft lockup on a if other system >> > activity sets need_resched(). > > I was unable to guess what word was missing from "on a if other" ;) Less is more ;) Reword to: Here is a program which can cause soft lockup if other system activity sets need_resched(). >> Should this change go through Al's or Andrew's branch? > > I'll fight him for it. Thanks. > Softlockups are fairly serious, so I'll put a cc:stable in there. Or > were the changes which triggered this problem added after 3.9? This also applies to stable. I see the problem at least back to v3.3. I did not test earlier kernels, but could if you want.
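The batching effect being described, where a stuck need_resched() degrades every select_parent() pass to a single dentry, can be mimicked in userspace. In the hedged sketch below the flag, the batch collector and the yield are all stand-ins rather than kernel APIs; the point is only that clearing the flag between batches restores large batches.

#include <stdio.h>

static int need_resched_flag = 1;	/* pretend other activity set it */

/* stand-in for select_parent(): collect items until the flag is seen */
static int select_batch(int *remaining)
{
	int batch = 0;

	while (*remaining > 0) {
		(*remaining)--;
		batch++;
		if (need_resched_flag)
			break;		/* be polite, stop early */
	}
	return batch;
}

/* stand-in for cond_resched(): yielding clears the flag */
static void fake_cond_resched(void)
{
	need_resched_flag = 0;
}

int main(void)
{
	int remaining = 10, batch, rounds = 0;

	while ((batch = select_batch(&remaining)) != 0) {
		printf("round %d: disposed %d entries\n", ++rounds, batch);
		fake_cond_resched();	/* the call the patch adds */
	}
	return 0;
}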