Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-01-29 Thread Greg Thelen

On Thu, Jan 29 2015, Tejun Heo wrote:

> Hello,
>
> Since the cgroup writeback patchset[1] was posted, several people have
> brought up concerns about whether the complexity of allowing an inode
> to be dirtied against multiple cgroups is necessary for the purpose of
> writeback, and it is true that a significant amount of complexity (note
> that bdi still needs to be split, so it's still not trivial) can be
> removed if we assume that an inode always belongs to one cgroup for
> the purpose of writeback.
>
> However, as mentioned before, this issue is directly linked to whether
> memcg needs to track the memory ownership per-page.  If there are
> valid use cases where the pages of an inode must be tracked to be
> owned by different cgroups, cgroup writeback must be able to handle
> that situation properly.  If there are no such cases, the cgroup
> writeback support can be simplified but again we should put memcg on
> the same cadence and enforce per-inode (or per-anon_vma) ownership
> from the beginning.  The conclusion can be either way - per-page or
> per-inode - but both memcg and blkcg must be looking at the same
> picture.  Letting them deviate is highly likely to lead to long-term issues
> forcing us to look at this again anyway, only with far more baggage.
>
> One thing to note is that the per-page tracking which is currently
> employed by memcg seems to have been born more out of convenience
> rather than requirements for any actual use cases.  Per-page ownership
> makes sense iff pages of an inode have to be associated with different
> cgroups - IOW, when an inode is accessed by multiple cgroups; however,
> currently, memcg assigns a page to its instantiating memcg and leaves
> it at that till the page is released.  This means that if a page is
> instantiated by one cgroup and then subsequently accessed only by a
> different cgroup, whether the page's charge gets moved to the cgroup
> which is actively using it is purely incidental.  If the page gets
> reclaimed and released at some point, it'll be moved.  If not, it
> won't.
>
> AFAICS, the only case where the current per-page accounting works
> properly is when disjoint sections of an inode are used by different
> cgroups and the whole thing hinges on whether this use case justifies
> all the added overhead including page->mem_cgroup pointer and the
> extra complexity in the writeback layer.  FWIW, I'm doubtful.
> Johannes, Michal, Greg, what do you guys think?
>
> If the above use case - a huge file being actively accessed disjointly
> by multiple cgroups - isn't significant enough and there aren't other
> use cases that I missed which can benefit from the per-page tracking
> that's currently implemented, it'd be logical to switch to per-inode
> (or per-anon_vma or per-slab) ownership tracking.  For the short term,
> even just adding extra ownership information to those containing
> objects and inheriting it to page->mem_cgroup could work, although
> it'd definitely be beneficial to eventually get rid of
> page->mem_cgroup.
>
> As with per-page, when the ownership terminates is debatable w/
> per-inode tracking.  Also, supporting some form of shared accounting
> across different cgroups may be useful (e.g. shared library's memory
> being equally split among anyone who accesses it); however, these
> aren't likely to be major and trying to do something smart may affect
> other use cases adversely, so it'd probably be best to just keep it
> dumb and clear the ownership when the inode loses all pages (a cgroup
> can disown such an inode through FADV_DONTNEED if necessary).
>
> What do you guys think?  If making memcg track ownership at per-inode
> level, even for just the unified hierarchy, is the direction we can
> take, I'll go ahead and simplify the cgroup writeback patchset.
>
> Thanks.

I find simplification appealing.  But I'm not sure it will fly, if for
no other reason than the shared accounting.  I'm ignoring intentional
sharing, used by carefully crafted apps, and just thinking about
incidental sharing (e.g. libc).

Example:

$ mkdir small
$ echo 1M > small/memory.limit_in_bytes
$ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &

$ mkdir big
$ echo 10G > big/memory.limit_in_bytes
$ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &


Assuming big/mlockall_database mlocks all of libc, it will oom kill the
small memcg, because libc is owned by small due to small having touched
it first.  It'd be hard to figure out what small did wrong to deserve
the oom kill.
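
A rough way to see where the incidentally shared pages actually landed
(a sketch, assuming the v1 memcg hierarchy created above and the
standard memory.stat fields):

$ # cache/rss/mapped_file are per-memcg counters reported in memory.stat
$ grep -E '^(cache|rss|mapped_file) ' small/memory.stat big/memory.stat
$ cat small/memory.usage_in_bytes big/memory.usage_in_bytes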

FWIW we've been using memcg writeback where inodes have a memcg
writeback owner.  Once multiple memcgs write to an inode, the inode
becomes writeback-shared, which makes it more likely to be written.
Once cleaned, the inode is again able to be privately owned:
https://lkml.org/lkml/2011/8/17/200

Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory

2015-01-13 Thread Greg Thelen

On Thu, Jan 08 2015, Johannes Weiner wrote:

> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
>
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
>
> The control files are thus:
>
>   - memory.current shows the current consumption of the cgroup and its
> descendants, in bytes.
>
>   - memory.low configures the lower end of the cgroup's expected
> memory consumption range.  The kernel considers memory below that
> boundary to be a reserve - the minimum that the workload needs in
> order to make forward progress - and generally avoids reclaiming
> it, unless there is an imminent risk of entering an OOM situation.

So this is a try-hard, but no-promises, interface.  No complaints.  But
I assume that an eventual extension is a more rigid memory.min which
specifies a minimum working set below which a container would prefer an
oom kill to thrashing.
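
For example, something along these lines (a sketch, assuming the unified
hierarchy is mounted at /sys/fs/cgroup; memory.low is from this patch,
while the memory.min knob below is hypothetical):

$ cd /sys/fs/cgroup
$ mkdir job
$ echo $((512 << 20)) > job/memory.low   # best-effort reserve (this patch)
$ echo $((256 << 20)) > job/memory.min   # hypothetical hard floor, not in this patch
$ echo $BASHPID > job/cgroup.procs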

[PATCH] memcg: remove extra newlines from memcg oom kill log

2015-01-12 Thread Greg Thelen
Commit e61734c55c24 ("cgroup: remove cgroup->name") added two extra
newlines to memcg oom kill log messages.  This makes dmesg hard to read
and parse.  The issue affects 3.15+.
Example:
  Task in /t  <<< extra #1
   killed as a result of limit of /t
  <<< extra #2
  memory: usage 102400kB, limit 102400kB, failcnt 274712

Remove the extra newlines from memcg oom kill messages, so the messages
look like:
  Task in /t killed as a result of limit of /t
  memory: usage 102400kB, limit 102400kB, failcnt 240649

Fixes: e61734c55c24 ("cgroup: remove cgroup->name")
Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 851924fa5170..683b4782019b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1477,9 +1477,9 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 
pr_info("Task in ");
pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
-   pr_info(" killed as a result of limit of ");
+   pr_cont(" killed as a result of limit of ");
pr_cont_cgroup_path(memcg->css.cgroup);
-   pr_info("\n");
+   pr_cont("\n");
 
rcu_read_unlock();
 
-- 
2.2.0.rc0.207.ga3a616c
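
A quick way to reproduce and check the message (a sketch, assuming the
v1 memcg controller is mounted at /sys/fs/cgroup/memory):

$ mkdir /sys/fs/cgroup/memory/t
$ echo 100M > /sys/fs/cgroup/memory/t/memory.limit_in_bytes
$ (echo $BASHPID > /sys/fs/cgroup/memory/t/cgroup.procs && exec tail /dev/zero)
$ dmesg | grep 'Task in /t'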



[PATCH] memcg: add BUILD_BUG_ON() for string tables

2015-01-12 Thread Greg Thelen
Use BUILD_BUG_ON() to compile assert that memcg string tables are in
sync with corresponding enums.  There aren't currently any issues with
these tables.  This is just defensive.

Signed-off-by: Greg Thelen <gthe...@google.com>
---
 mm/memcontrol.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ef91e856c7e4..8d1ca6c55480 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3699,6 +3699,10 @@ static int memcg_stat_show(struct seq_file *m, void *v)
struct mem_cgroup *mi;
unsigned int i;
 
+   BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_stat_names) !=
+MEM_CGROUP_STAT_NSTATS);
+   BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_events_names) !=
+MEM_CGROUP_EVENTS_NSTATS);
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-- 
2.2.0.rc0.207.ga3a616c


Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd

2014-11-17 Thread Greg Thelen

On Mon, Nov 17 2014, Greg Thelen wrote:
[...]
> Given that bss and brk are nobits (i.e. only ALLOC) sections, does
> file_offset make sense as a load address?  This fails with gold:
>
> $ git checkout v3.18-rc5
> $ make  # with gold
> [...]
> ..bss and .brk lack common file offset
> ..bss and .brk lack common file offset
> ..bss and .brk lack common file offset
> ..bss and .brk lack common file offset
>   MKPIGGY arch/x86/boot/compressed/piggy.S
> Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size
> make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1
> make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [bzImage] Error 2
[...]

I just saw http://www.spinics.net/lists/kernel/msg1869328.html which
fixes things for me.  Sorry for the noise.


Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd

2014-11-17 Thread Greg Thelen
On Fri, Oct 31 2014, Junjie Mao wrote:

> When choosing a random address, the current implementation does not take
> into account the reserved space for .bss and .brk sections.  Thus the
> relocated kernel may overlap other components in memory.  Here is an
> example of the overlap from an x86_64 kernel in qemu (the ranges of
> physical addresses are presented):
>
>  Physical Address
>
> 0x0fe0  --++  <-- randomized base
>/  |  relocated kernel  |
>vmlinux.bin| (from vmlinux.bin) |
> 0x1336d000(an ELF file)   ++--
>\  ||  \
> 0x1376d870  --++   |
>   |relocs table|   |
> 0x13c1c2a8++   .bss and .brk
>   ||   |
> 0x13ce6000++   |
>   ||  /
> 0x13f77000|   initrd   |--
>   ||
> 0x13fef374++
>
> The initrd image will then be overwritten by the memset during early
> initialization:
>
> [1.655204] Unpacking initramfs...
> [1.662831] Initramfs unpacking failed: junk in compressed archive
>
> This patch prevents the above situation by requiring a larger space when
> looking for a random kernel base, so that the existing logic can
> effectively avoid the overlap.
>
> Fixes: 82fa9637a2 ("x86, kaslr: Select random position from e820 maps")
> Reported-by: Fengguang Wu <fengguang...@intel.com>
> Signed-off-by: Junjie Mao <eternal@gmail.com>
> [kees: switched to perl to avoid hex translation pain in mawk vs gawk]
> [kees: calculated overlap without relocs table]
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> Cc: sta...@vger.kernel.org
> ---
> This version updates the commit log only.
>
> Kees, please help review the documentation. Thanks!
>
> Best Regards
> Junjie Mao
[...]
> diff --git a/arch/x86/tools/calc_run_size.pl b/arch/x86/tools/calc_run_size.pl
> new file mode 100644
> index ..0b0b124d3ece
> --- /dev/null
> +++ b/arch/x86/tools/calc_run_size.pl
> @@ -0,0 +1,30 @@
> +#!/usr/bin/perl
> +#
> +# Calculate the amount of space needed to run the kernel, including room for
> +# the .bss and .brk sections.
> +#
> +# Usage:
> +# objdump -h a.out | perl calc_run_size.pl
> +use strict;
> +
> +my $mem_size = 0;
> +my $file_offset = 0;
> +
> +my $sections=" *[0-9]+ \.(?:bss|brk) +";
> +while (<>) {
> + if (/^$sections([0-9a-f]+) +(?:[0-9a-f]+ +){2}([0-9a-f]+)/) {
> + my $size = hex($1);
> + my $offset = hex($2);
> + $mem_size += $size;
> + if ($file_offset == 0) {
> + $file_offset = $offset;
> + } elsif ($file_offset != $offset) {
> + die ".bss and .brk lack common file offset\n";
> + }
> + }
> +}
> +
> +if ($file_offset == 0) {
> + die "Never found .bss or .brk file offset\n";
> +}
> +printf("%d\n", $mem_size + $file_offset);

Given that bss and brk are nobits (i.e. only ALLOC) sections, does
file_offset make sense as a load address?  This fails with gold:

$ git checkout v3.18-rc5
$ make  # with gold
[...]
..bss and .brk lack common file offset
..bss and .brk lack common file offset
..bss and .brk lack common file offset
..bss and .brk lack common file offset
  MKPIGGY arch/x86/boot/compressed/piggy.S
Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size
make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1
make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
make: *** [bzImage] Error 2

In ld.bfd brk/bss file_offsets match, but they differ with ld.gold:

$ objdump -h vmlinux.ld
[...]
  0 .text 00818bb3  8100  0100  0020  2**12
  CONTENTS, ALLOC, LOAD, READONLY, CODE
[...]
 26 .bss  000e  81fe8000  01fe8000  013e8000  2**12
  ALLOC
 27 .brk  00026000  820c8000  020c8000  013e8000  2**0
  ALLOC

$ objdump -h vmlinux.ld | perl arch/x86/tools/calc_run_size.pl
21946368
# aka 0x14ee000


$ objdump -h vmlinux.gold
[...]
  0 .text 00818bb3  8100  0100  1000  2**12
  CONTENTS, ALLOC, LOAD, READONLY, CODE
[...]
 26 .bss  000e  81feb000  01feb000  00e9  2**12
  ALLOC
 27 .brk  00026000  820cb000  020cb000  00f7  2**0
  ALLOC
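
For reference, a quick way to pull out just the fields the script keys on
(a sketch; the awk column numbers assume the objdump -h layout shown above):

$ objdump -h vmlinux.gold | awk '$2 ~ /^\.(bss|brk)$/ { print $2, "size=" $3, "fileoff=" $6 }'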

Re: [patch] mm: memcontrol: support transparent huge pages under pressure

2014-09-23 Thread Greg Thelen

On Tue, Sep 23 2014, Johannes Weiner wrote:

> On Mon, Sep 22, 2014 at 10:52:50PM -0700, Greg Thelen wrote:
>> 
>> On Fri, Sep 19 2014, Johannes Weiner wrote:
>> 
>> > In a memcg with even just moderate cache pressure, success rates for
>> > transparent huge page allocations drop to zero, wasting a lot of
>> > effort that the allocator puts into assembling these pages.
>> >
>> > The reason for this is that the memcg reclaim code was never designed
>> > for higher-order charges.  It reclaims in small batches until there is
>> > room for at least one page.  Huge pages charges only succeed when
>> > these batches add up over a series of huge faults, which is unlikely
>> > under any significant load involving order-0 allocations in the group.
>> >
>> > Remove that loop on the memcg side in favor of passing the actual
>> > reclaim goal to direct reclaim, which is already set up and optimized
>> > to meet higher-order goals efficiently.
>> >
>> > This brings memcg's THP policy in line with the system policy: if the
>> > allocator painstakingly assembles a hugepage, memcg will at least make
>> > an honest effort to charge it.  As a result, transparent hugepage
>> > allocation rates amid cache activity are drastically improved:
>> >
>> >   vanilla patched
>> > pgalloc 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
>> > pgfault  491370.60 (  +0.00%)225477.40 ( -54.11%)
>> > pgmajfault2.00 (  +0.00%) 1.80 (  -6.67%)
>> > thp_fault_alloc   0.00 (  +0.00%)   531.60 (+100.00%)
>> > thp_fault_fallback  749.00 (  +0.00%)   217.40 ( -70.88%)
>> >
>> > [ Note: this may in turn increase memory consumption from internal
>> >   fragmentation, which is an inherent risk of transparent hugepages.
>> >   Some setups may have to adjust the memcg limits accordingly to
>> >   accomodate this - or, if the machine is already packed to capacity,
>> >   disable the transparent huge page feature. ]
>> 
>> We're using an earlier version of this patch, so I approve of the
>> general direction.  But I have some feedback.
>> 
>> The memsw aspect of this change seems somewhat separate.  Can it be
>> split into a different patch?
>> 
>> The memsw aspect of this patch seems to change behavior.  Is this
>> intended?  If so, a mention of it in the commit log would assuage the
>> reader.  I'll explain...  Assume a machine with swap enabled and
>> res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
>> is that memsw.usage represents sum(ram_usage, swap_usage).  So when
>> memsw_is_minimum=true, then both swap_usage=0 and
>> memsw.usage==res.usage.  In this condition, if res usage is at limit
>> then there's no point in swapping because memsw.usage is already
>> maximal.  Prior to this patch I think the kernel did the right thing,
>> but not afterwards.
>> 
>> Before this patch:
>>   if res.usage == res.limit, try_charge() indirectly calls
>>   try_to_free_mem_cgroup_pages(noswap=true)
>> 
>> After this patch:
>>   if res.usage == res.limit, try_charge() calls
>>   try_to_free_mem_cgroup_pages(may_swap=true)
>> 
>> Notice the inverted swap-is-allowed value.
>
> For some reason I had myself convinced that this is dead code due to a
> change in callsites a long time ago, but you are right that currently
> try_charge() relies on it, thanks for pointing it out.
>
> However, memsw is always equal to or bigger than the memory limit - so
> instead of keeping a separate state variable to track when memory
> failure implies memsw failure, couldn't we just charge memsw first?
>
> How about the following?  But yeah, I'd split this into a separate
> patch now.

Looks good to me.  Thanks.

Acked-by: Greg Thelen <gthe...@google.com>

> ---
>  mm/memcontrol.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..7c9a8971d0f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2497,16 +2497,17 @@ retry:
>   goto done;
>  
>   size = batch * PAGE_SIZE;
> - if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> - if (!do_swap_account)
> + if (!do_swap_account ||
> +     !res_counter_charge(&memcg->memsw, size, &fail_res)) {
> + if (!res_counter_charge(&memcg->res, size, &fail_res))
>   goto done_restock;
> - if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> - goto done_restock;
> - res_counter_uncharge(&memcg->res, size);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, size);
> + mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + } else {
>   mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
>   may_swap = false;
> - } else
> - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + }
>  
>   if (batch > nr_pages) {
>   batch = nr_pages;

Re: [patch] mm: memcontrol: support transparent huge pages under pressure

2014-09-23 Thread Greg Thelen

On Fri, Sep 19 2014, Johannes Weiner wrote:

> In a memcg with even just moderate cache pressure, success rates for
> transparent huge page allocations drop to zero, wasting a lot of
> effort that the allocator puts into assembling these pages.
>
> The reason for this is that the memcg reclaim code was never designed
> for higher-order charges.  It reclaims in small batches until there is
> room for at least one page.  Huge pages charges only succeed when
> these batches add up over a series of huge faults, which is unlikely
> under any significant load involving order-0 allocations in the group.
>
> Remove that loop on the memcg side in favor of passing the actual
> reclaim goal to direct reclaim, which is already set up and optimized
> to meet higher-order goals efficiently.
>
> This brings memcg's THP policy in line with the system policy: if the
> allocator painstakingly assembles a hugepage, memcg will at least make
> an honest effort to charge it.  As a result, transparent hugepage
> allocation rates amid cache activity are drastically improved:
>
>   vanilla patched
> pgalloc 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
> pgfault  491370.60 (  +0.00%)225477.40 ( -54.11%)
> pgmajfault2.00 (  +0.00%) 1.80 (  -6.67%)
> thp_fault_alloc   0.00 (  +0.00%)   531.60 (+100.00%)
> thp_fault_fallback  749.00 (  +0.00%)   217.40 ( -70.88%)
>
> [ Note: this may in turn increase memory consumption from internal
>   fragmentation, which is an inherent risk of transparent hugepages.
>   Some setups may have to adjust the memcg limits accordingly to
>   accomodate this - or, if the machine is already packed to capacity,
>   disable the transparent huge page feature. ]

We're using an earlier version of this patch, so I approve of the
general direction.  But I have some feedback.

The memsw aspect of this change seems somewhat separate.  Can it be
split into a different patch?

The memsw aspect of this patch seems to change behavior.  Is this
intended?  If so, a mention of it in the commit log would assuage the
reader.  I'll explain...  Assume a machine with swap enabled and
res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
is that memsw.usage represents sum(ram_usage, swap_usage).  So when
memsw_is_minimum=true, then both swap_usage=0 and
memsw.usage==res.usage.  In this condition, if res usage is at limit
then there's no point in swapping because memsw.usage is already
maximal.  Prior to this patch I think the kernel did the right thing,
but not afterwards.

Before this patch:
  if res.usage == res.limit, try_charge() indirectly calls
  try_to_free_mem_cgroup_pages(noswap=true)

After this patch:
  if res.usage == res.limit, try_charge() calls
  try_to_free_mem_cgroup_pages(may_swap=true)

Notice the inverted swap-is-allowed value.
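
(For concreteness, the memsw_is_minimum condition above corresponds to a
configuration like this sketch, assuming the v1 memcg controller is
mounted at /sys/fs/cgroup/memory and swap accounting is enabled:)

$ mkdir /sys/fs/cgroup/memory/example && cd /sys/fs/cgroup/memory/example
$ echo 1G > memory.limit_in_bytes
$ echo 1G > memory.memsw.limit_in_bytes  # res.limit == memsw.limit, so swap cannot help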

I haven't had time to look at your other outstanding memcg patches.
These comments were made with this patch in isolation.

> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
>  include/linux/swap.h |  6 ++--
>  mm/memcontrol.c  | 86 +++-
>  mm/vmscan.c  |  7 +++--
>  3 files changed, 25 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ea4f926e6b9b..37a585beef5c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> -   gfp_t gfp_mask, bool noswap);
> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> +   unsigned long nr_pages,
> +   gfp_t gfp_mask,
> +   bool may_swap);
>  extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>   gfp_t gfp_mask, bool noswap,
>   struct zone *zone,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9431024e490c..e2def11f1ec1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -315,9 +315,6 @@ struct mem_cgroup {
>   /* OOM-Killer disable */
>   int oom_kill_disable;
>  
> - /* set when res.limit == memsw.limit */
> - boolmemsw_is_minimum;
> -
>   /* protect arrays of thresholds */
>   struct mutex thresholds_lock;
>  
> @@ -481,14 +478,6 @@ enum res_type {
>  #define OOM_CONTROL  (0)
>  
>  /*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim


Re: [RFC] memory cgroup: weak points of kmem accounting design

2014-09-18 Thread Greg Thelen

On Tue, Sep 16 2014, Vladimir Davydov wrote:

> Hi Suleiman,
>
> On Mon, Sep 15, 2014 at 12:13:33PM -0700, Suleiman Souhlal wrote:
>> On Mon, Sep 15, 2014 at 3:44 AM, Vladimir Davydov
>> <vdavy...@parallels.com> wrote:
>> > Hi,
>> >
>> > I'd like to discuss downsides of the kmem accounting part of the memory
>> > cgroup controller and a possible way to fix them. I'd really appreciate
>> > if you could share your thoughts on it.
>> >
>> > The idea lying behind the kmem accounting design is to provide each
>> > memory cgroup with its private copy of every kmem_cache and list_lru
>> > it's going to use. This is implemented by bundling these structures with
>> > arrays storing per-memcg copies. The arrays are referenced by css id.
>> > When a process in a cgroup tries to allocate an object from a kmem cache
>> > we first find out which cgroup the process resides in, then look up the
>> > cache copy corresponding to the cgroup, and finally allocate a new
>> > object from the private cache. Similarly, on addition/deletion of an
>> > object from a list_lru, we first obtain the kmem cache the object was
>> > allocated from, then look up the memory cgroup which the cache belongs
>> > to, and finally add/remove the object from the private copy of the
>> > list_lru corresponding to the cgroup.
>> >
>> > Though simple it looks from the first glance, it has a number of serious
>> > weaknesses:
>> >
>> >  - Providing each memory cgroup with its own kmem cache increases
>> >external fragmentation.
>> 
>> I haven't seen any evidence of this being a problem (but that doesn't
>> mean it doesn't exist).
>
> Actually, it's rather speculative. For example, if we have, say, a
> hundred extra objects per cache (fragmented or in per-cpu stocks) of
> size 256 bytes, then for one cache the overhead would be 25K, which is
> negligible. Now if there are a thousand cgroups using the cache, we have
> to pay 25M, which is noticeable. Anyway, to estimate this exactly, one
> needs to run a typical workload inside a cgroup.
>
>>
>> >  - SLAB isn't ready to deal with thousands of caches: its algorithm
>> >walks over all system caches and shrinks them periodically, which may
>> >be really costly if we have thousands active memory cgroups.
>> 
>> This could be throttled.
>
> It could be, but then we'd have more objects in per-cpu stocks, which
> means more memory overhead.
>
>> 
>> >
>> >  - Caches may now be created/destroyed frequently and from various
>> >places: on system cache destruction, on cgroup offline, from a work
>> >struct scheduled by kmalloc. Synchronizing them properly is really
>> >difficult. I've fixed some places, but it's still desperately buggy.
>> 
>> Agreed.
>> 
>> >  - It's hard to determine when we should destroy a cache that belongs to
>> >a dead memory cgroup. The point is both SLAB and SLUB implementations
>> >always keep some pages in stock for performance reasons, so just
>> >scheduling cache destruction work from kfree once the last slab page
>> >is freed isn't enough - it will normally never happen for SLUB and
>> >may take really long for SLAB. Of course, we can forbid SL[AU]B
>> >algorithm to stock pages in dead caches, but it looks ugly and has
>> >negative impact on performance (I did this, but finally decided to
>> >revert). Another approach could be scanning dead caches periodically
>> >or on memory pressure, but that would be ugly too.
>> 
>> Not sure about slub, but for SLAB doesn't cache_reap take care of that?
>
> It is, but it takes some time. If we decide to throttle it, then it'll
> take even longer. Anyway, SLUB has nothing like that, therefore we'd
> have to handle different algorithms in different ways, which I
> particularly dislike.
>
>> 
>> >
>> >  - The arrays for storing per-memcg copies can get really large,
>> >especially if we finally decide to leave dead memory cgroups hanging
>> >until memory pressure reaps objects assigned to them and let them
>> >free. How can we deal with an array of, say, 20K elements? Simply
>> >allocating them with kmal^W vmalloc will result in memory wastes. It
>> >will be particularly funny if the user wants to provide each cgroup
>> >with a separate mount point: each super block will have a list_lru
>> >for every memory cgroup, but only one of them will be really used.
>> >That said we need a kind of dynamic reclaimable arrays. Radix trees
>> >would fit, but they are way slower than plain arrays, which is a
>> >no-go, because we want to look up on each kmalloc, list_lru_add/del,
>> >which are fast paths.
>> 
>> The initial design we had was to have an array indexed by "cache id"
>> in struct memcg, instead of the current array indexed by "css id" in
>> struct kmem_cache.
>> The initial design doesn't have the problem you're describing here, as
>> far as I can tell.
>
> It is indexed by "cache id", not "css id", but it doesn't matter
> actually. Suppose, when a cgroup is taken offline it still has kmem
> objects accounted to it. Then we have to keep its cache id along with
> the caches hosting the objects until all the objects are freed. All
> caches and, what is worse, list_lru's will have to 


Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests

2014-08-07 Thread Greg Thelen

On Thu, Aug 07 2014, Johannes Weiner wrote:

> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
>> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
>> > Instead of passing the request size to direct reclaim, memcg just
>> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
>> > charge can succeed.  That potentially wastes scan progress when huge
>> > page allocations require multiple invocations, which always have to
>> > restart from the default scan priority.
>> > 
>> > Pass the request size as a reclaim target to direct reclaim and leave
>> > it to that code to reach the goal.
>> 
>> THP charge then will ask for 512 pages to be (direct) reclaimed. That
>> is _a lot_ and I would expect long stalls to achieve this target. I
>> would also expect quick priority drop down and potential over-reclaim
>> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
>> would need to drop down below DEF_PRIORITY-2 to have a chance to scan
>> that many pages). All that done for a charge which can fall back to a
>> single page charge.
>> 
>> The current code is quite hostile to THP when we are close to the limit
>> but solving this by introducing long stalls instead doesn't sound like a
>> proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
> system time is reduced with the high limit.
>
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.
>
> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

I think that high-order non-THP page allocations also benefit from this.
Such allocations don't have a small-page fallback.

This may be in flux, but linux-next shows me that:
* mem_cgroup_reclaim()
  frees at least SWAP_CLUSTER_MAX (32) pages.
* try_charge() calls mem_cgroup_reclaim() indefinitely for
  costly (3) or smaller orders assuming that something is reclaimed on
  each iteration.
* try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
  larger-than-costly orders.

So for larger-than-costly allocations, try_charge() should be able to
reclaim 160 (5*32) pages which satisfies an order:7 allocation.  But for
order:8+ allocations try_charge() and mem_cgroup_reclaim() are too eager
to give up without something like this.  So I think this patch is a step
in the right direction.
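
A stand-alone sketch of that arithmetic; the constants mirror the ones quoted above from linux-next, but the loop is a simplification for illustration, not the actual try_charge() code:

#include <stdio.h>

#define SWAP_CLUSTER_MAX                32      /* min pages freed per reclaim call */
#define MEM_CGROUP_RECLAIM_RETRIES      5       /* retries for larger-than-costly orders */
#define PAGE_ALLOC_COSTLY_ORDER         3

/* Illustrative only: can a bounded retry loop that frees SWAP_CLUSTER_MAX
 * pages per iteration ever cover a (1 << order) page charge? */
static int charge_can_succeed(int order)
{
        unsigned long need = 1UL << order;
        unsigned long reclaimed = 0;
        int retries = MEM_CGROUP_RECLAIM_RETRIES;

        if (order <= PAGE_ALLOC_COSTLY_ORDER)
                return 1;       /* retried indefinitely while progress is made */

        while (retries--) {
                reclaimed += SWAP_CLUSTER_MAX;
                if (reclaimed >= need)
                        return 1;
        }
        return 0;               /* gives up: 5 * 32 = 160 < 256 for order 8 */
}

int main(void)
{
        for (int order = 4; order <= 9; order++)
                printf("order %d (%lu pages): %s\n", order, 1UL << order,
                       charge_can_succeed(order) ? "covered" : "gives up");
        return 0;
}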

Coincidentally, we've recently been experimenting with something
like this.  Though we didn't modify the interface between
mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
looped within mem_cgroup_reclaim() until nr_pages of margin were found.
But I have no objection to the proposed plumbing of nr_pages all the way
into try_to_free_mem_cgroup_pages().


[PATCH] dm bufio: fully initialize shrinker

2014-07-31 Thread Greg Thelen
1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to
struct shrinker assuming that all shrinkers were zero filled.  The dm
bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data
in flags.  So far the only defined flags bit is SHRINKER_NUMA_AWARE.
But there are proposed patches which add other bits to shrinker.flags
(e.g. memcg awareness).

Rather than simply initializing the shrinker, this patch uses kzalloc()
when allocating the dm_bufio_client to ensure that the embedded shrinker
and any other similar structures are zeroed.

This fixes theoretically over-aggressive shrinking of dm bufio objects.
If the uninitialized dm_bufio_client.shrinker.flags contains
SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for
each numa node rather than just once.  This has been broken since 3.12.

Signed-off-by: Greg Thelen 
---
 drivers/md/dm-bufio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 4e84095833db..d724459860d9 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1541,7 +1541,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
BUG_ON(block_size < 1 << SECTOR_SHIFT ||
   (block_size & (block_size - 1)));
 
-   c = kmalloc(sizeof(*c), GFP_KERNEL);
+   c = kzalloc(sizeof(*c), GFP_KERNEL);
if (!c) {
r = -ENOMEM;
goto bad_client;
-- 
2.0.0.526.g5318336
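
For what it's worth, here is a userspace analogue of the bug class fixed above, with malloc() standing in for kmalloc() and calloc() for kzalloc(); the structures are invented for illustration. Only the zero-filling allocator guarantees that a flags field nobody explicitly sets reads as 0.

#include <stdio.h>
#include <stdlib.h>

/* Invented stand-ins for a client with an embedded shrinker-like struct. */
struct fake_shrinker { unsigned long flags; };
struct fake_client   { struct fake_shrinker shrinker; /* ... other fields ... */ };

int main(void)
{
        /* Like kmalloc(): memory is uninitialized, so shrinker.flags holds
         * whatever bytes happen to be there unless every field is set by hand. */
        struct fake_client *c1 = malloc(sizeof(*c1));

        /* Like kzalloc(): the whole object is zero-filled, so an untouched
         * flags field is guaranteed to be 0. */
        struct fake_client *c2 = calloc(1, sizeof(*c2));

        if (c2)
                printf("zero-filled flags: %lu\n", c2->shrinker.flags);

        free(c1);
        free(c2);
        return 0;
}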



Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

2014-07-09 Thread Greg Thelen
On Wed, Jul 9, 2014 at 9:36 AM, Vladimir Davydov  wrote:
> Hi Tim,
>
> On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote:
>> How is this different from RLIMIT_AS?  You specifically mentioned it
>> earlier but you don't explain how this is different.
>
> The main difference is that RLIMIT_AS is per process while this
> controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for
> a group of unrelated processes or processes cooperating through shmem.
>
> Also RLIMIT_AS accounts for total VM usage (including file mappings),
> while this only charges private writable and shared mappings, whose
> faulted-in pages always occupy mem+swap and therefore cannot be just
> synced and dropped like file pages. In other words, this controller
> works exactly as the global overcommit control.
>
>> From my perspective, this is pointless.  There's plenty of perfectly
>> correct software that mmaps files without concern for VSIZE, because
>> they never fault most of those pages in.
>
> But there's also software that correctly handles ENOMEM returned by
> mmap. For example, mongodb keeps growing its buffers until mmap fails.
> Therefore, if there's no overcommit control, it will be OOM-killed
> sooner or later, which may be pretty annoying. And we did have customers
> complaining about that.

Is mongodb's buffer growth causing the oom kills?

If yes, I wonder whether apps like mongodb that want ENOMEM should (1)
use MAP_POPULATE, and (2) whether we should change vm_map_pgoff() to
propagate mm_populate() ENOMEM failures back to mmap()?
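
For reference, a minimal sketch of what (1) would look like from userspace; note that today a populate failure is not reported through mmap()'s return value, which is exactly what (2) proposes to change:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;   /* 64MB of private anonymous memory */

        /* MAP_POPULATE asks the kernel to fault all pages in up front. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (p == MAP_FAILED) {
                /* With (2) in place, a failed populate would show up here
                 * as ENOMEM instead of being silently ignored. */
                perror("mmap");
                return 1;
        }
        munmap(p, len);
        return 0;
}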

>> From my observations it is not generally possible to predict an
>> average VSIZE limit that would satisfy your concerns *and* not kill
>> lots of valid apps.
>
> Yes, it's difficult. Actually, we can only guess. Nevertheless, we
> predict and set the VSIZE limit system-wide by default.
>
>> It sounds like what you want is to limit or even disable swap usage.
>
> I want to avoid OOM kill if it's possible to return ENOMEM. OOM can be
> painful. It can kill lots of innocent processes. Of course, the user can
> protect some processes by setting oom_score_adj, but this is difficult
> and requires time and expertise, so an average user won't do that.
>
>> Given your example, your hypothetical user would probably be better off
>> getting an OOM kill early so she can fix her job spec to request more
>> memory.
>
> In my example the user won't get OOM kill *early*...


[PATCH] memcg: remove lookup_cgroup_page() prototype

2014-06-19 Thread Greg Thelen
6b208e3f6e35 ("mm: memcg: remove unused node/section info from
pc->flags") deleted the lookup_cgroup_page() function but left a
prototype for it.

Kill the vestigial prototype.

Signed-off-by: Greg Thelen 
---
 include/linux/page_cgroup.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..0ff470de3c12 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -42,7 +42,6 @@ static inline void __init page_cgroup_init(void)
 #endif
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
-struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
 #define TESTPCGFLAG(uname, lname)  \
 static inline int PageCgroup##uname(struct page_cgroup *pc)\
-- 
2.0.0.526.g5318336



Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim

2014-06-10 Thread Greg Thelen

On Tue, Jun 10 2014, Johannes Weiner  wrote:

> On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote:
>> 
>> On Fri, Jun 06 2014, Michal Hocko  wrote:
>> 
>> > Some users (e.g. Google) would like to have stronger semantic than low
>> > limit offers currently. The fallback mode is not desirable and they
>> > prefer hitting OOM killer rather than ignoring low limit for protected
>> > groups. There are other possible usecases which can benefit from hard
>> > guarantees. I can imagine workloads where setting low_limit to the same
>> > value as hard_limit to prevent from any reclaim at all makes a lot of
>> > sense because reclaim is much more disrupting than restart of the load.
>> >
>> > This patch adds a new per memcg memory.reclaim_strategy knob which
>> > tells what to do in a situation when memory reclaim cannot do any
>> > progress because all groups in the reclaimed hierarchy are within their
>> > low_limit. There are two options available:
>> >- low_limit_best_effort - the current mode when reclaim falls
>> >  back to the even reclaim of all groups in the reclaimed
>> >  hierarchy
>> >- low_limit_guarantee - groups within low_limit are never
>> >  reclaimed and OOM killer is triggered instead. OOM message
>> >  will mention the fact that the OOM was triggered due to
>> >  low_limit reclaim protection.
>> 
>> To (a) be consistent with existing hard and soft limits APIs and (b)
>> allow use of both best effort and guarantee memory limits, I wonder if
>> it's best to offer three per memcg limits, rather than two limits (hard,
>> low_limit) and a related reclaim_strategy knob.  The three limits I'm
>> thinking about are:
>> 
>> 1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
>>change needed here.  This is an upper bound on a memcg hierarchy's
>>memory consumption (assuming use_hierarchy=1).
>
> This creates internal pressure.  Outside reclaim is not affected by
> it, but internal charges can not exceed this limit.  This is set to
> hard limit the maximum memory consumption of a group (max).
>
>> 2) best_effort_limit (aka desired working set).  This allows an
>>application or administrator to provide a hint to the kernel about
>>desired working set size.  Before oom'ing the kernel is allowed to
>>reclaim below this limit.  I think the current soft_limit_in_bytes
>>claims to provide this.  If we prefer to deprecate
>>soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
>>hopefully better named) API seems reasonable.
>
> This controls how external pressure applies to the group.
>
> But it's conceivable that we'd like to have the equivalent of such a
> soft limit for *internal* pressure.  Set below the hard limit, this
> internal soft limit would have charges trigger direct reclaim in the
> memcg but allow them to continue to the hard limit.  This would create
> a situation wherein the allocating tasks are not killed, but throttled
> under reclaim, which gives the administrator a window to detect the
> situation with vmpressure and possibly intervene.  Because as it
> stands, once the current hard limit is hit things can go down pretty
> fast and the window for reacting to vmpressure readings is often too
> small.  This would offer a more gradual deterioration.  It would be
> set to the upper end of the working set size range (high).
>
> I think for many users such an internal soft limit would actually be
> preferred over the current hard limit, as they'd rather have some
> reclaim throttling than an OOM kill when the group reaches its upper
> bound.  The current hard limit would be reserved for more advanced or
> paid cases, where the admin would rather see a memcg get OOM killed
> than exceed a certain size.
>
> Then, as you proposed, we'd have the soft limit for external pressure,
> where the kernel only reclaims groups within that limit in order to
> avoid OOM kills.  It would be set to the estimated lower end of the
> working set size range (low).
>
>> 3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
>>would prefer to be oom killed rather than operate below this
>>threshold.  Default value is zero to preserve compatibility with
>>existing apps.
>
> And this would be the external pressure hard limit, which would be set
> to the absolute minimum requirement of the group (min).
>
> Either because it would be hopelessly thrashing without it, or because
> this guaranteed memory is actually paid for.  Again, I would expect
> many users to not even set this minimum guarantee but solely use the
> external soft limit (low) instead.
>
>> Logically hard_limit >= best_effort_limit >= low_limit_guarantee.
>
> max >= high >= low >= min
>
> I think we should be able to express all desired usecases with these
> four limits, including the advanced configurations, while making it
> easy for many users to set up groups without being a) dead certain
> about their memory consumption or b) prepared for frequent OOM kills,
> while still allowing them to properly utilize their machines.
>
> What do you think?

Sounds good

Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim

2014-06-09 Thread Greg Thelen

On Fri, Jun 06 2014, Michal Hocko  wrote:

> Some users (e.g. Google) would like to have stronger semantic than low
> limit offers currently. The fallback mode is not desirable and they
> prefer hitting OOM killer rather than ignoring low limit for protected
> groups. There are other possible usecases which can benefit from hard
> guarantees. I can imagine workloads where setting low_limit to the same
> value as hard_limit to prevent from any reclaim at all makes a lot of
> sense because reclaim is much more disrupting than restart of the load.
>
> This patch adds a new per memcg memory.reclaim_strategy knob which
> tells what to do in a situation when memory reclaim cannot do any
> progress because all groups in the reclaimed hierarchy are within their
> low_limit. There are two options available:
>   - low_limit_best_effort - the current mode when reclaim falls
> back to the even reclaim of all groups in the reclaimed
> hierarchy
>   - low_limit_guarantee - groups within low_limit are never
> reclaimed and OOM killer is triggered instead. OOM message
> will mention the fact that the OOM was triggered due to
> low_limit reclaim protection.

To (a) be consistent with existing hard and soft limits APIs and (b)
allow use of both best effort and guarantee memory limits, I wonder if
it's best to offer three per memcg limits, rather than two limits (hard,
low_limit) and a related reclaim_strategy knob.  The three limits I'm
thinking about are:

1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
   change needed here.  This is an upper bound on a memcg hierarchy's
   memory consumption (assuming use_hierarchy=1).

2) best_effort_limit (aka desired working set).  This allows an
   application or administrator to provide a hint to the kernel about
   desired working set size.  Before oom'ing the kernel is allowed to
   reclaim below this limit.  I think the current soft_limit_in_bytes
   claims to provide this.  If we prefer to deprecate
   soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
   hopefully better named) API seems reasonable.

3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
   would prefer to be oom killed rather than operate below this
   threshold.  Default value is zero to preserve compatibility with
   existing apps.

Logically hard_limit >= best_effort_limit >= low_limit_guarantee.
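
A tiny sketch of that ordering constraint, using hypothetical field names for the three proposed limits; these do not correspond to actual cgroupfs files:

#include <stdbool.h>

/* Hypothetical per-memcg configuration for the three limits proposed above. */
struct memcg_limits {
        unsigned long hard_limit;          /* upper bound on usage */
        unsigned long best_effort_limit;   /* desired working set hint */
        unsigned long low_limit_guarantee; /* reclaim-protected floor */
};

/* A configuration only makes sense when
 * hard_limit >= best_effort_limit >= low_limit_guarantee. */
static bool limits_valid(const struct memcg_limits *l)
{
        return l->hard_limit >= l->best_effort_limit &&
               l->best_effort_limit >= l->low_limit_guarantee;
}

int main(void)
{
        struct memcg_limits l = {
                .hard_limit = 8192,          /* MB */
                .best_effort_limit = 6144,   /* MB */
                .low_limit_guarantee = 2048, /* MB */
        };
        return limits_valid(&l) ? 0 : 1;
}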


Re: [PATCH v2 0/4] memcg: Low-limit reclaim

2014-05-28 Thread Greg Thelen

On Wed, May 28 2014, Johannes Weiner  wrote:

> On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
>> On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
>> > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
>> > > Hi Andrew, Johannes,
>> > > 
>> > > On Mon 28-04-14 14:26:41, Michal Hocko wrote:
>> > > > This patchset introduces such low limit that is functionally similar
>> > > > to a minimum guarantee. Memcgs which are under their lowlimit are not
>> > > > considered eligible for the reclaim (both global and hardlimit) unless
>> > > > all groups under the reclaimed hierarchy are below the low limit when
>> > > > all of them are considered eligible.
>> > > > 
>> > > > The previous version of the patchset posted as a RFC
>> > > > (http://marc.info/?l=linux-mm&m=138677140628677&w=2) suggested a
>> > > > hard guarantee without any fallback. More discussions led me to
>> > > > reconsidering the default behavior and come up a more relaxed one. The
>> > > > hard requirement can be added later based on a use case which really
>> > > > requires. It would be controlled by memory.reclaim_flags knob which
>> > > > would specify whether to OOM or fallback (default) when all groups are
>> > > > bellow low limit.
>> > > 
>> > > It seems that we are not in a full agreement about the default behavior
>> > > yet. Johannes seems to be more for hard guarantee while I would like to
>> > > see the weaker approach first and move to the stronger model later.
>> > > Johannes, is this absolutely no-go for you? Do you think it is seriously
>> > > handicapping the semantic of the new knob?
>> > 
>> > Well we certainly can't start OOMing where we previously didn't,
>> > that's called a regression and automatically limits our options.
>> > 
>> > Any unexpected OOMs will be much more acceptable from a new feature
>> > than from configuration that previously "worked" and then stopped.
>> 
>> Yes and we are not talking about regressions, are we?
>> 
>> > > My main motivation for the weaker model is that it is hard to see all
>> > > the corner cases right now and once we hit them I would like to see a
>> > > graceful fallback rather than fatal action like OOM killer. Besides that
>> > > the usecases I am mostly interested in are OK with fallback when the
>> > > alternative would be OOM killer. I also feel that introducing a knob
>> > > with a weaker semantic which can be made stronger later is a sensible
>> > > way to go.
>> > 
>> > We can't make it stronger, but we can make it weaker. 
>> 
>> Why cannot we make it stronger by a knob/configuration option?
>
> Why can't we make it weaker by a knob?  Why should we design the
> default for unforeseeable cornercases rather than make the default
> make sense for existing cases and give cornercases a fallback once
> they show up?

My 2c...  The following works for my use cases:
1) introduce memory.low_limit_in_bytes (default=0 thus no default change
   from older kernels)
2) interested users will set low_limit_in_bytes to non-zero value.
   Memory protected by low limit should be as migratable/reclaimable as
   mlock memory.  If a zone full of mlock memory causes oom kills, then
   so should the low limit.

If we find corner cases where low_limit_in_bytes is too strict, then we
could discuss a new knob to relax it.  But I think we should start with
a strict low-limit.  If the oom killer gets tied in knots due to low
limit, then I'd like to explore fixing the oom killer before relaxing
low limit.

Disclaimer: new use cases will certainly appear with various
requirements.  But an oom-killing low_limit_in_bytes seems like a
generic opt-in feature, so I think it's worthwhile.


Re: [PATCH] memcg: deprecate memory.force_empty knob

2014-05-16 Thread Greg Thelen

On Tue, May 13 2014, Michal Hocko  wrote:

> force_empty has been introduced primarily to drop memory before it gets
> reparented on the group removal. This alone doesn't sound fully
> justified because reparented pages which are not in use can be reclaimed
> also later when there is a memory pressure on the parent level.
>
> Mark the knob CFTYPE_INSANE which tells the cgroup core that it
> shouldn't create the knob with the experimental sane_behavior. Other
> users will get informed about the deprecation and asked to tell us more
> because I do not expect most users will use sane_behavior cgroups mode
> very soon.
> Anyway I expect that most users will be simply cgroup remove handlers
> which do that since ever without having any good reason for it.
>
> If somebody really cares because reparented pages, which would be
> dropped otherwise, push out more important ones then we should fix the
> reparenting code and put pages to the tail.

I should mention a case where I've needed to use memory.force_empty: to
synchronously flush stats from child to parent.  Without force_empty
memory.stat is temporarily inconsistent until async css_offline
reparents charges.  Here is an example on v3.14 showing that
parent/memory.stat contents are in-flux immediately after rmdir of
parent/child.

$ cat /test
#!/bin/bash

# Create parent and child.  Add some non-reclaimable anon rss to child,
# then move running task to parent.
mkdir p p/c
(echo $BASHPID > p/c/cgroup.procs && exec sleep 1d) &
pid=$!
sleep 1
echo $pid > p/cgroup.procs 

grep 'rss ' {p,p/c}/memory.stat
if [[ $1 == force ]]; then
  echo 1 > p/c/memory.force_empty
fi
rmdir p/c

echo 'For a small time the p/c memory has not been reparented to p.'
grep 'rss ' {p,p/c}/memory.stat

sleep 1
echo 'After waiting all memory has been reparented'
grep 'rss ' {p,p/c}/memory.stat

kill $pid
rmdir p


-- First, demonstrate that just rmdir, without memory.force_empty,
   temporarily hides reparented child memory stats.

$ /test
p/memory.stat:rss 0
p/memory.stat:total_rss 69632
p/c/memory.stat:rss 69632
p/c/memory.stat:total_rss 69632
For a small time the p/c memory has not been reparented to p.
p/memory.stat:rss 0
p/memory.stat:total_rss 0
grep: p/c/memory.stat: No such file or directory
After waiting all memory has been reparented
p/memory.stat:rss 69632
p/memory.stat:total_rss 69632
grep: p/c/memory.stat: No such file or directory
/test: Terminated  ( echo $BASHPID > p/c/cgroup.procs && exec sleep 
1d )

-- Demonstrate that using memory.force_empty before rmdir, behaves more
   sensibly.  Stats for reparented child memory are not hidden.

$ /test force
p/memory.stat:rss 0
p/memory.stat:total_rss 69632
p/c/memory.stat:rss 69632
p/c/memory.stat:total_rss 69632
For a small time the p/c memory has not been reparented to p.
p/memory.stat:rss 69632
p/memory.stat:total_rss 69632
grep: p/c/memory.stat: No such file or directory
After waiting all memory has been reparented
p/memory.stat:rss 69632
p/memory.stat:total_rss 69632
grep: p/c/memory.stat: No such file or directory
/test: Terminated  ( echo $BASHPID > p/c/cgroup.procs && exec sleep 
1d )


Re: [patch v3 2/6] mm, compaction: return failed migration target pages back to freelist

2014-05-07 Thread Greg Thelen

On Wed, May 07 2014, Andrew Morton  wrote:

> On Tue, 6 May 2014 19:22:43 -0700 (PDT) David Rientjes  wrote:
>
>> Memory compaction works by having a "freeing scanner" scan from one end
>> of a zone which isolates pages as migration targets while another
>> "migrating scanner" scans from the other end of the same zone which
>> isolates pages for migration.
>> 
>> When page migration fails for an isolated page, the target page is
>> returned to the system rather than the freelist built by the freeing
>> scanner.  This may require the freeing scanner to continue scanning
>> memory after suitable migration targets have already been returned to
>> the system needlessly.
>> 
>> This patch returns destination pages to the freeing scanner freelist
>> when page migration fails.  This prevents unnecessary work done by the
>> freeing scanner but also encourages memory to be as compacted as
>> possible at the end of the zone.
>> 
>> Reported-by: Greg Thelen 
>
> What did Greg actually report?  IOW, what if any observable problem is
> being fixed here?

I detected the problem at runtime seeing that ext4 metadata pages (esp
the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
constantly visited by compaction calls of migrate_pages().  These pages
had a non-zero b_count which caused fallback_migrate_page() ->
try_to_release_page() -> try_to_free_buffers() to fail.


Re: [PATCH v2 0/4] memcg: Low-limit reclaim

2014-04-29 Thread Greg Thelen

On Mon, Apr 28 2014, Roman Gushchin  wrote:

> 28.04.2014, 16:27, "Michal Hocko" :
>> The series is based on top of the current mmotm tree. Once the series
>> gets accepted I will post a patch which will mark the soft limit as
>> deprecated with a note that it will be eventually dropped. Let me know
>> if you would prefer to have such a patch a part of the series.
>>
>> Thoughts?
>
>
> Looks good to me.
>
> The only question is: are there any ideas how the hierarchy support
> will be used in this case in practice?
> Will someone set low limit for non-leaf cgroups? Why?
>
> Thanks,
> Roman

I imagine that a hosting service may want to give X MB to a top level
memcg (/a) with sub-jobs (/a/b, /a/c) which may(not) have their own
low-limits.

Examples:

case_1) only set low limit on /a.  /a/b and /a/c may overcommit /a's
memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes).

case_2) low limits on all memcgs.  But not overcommitting low limits
(b.low_limit_in_bytes + c.low_limit_in_bytes <=
a.low_limit_in_bytes).
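
And a small numeric illustration of case_2's constraint, with hypothetical sizes; the point is only that the children's low limits must not overcommit the parent's:

#include <stdio.h>

int main(void)
{
        /* Hypothetical low limits, in megabytes. */
        unsigned long a_low = 4096;   /* /a   */
        unsigned long b_low = 1024;   /* /a/b */
        unsigned long c_low = 2048;   /* /a/c */

        if (b_low + c_low <= a_low)
                printf("children's low limits (%luM) fit within /a's low limit (%luM)\n",
                       b_low + c_low, a_low);
        else
                printf("children's low limits overcommit /a\n");
        return 0;
}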


Re: [PATCH -mm v2.1] mm: get rid of __GFP_KMEMCG

2014-04-02 Thread Greg Thelen

On Tue, Apr 01 2014, Vladimir Davydov  wrote:

> Currently to allocate a page that should be charged to kmemcg (e.g.
> threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
> allocated is then to be freed by free_memcg_kmem_pages. Apart from
> looking asymmetrical, this also requires intrusion to the general
> allocation path. So let's introduce separate functions that will
> alloc/free pages charged to kmemcg.
>
> The new functions are called alloc_kmem_pages and free_kmem_pages. They
> should be used when the caller actually would like to use kmalloc, but
> has to fall back to the page allocator for the allocation is large. They
> only differ from alloc_pages and free_pages in that besides allocating
> or freeing pages they also charge them to the kmem resource counter of
> the current memory cgroup.
>
> Signed-off-by: Vladimir Davydov 

One comment nit below, otherwise looks good to me.

Acked-by: Greg Thelen 

> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Glauber Costa 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> ---
> Changes in v2.1:
>  - add missing kmalloc_order forward declaration; lacking it caused
>compilation breakage with CONFIG_TRACING=n
>
>  include/linux/gfp.h |   10 ---
>  include/linux/memcontrol.h  |2 +-
>  include/linux/slab.h|   11 +---
>  include/linux/thread_info.h |2 --
>  include/trace/events/gfpflags.h |1 -
>  kernel/fork.c   |6 ++---
>  mm/page_alloc.c |   56 
> ---
>  mm/slab_common.c|   12 +
>  mm/slub.c   |6 ++---
>  9 files changed, 61 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 39b81dc7d01a..d382db71e300 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -31,7 +31,6 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL      0x20000u
>  #define ___GFP_THISNODE      0x40000u
>  #define ___GFP_RECLAIMABLE   0x80000u
> -#define ___GFP_KMEMCG        0x100000u
>  #define ___GFP_NOTRACK       0x200000u
>  #define ___GFP_NO_KSWAPD     0x400000u
>  #define ___GFP_OTHER_NODE    0x800000u
> @@ -91,7 +90,6 @@ struct vm_area_struct;
>  
>  #define __GFP_NO_KSWAPD  ((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
> other node */
> -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from 
> a memcg-accounted resource */
>  #define __GFP_WRITE  ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
> dirty page */
>  
>  /*
> @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
> order,
>  #define alloc_page_vma_node(gfp_mask, vma, addr, node)   \
>   alloc_pages_vma(gfp_mask, 0, vma, addr, node)
>  
> +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
> +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
> +   unsigned int order);
> +
>  extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
>  extern unsigned long get_zeroed_page(gfp_t gfp_mask);
>  
> @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int 
> order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
> -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
> +extern void __free_kmem_pages(struct page *page, unsigned int order);
> +extern void free_kmem_pages(unsigned long addr, unsigned int order);
>  
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 29068dd26c3d..13acdb5259f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup 
> **memcg, int order)
>* res_counter_charge_nofail, but we hope those allocations are rare,
>* and won't be worth the trouble.
>*/

Just a few lines higher, in the first memcg_kmem_newpage_charge()
comment, there is a leftover reference to GFP_KMEMCG which should be
removed.
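
For reference, the kernel/fork.c portion of the diffstat above amounts
to using the new helpers for thread_info pages, roughly like this
(sketch, not the literal hunk):

  static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
                                                    int node)
  {
          struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
                                                    THREAD_SIZE_ORDER);

          return page ? page_address(page) : NULL;
  }

  static inline void free_thread_info(struct thread_info *ti)
  {
          /* pairs with alloc_kmem_pages_node; also uncharges the kmem counter */
          free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
  }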


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Kamezawa Hiroyuki  wrote:

>> On Tue, Apr 01 2014, Davidlohr Bueso  wrote:
>> 
>>> On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote:
>>>>>>> Ah-hah, that's interesting info.
>>>>>>>
>>>>>>> Let's make the default 64GB?
>>>>>>
>>>>>> 64GB is infinity at that time, but it no longer near infinity today. I 
>>>>>> like
>>>>>> very large or total memory proportional number.
>>>>>
>>>>> So I still like 0 for unlimited. Nice, clean and much easier to look at
>>>>> than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really
>>>>> don't see any disadvantages, as opposed to some other arbitrary value.
>>>>> Furthermore it wouldn't break userspace: any existing sysctl would
>>>>> continue to work, and if not set, the user never has to worry about this
>>>>> tunable again.
>>>>>
>>>>> Please let me know if you all agree with this...
>>>>
>>>> Surething. Why not. :)
>>>
>>> *sigh* actually, the plot thickens a bit with SHMALL (total size of shm
>>> segments system wide, in pages). Currently by default:
>>>
>>> #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16))
>>>
>>> This deals with physical memory, at least admins are recommended to set
>>> it to some large percentage of ram / pagesize. So I think that if we
>>> loose control over the default value, users can potentially DoS the
>>> system, or at least cause excessive swapping if not manually set, but
>>> then again the same goes for anon mem... so do we care?
>> 
> (2014/04/02 10:08), Greg Thelen wrote:
>> 
>> At least when there's an egregious anon leak the oom killer has the
>> power to free the memory by killing until the memory is unreferenced.
>> This isn't true for shm or tmpfs.  So shm is more effective than anon at
>> crushing a machine.
>
> Hm..sysctl.kernel.shm_rmid_forced won't work with oom-killer ?
>
> http://www.openwall.com/lists/kernel-hardening/2011/07/26/7
>
> I like to handle this kind of issue under memcg but hmm..tmpfs's limit is half
> of memory at default.

Ah, yes. I forgot about shm_rmid_forced.  Thanks.  It would give the oom
killer the ability to clean up shm (as it does with anon) when
shm_rmid_forced=1.
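
(The knob lives at /proc/sys/kernel/shm_rmid_forced; a minimal sketch of
flipping it programmatically, assuming the usual procfs mount:)

  #include <stdio.h>

  int main(void)
  {
          /* force IPC_RMID semantics so orphaned segments go away once
           * the last attacher exits and become reclaimable via oom kills */
          FILE *f = fopen("/proc/sys/kernel/shm_rmid_forced", "w");

          if (!f)
                  return 1;
          fputs("1\n", f);
          fclose(f);
          return 0;
  }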


Re: [PATCH -mm v2 1/2] sl[au]b: charge slabs to kmemcg explicitly

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Vladimir Davydov  wrote:

> We have only a few places where we actually want to charge kmem so
> instead of intruding into the general page allocation path with
> __GFP_KMEMCG it's better to explictly charge kmem there. All kmem
> charges will be easier to follow that way.
>
> This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
> from memcg caches' allocflags. Instead it makes slab allocation path
> call memcg_charge_kmem directly getting memcg to charge from the cache's
> memcg params.
>
> This also eliminates any possibility of misaccounting an allocation
> going from one memcg's cache to another memcg, because now we always
> charge slabs against the memcg the cache belongs to. That's why this
> patch removes the big comment to memcg_kmem_get_cache.
>
> Signed-off-by: Vladimir Davydov 

Acked-by: Greg Thelen 
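
To make "charge explicitly" concrete, the slab-page allocation path ends
up doing something of this shape (helper and field names approximated
from the changelog above, not the literal patch):

  static __always_inline int sketch_charge_slab(struct kmem_cache *s,
                                                gfp_t gfp, int order)
  {
          if (!memcg_kmem_enabled())
                  return 0;
          if (is_root_cache(s))           /* root caches stay uncharged */
                  return 0;
          /* charge the memcg this cache belongs to, not current's memcg */
          return memcg_charge_kmem(s->memcg_params->memcg, gfp,
                                   PAGE_SIZE << order);
  }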


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Davidlohr Bueso  wrote:

> On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote:
>> >> > Ah-hah, that's interesting info.
>> >> >
>> >> > Let's make the default 64GB?
>> >>
>> >> 64GB is infinity at that time, but it no longer near infinity today. I 
>> >> like
>> >> very large or total memory proportional number.
>> >
>> > So I still like 0 for unlimited. Nice, clean and much easier to look at
>> > than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really
>> > don't see any disadvantages, as opposed to some other arbitrary value.
>> > Furthermore it wouldn't break userspace: any existing sysctl would
>> > continue to work, and if not set, the user never has to worry about this
>> > tunable again.
>> >
>> > Please let me know if you all agree with this...
>> 
>> Surething. Why not. :)
>
> *sigh* actually, the plot thickens a bit with SHMALL (total size of shm
> segments system wide, in pages). Currently by default:
>
> #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16))
>
> This deals with physical memory, at least admins are recommended to set
> it to some large percentage of ram / pagesize. So I think that if we
> loose control over the default value, users can potentially DoS the
> system, or at least cause excessive swapping if not manually set, but
> then again the same goes for anon mem... so do we care?

At least when there's an egregious anon leak the oom killer has the
power to free the memory by killing until the memory is unreferenced.
This isn't true for shm or tmpfs.  So shm is more effective than anon at
crushing a machine.


Re: [PATCH -mm v2 2/2] mm: get rid of __GFP_KMEMCG

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Vladimir Davydov  wrote:

> Currently to allocate a page that should be charged to kmemcg (e.g.
> threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
> allocated is then to be freed by free_memcg_kmem_pages. Apart from
> looking asymmetrical, this also requires intrusion to the general
> allocation path. So let's introduce separate functions that will
> alloc/free pages charged to kmemcg.
>
> The new functions are called alloc_kmem_pages and free_kmem_pages. They
> should be used when the caller actually would like to use kmalloc, but
> has to fall back to the page allocator for the allocation is large. They
> only differ from alloc_pages and free_pages in that besides allocating
> or freeing pages they also charge them to the kmem resource counter of
> the current memory cgroup.
>
> Signed-off-by: Vladimir Davydov 
> ---
>  include/linux/gfp.h |   10 ---
>  include/linux/memcontrol.h  |2 +-
>  include/linux/slab.h|   11 
>  include/linux/thread_info.h |2 --
>  include/trace/events/gfpflags.h |1 -
>  kernel/fork.c   |6 ++---
>  mm/page_alloc.c |   56 
> ---
>  mm/slab_common.c|   12 +
>  mm/slub.c   |6 ++---
>  9 files changed, 60 insertions(+), 46 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 39b81dc7d01a..d382db71e300 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -31,7 +31,6 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL  0x2u
>  #define ___GFP_THISNODE  0x4u
>  #define ___GFP_RECLAIMABLE   0x8u
> -#define ___GFP_KMEMCG0x10u
>  #define ___GFP_NOTRACK   0x20u
>  #define ___GFP_NO_KSWAPD 0x40u
>  #define ___GFP_OTHER_NODE0x80u
> @@ -91,7 +90,6 @@ struct vm_area_struct;
>  
>  #define __GFP_NO_KSWAPD  ((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
> other node */
> -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from 
> a memcg-accounted resource */
>  #define __GFP_WRITE  ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
> dirty page */
>  
>  /*
> @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
> order,
>  #define alloc_page_vma_node(gfp_mask, vma, addr, node)   \
>   alloc_pages_vma(gfp_mask, 0, vma, addr, node)
>  
> +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
> +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
> +   unsigned int order);
> +
>  extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
>  extern unsigned long get_zeroed_page(gfp_t gfp_mask);
>  
> @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int 
> order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
> -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
> +extern void __free_kmem_pages(struct page *page, unsigned int order);
> +extern void free_kmem_pages(unsigned long addr, unsigned int order);
>  
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 29068dd26c3d..13acdb5259f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup 
> **memcg, int order)
>* res_counter_charge_nofail, but we hope those allocations are rare,
>* and won't be worth the trouble.
>*/
> - if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + if (gfp & __GFP_NOFAIL)
>   return true;
>   if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>   return true;
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 3dd389aa91c7..6d6959292e00 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -358,17 +358,6 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
>  #include 
>  #endif
>  
> -static __always_inline void *
> -kmalloc_order(size_t size, gfp_t flags, unsigned int order)
> -{
> - void *ret;
> -
> - flags |= (__GFP_COMP | __GFP_KMEMCG);
> - ret = (void *) __get_free_pages(flags, order);
> - kmemleak_alloc(ret, size, 1, flags);
> - return ret;
> -}
> -

Removing this from the header file breaks builds without
CONFIG_TRACING.
Example:
% make allnoconfig && make -j4 mm/
[...]
include/linux/slab.h: In function ‘kmalloc_order_trace’:
include/linux/slab.h:367:2: error: implicit declaration of function ‘kmalloc_order’
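
(The v2.1 changelog above addresses exactly this: with !CONFIG_TRACING,
kmalloc_order_trace() is a static inline wrapper in slab.h, so once the
body of kmalloc_order() moves out of the header it still needs a
declaration there.  Roughly:)

  void *kmalloc_order(size_t size, gfp_t flags, unsigned int order);

  #ifdef CONFIG_TRACING
  extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order);
  #else
  static __always_inline void *
  kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
  {
          return kmalloc_order(size, flags, order);  /* needs the declaration */
  }
  #endif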

Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg

2014-03-27 Thread Greg Thelen
On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov wrote:
> Hi Greg,
>
> On 03/27/2014 08:31 AM, Greg Thelen wrote:
>> On Wed, Mar 26 2014, Vladimir Davydov  wrote:
>>
>>> We don't track any random page allocation, so we shouldn't track kmalloc
>>> that falls back to the page allocator.
>> This seems like a change which will lead to confusing (and arguably
>> improper) kernel behavior.  I prefer the behavior prior to this patch.
>>
>> Before this change both of the following allocations are charged to
>> memcg (assuming kmem accounting is enabled):
>>  a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
>>  b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)
>>
>> After this change only 'a' is charged; 'b' goes directly to page
>> allocator which no longer does accounting.
>
> Why do we need to charge 'b' in the first place? Can the userspace
> trigger such allocations massively? If there can only be one or two such
> allocations from a cgroup, is there any point in charging them?

Off the top of my head I don't know of any >8KiB kmalloc()s so I can't
say if they're directly triggerable by user space en masse.  But we
recently ran into some order:3 allocations in networking.  The
networking allocations used a non-generic kmem_cache (rather than
kmalloc which started this discussion).  For details, see ed98df3361f0
("net: use __GFP_NORETRY for high order allocations").  I can't say if
such allocations exist in device drivers, but given the networking
example, it's conceivable that they may (or will) exist.

With slab this isn't a problem because slab has kmalloc kmem_caches for
all supported allocation sizes.  However, slub shows this issue for
any kmalloc() allocations larger than 8KiB (at least on x86_64).  It
seems like a strange direction for kmem accounting to say that
kmalloc allocations are kmem limited, but only if they are either less
than a threshold size or done with slab.  Simply increasing the size
of a data structure doesn't seem like it should automatically cause
the memory to become exempt from kmem limits.
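
In other words, the split being described is roughly the following
(illustrative sketch, not the kernel's literal kmalloc() definition):

  static __always_inline void *sketch_kmalloc(size_t size, gfp_t flags)
  {
          if (size > KMALLOC_MAX_CACHE_SIZE)
                  /* SLUB has no kmalloc cache this big: straight to the
                   * page allocator, which this patch stops charging */
                  return kmalloc_large(size, flags);
          /* <= 8KiB on x86_64: served from a size-class kmem_cache,
           * where kmemcg charging still happens */
          return __kmalloc(size, flags);
  }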

> In fact, do we actually need to charge every random kmem allocation? I
> guess not. For instance, filesystems often allocate data shared among
> all the FS users. It's wrong to charge such allocations to a particular
> memcg, IMO. That said the next step is going to be adding a per kmem
> cache flag specifying if allocations from this cache should be charged
> so that accounting will work only for those caches that are marked so
> explicitly.

It's a question of what direction to approach kmem slab accounting
from: either opt-out (as the code currently is), or opt-in (with per
kmem_cache flags as you suggest).  I agree that some structures end up
being shared (e.g. filesystem block bit map structures).  In an
opt-out system these are charged to a memcg initially and remain
charged there until the memcg is deleted, at which point the shared
objects are reparented to a shared location.  While this isn't
perfect, it's unclear if it's better or worse than analyzing each
class of allocation and deciding if they should be opted in.  One
could make the case (though I'm not) that even dentries are easily
shareable between containers and thus shouldn't be accounted to a
single memcg.  But given user space's ability to DoS a machine with
dentries, they should be accounted.
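
If the opt-in route were taken, the per-cache flag might look something
like this (SLAB_MEMCG_ACCOUNT and the example cache are made up purely
for illustration):

  #define SLAB_MEMCG_ACCOUNT      0x00400000UL    /* hypothetical flag bit */

  struct user_triggerable_obj {                   /* hypothetical object */
          struct list_head list;
          char name[64];
  };

  static struct kmem_cache *user_obj_cachep;

  static int __init user_obj_cache_init(void)
  {
          /* opt in: user space can create these in bulk, so the cache's
           * pages should count against the allocating memcg */
          user_obj_cachep = kmem_cache_create("user_obj",
                                  sizeof(struct user_triggerable_obj),
                                  0, SLAB_MEMCG_ACCOUNT, NULL);
          return user_obj_cachep ? 0 : -ENOMEM;
  }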

> There is one more argument for removing kmalloc_large accounting - we
> don't have an easy way to track such allocations, which prevents us from
> reparenting kmemcg charges on css offline. Of course, we could link
> kmalloc_large pages in some sort of per-memcg list which would allow us
> to find them on css offline, but I don't think such a complication is
> justified.

I assume that reparenting of such non kmem_cache allocations (e.g.
large kmalloc) is difficult because such pages refer to the memcg,
which we're trying to delete, and the memcg has no index of such pages.
If such zombie memcgs are undesirable, then an alternative to indexing
the pages is to define a kmem context object which such large pages
point to.  The kmem context would be reparented without needing to
adjust the individual large pages.  But there are plenty of options.
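
A sketch of that last idea, with hypothetical names (struct kmem_context
and these helpers don't exist; pages would store a context pointer
rather than a mem_cgroup pointer):

  struct kmem_context {
          struct mem_cgroup *memcg;       /* current owner of the charges */
          atomic_long_t nr_pages;         /* charged pages pointing here */
  };

  static void sketch_reparent_kmem(struct kmem_context *ctx,
                                   struct mem_cgroup *parent)
  {
          /* charges migrate with the context; the large pages themselves
           * are never touched, so no per-page index is needed */
          ctx->memcg = parent;
  }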


Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg

2014-03-26 Thread Greg Thelen

On Wed, Mar 26 2014, Vladimir Davydov  wrote:

> We don't track any random page allocation, so we shouldn't track kmalloc
> that falls back to the page allocator.

This seems like a change which will lead to confusing (and arguably
improper) kernel behavior.  I prefer the behavior prior to this patch.

Before this change both of the following allocations are charged to
memcg (assuming kmem accounting is enabled):
 a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
 b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)

After this change only 'a' is charged; 'b' goes directly to page
allocator which no longer does accounting.

> Signed-off-by: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Glauber Costa 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> ---
>  include/linux/slab.h |2 +-
>  mm/memcontrol.c  |   27 +--
>  mm/slub.c|4 ++--
>  3 files changed, 4 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 3dd389aa91c7..8a928ff71d93 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -363,7 +363,7 @@ kmalloc_order(size_t size, gfp_t flags, unsigned int 
> order)
>  {
>   void *ret;
>  
> - flags |= (__GFP_COMP | __GFP_KMEMCG);
> + flags |= __GFP_COMP;
>   ret = (void *) __get_free_pages(flags, order);
>   kmemleak_alloc(ret, size, 1, flags);
>   return ret;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b4b6aef562fa..81a162d01d4d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3528,35 +3528,10 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct 
> mem_cgroup **_memcg, int order)
>  
>   *_memcg = NULL;
>  
> - /*
> -  * Disabling accounting is only relevant for some specific memcg
> -  * internal allocations. Therefore we would initially not have such
> -  * check here, since direct calls to the page allocator that are marked
> -  * with GFP_KMEMCG only happen outside memcg core. We are mostly
> -  * concerned with cache allocations, and by having this test at
> -  * memcg_kmem_get_cache, we are already able to relay the allocation to
> -  * the root cache and bypass the memcg cache altogether.
> -  *
> -  * There is one exception, though: the SLUB allocator does not create
> -  * large order caches, but rather service large kmallocs directly from
> -  * the page allocator. Therefore, the following sequence when backed by
> -  * the SLUB allocator:
> -  *
> -  *  memcg_stop_kmem_account();
> -  *  kmalloc()
> -  *  memcg_resume_kmem_account();
> -  *
> -  * would effectively ignore the fact that we should skip accounting,
> -  * since it will drive us directly to this function without passing
> -  * through the cache selector memcg_kmem_get_cache. Such large
> -  * allocations are extremely rare but can happen, for instance, for the
> -  * cache arrays. We bring this test here.
> -  */
> - if (!current->mm || current->memcg_kmem_skip_account)
> + if (!current->mm)
>   return true;
>  
>   memcg = get_mem_cgroup_from_mm(current->mm);
> -
>   if (!memcg_can_account_kmem(memcg)) {
>   css_put(&memcg->css);
>   return true;
> diff --git a/mm/slub.c b/mm/slub.c
> index 5e234f1f8853..c2e58a787443 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3325,7 +3325,7 @@ static void *kmalloc_large_node(size_t size, gfp_t 
> flags, int node)
>   struct page *page;
>   void *ptr = NULL;
>  
> - flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG;
> + flags |= __GFP_COMP | __GFP_NOTRACK;
>   page = alloc_pages_node(node, flags, get_order(size));
>   if (page)
>   ptr = page_address(page);
> @@ -3395,7 +3395,7 @@ void kfree(const void *x)
>   if (unlikely(!PageSlab(page))) {
>   BUG_ON(!PageCompound(page));
>   kfree_hook(x);
> - __free_memcg_kmem_pages(page, compound_order(page));
> + __free_pages(page, compound_order(page));
>   return;
>   }
>   slab_free(page->slab_cache, page, object, _RET_IP_);
> -- 
> 1.7.10.4



Re: [RFC 0/4] memcg: Low-limit reclaim

2014-02-03 Thread Greg Thelen
On Mon, Feb 03 2014, Michal Hocko wrote:

> On Thu 30-01-14 16:28:27, Greg Thelen wrote:
>> On Thu, Jan 30 2014, Michal Hocko wrote:
>> 
>> > On Wed 29-01-14 11:08:46, Greg Thelen wrote:
>> > [...]
>> >> The series looks useful.  We (Google) have been using something similar.
>> >> In practice such a low_limit (or memory guarantee), doesn't nest very
>> >> well.
>> >> 
>> >> Example:
>> >>   - parent_memcg: limit 500, low_limit 500, usage 500
>> >> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>> >>   - child_memcg: limit 500, low_limit 500, usage 499
>> >
>> > I am not sure this is a good example. Your setup basically say that no
>> > single page should be reclaimed. I can imagine this might be useful in
>> > some cases and I would like to allow it but it sounds too extreme (e.g.
>> > a load which would start trashing heavily once the reclaim starts and it
>> > makes more sense to start it again rather than crowl - think about some
>> > mathematical simulation which might diverge).
>> 
>> Pages will still be reclaimed if usage_in_bytes exceeds
>> limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
>> reclaim my memory due to external pressure, but internal pressure is
>> different.
>
> That sounds strange and very confusing to me. What if the internal
> pressure comes from children memcgs? Lowlimit is intended for protecting
> a group from reclaim and it shouldn't matter whether the reclaim is a
> result of the internal or external pressure.
>
>> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> >> page cache it will lead to an oom kill instead of reclaiming. 
>> >
>> > Does it make any sense to protect all of such memory although it is
>> > easily reclaimable?
>> 
>> I think protection makes sense in this case.  If I know my workload
>> needs 500 to operate well, then I reserve 500 using low_limit.  My app
>> doesn't want to run with less than its reservation.
>> 
>> >> One could argue that this is working as intended because child_memcg
>> >> was promised 500 but can only get 499.  So child_memcg is oom killed
>> >> rather than being forced to operate below its promised low limit.
>> >> 
>> >> This has led to various internal workarounds like:
>> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>> >>   only charge memory to cgroup leafs.  This gets tricky when dealing
>> >>   with reparented memory inherited to parent from child during cgroup
>> >>   deletion.
>> >
>> > Do those need any protection at all?
>> 
>> Interior tree nodes don't need protection from their children.  But
>> children and interior nodes need protection from siblings and parents.
>
> Why? They contains only reparented pages in the above case. Those would
> be #1 candidate for reclaim in most cases, no?

I think we're on the same page.  My example interior node has reclaimed
pages and is a #1 candidate for reclaim induced from charges against
parent_memcg, but not a candidate for reclaim due to global memory
pressure induced by a sibling of parent_memcg.

>> >> - don't set low_limit on non leafs (e.g. do not set low limit on
>> >>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>> >>   customers want to purchase $MEM and setup their workload with a few
>> >>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>> >>   for top-level containers (e.g. parent_memcg).  Thereafter such
>> >>   customers are able to partition their workload with sub memcg below
>> >>   child_memcg.  Example:
>> >>  parent_memcg
>> >>  \
>> >>   child_memcg
>> >> / \
>> >> server   backup
>> >
>> > I think that the low_limit makes sense where you actually want to
>> > protect something from reclaim. And backup sounds like a bad fit for
>> > that.
>> 
>> The backup job would presumably have a small low_limit, but it may still
>> have a minimum working set required to make useful forward progress.
>> 
>> Example:
>>   parent_memcg
>>   \
>>child_memcg limit 500, low_limit 500, usage 500
>>  / \
>>  |   backup   limit 10, low_limit 10, usage 10
>>  |
>>   server limit 490, low_limit 490, usage 490
>> 
>> One could argue that problems appear when
>> server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
>> configuration is to leave some padding:
>>   server.low_limit + backup.low_limit + padding = child_memcg.limit
>> but this just defers the problem.  As memory is reparented into parent,
>> then padding must grow.
>
> Which all sounds like a drawback of internal vs. external pressure
> semantic which you have mentioned above.

Huh?  I probably confused matters with the internal vs external talk
above.  Forgetting about that, I'm happy with the following
configuration assuming low_limit_fallback (ll_fallback) is eventually
available.

   parent_memcg
   \
    child_memcg limit 500, low_limit 500, usage 500, ll_fallback 0
  / \
  |   backup   limit 10, low_limit 10, usage 10, ll_fallback 1
  |
   server limit 490, low_limit 490, usage 490, ll_fallback 1

Thereafter customers often want some weak isolation between server and
backup.  To avoid

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-30 Thread Greg Thelen
On Thu, Jan 30 2014, Michal Hocko wrote:

> On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> [...]
>> The series looks useful.  We (Google) have been using something similar.
>> In practice such a low_limit (or memory guarantee), doesn't nest very
>> well.
>> 
>> Example:
>>   - parent_memcg: limit 500, low_limit 500, usage 500
>> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>>   - child_memcg: limit 500, low_limit 500, usage 499
>
> I am not sure this is a good example. Your setup basically say that no
> single page should be reclaimed. I can imagine this might be useful in
> some cases and I would like to allow it but it sounds too extreme (e.g.
> a load which would start trashing heavily once the reclaim starts and it
> makes more sense to start it again rather than crowl - think about some
> mathematical simulation which might diverge).

Pages will still be reclaimed if usage_in_bytes exceeds
limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
reclaim my memory due to external pressure, but internal pressure is
different.

>> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> page cache it will lead to an oom kill instead of reclaiming. 
>
> Does it make any sense to protect all of such memory although it is
> easily reclaimable?

I think protection makes sense in this case.  If I know my workload
needs 500 to operate well, then I reserve 500 using low_limit.  My app
doesn't want to run with less than its reservation.

>> One could argue that this is working as intended because child_memcg
>> was promised 500 but can only get 499.  So child_memcg is oom killed
>> rather than being forced to operate below its promised low limit.
>> 
>> This has led to various internal workarounds like:
>> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>>   only charge memory to cgroup leafs.  This gets tricky when dealing
>>   with reparented memory inherited to parent from child during cgroup
>>   deletion.
>
> Do those need any protection at all?

Interior tree nodes don't need protection from their children.  But
children and interior nodes need protection from siblings and parents.

>> - don't set low_limit on non leafs (e.g. do not set low limit on
>>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>>   customers want to purchase $MEM and setup their workload with a few
>>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>>   for top-level containers (e.g. parent_memcg).  Thereafter such
>>   customers are able to partition their workload with sub memcg below
>>   child_memcg.  Example:
>>  parent_memcg
>>  \
>>   child_memcg
>> / \
>> server   backup
>
> I think that the low_limit makes sense where you actually want to
> protect something from reclaim. And backup sounds like a bad fit for
> that.

The backup job would presumably have a small low_limit, but it may still
have a minimum working set required to make useful forward progress.

Example:
  parent_memcg
  \
   child_memcg limit 500, low_limit 500, usage 500
 / \
 |   backup   limit 10, low_limit 10, usage 10
 |
  server limit 490, low_limit 490, usage 490

One could argue that problems appear when
server.low_limit + backup.low_limit = child_memcg.limit.  So the safer
configuration is to leave some padding:
  server.low_limit + backup.low_limit + padding = child_memcg.limit
but this just defers the problem.  As memory is reparented into parent,
then padding must grow.

>>   Thereafter customers often want some weak isolation between server and
>>   backup.  To avoid undesired oom kills the server/backup isolation is
>>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>>   limit acts like the low_limit until priority becomes desperate.
>
> Johannes was already suggesting that the low_limit should allow for a
> weaker semantic as well. I am not very much inclined to that but I can
> live with a knob which would say oom_on_lowlimit (on by default but
> allowed to be set to 0). We would fallback to the full reclaim if
> no groups turn out to be reclaimable.

I like the strong semantic of your low_limit at least at level:1 cgroups
(direct children of root).  But I have also encountered situations where
a strict guarantee is too strict and a mere preference is desirable.
Perhaps the best plan is to continue with the proposed strict low_limit
and eventually provide an additional mechanism which provides weaker
guarantees (e.g. soft_limit or something else if soft_limit cannot be
altered).  These two would offer good support for a variety of use
cases.

I'm thinking of something like:

bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
                                 struct mem_cgroup *root,
                                 int priority)
{
        do {
                if (memcg == root)
                        break;
                if (!res_counter_low_limit_excess(&memcg->res))
                        return false;
                if ((priority >= DEF_PRIORITY

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-29 Thread Greg Thelen
On Wed, Dec 11 2013, Michal Hocko wrote:

> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun during the last kernel summit.
>
> This patchset introduces such low limit that is functionally similar to a
> minimum guarantee. Memcgs which are under their lowlimit are not considered
> eligible for the reclaim (both global and hardlimit). The default value of
> the limit is 0 so all groups are eligible by default and an interested
> party has to explicitly set the limit.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill and start it again (or
> since checkpoint) rather than thrash. This would be trivial with low
> limit set to unlimited and the OOM killer will handle the situation as
> required (e.g. kill and restart).
>
> The hierarchical behavior of the lowlimit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle the situation when no group is eligible because all groups are
> below low limit. This is not a big deal for hardlimit reclaim because
> we simply retry the reclaim few times and then trigger memcg OOM killer
> path. It would blow up in the global case when we would loop without
> doing any progress or trigger OOM killer. I would consider configuration
> leading to this state invalid but we should handle that gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch expedites OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of loops in the direct
> reclaim and lets kswapd sleep because it wouldn't do any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
>   memcg, mm: introduce lowlimit reclaim
>   mm, memcg: allow OOM if no memcg is eligible during direct reclaim
>   memcg: Allow setting low_limit
>   mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
>  include/linux/memcontrol.h  | 14 +++
>  include/linux/res_counter.h | 40 ++
>  kernel/res_counter.c|  2 ++
>  mm/memcontrol.c | 60 -
>  mm/vmscan.c | 59 +---
>  5 files changed, 170 insertions(+), 5 deletions(-)

The series looks useful.  We (Google) have been using something similar.
In practice such a low_limit (or memory guarantee), doesn't nest very
well.

Example:
  - parent_memcg: limit 500, low_limit 500, usage 500
1 privately charged non-reclaimable page (e.g. mlock, slab)
  - child_memcg: limit 500, low_limit 500, usage 499

If a streaming file cache workload (e.g. sha1sum) starts gobbling up
page cache it will lead to an oom kill instead of reclaiming.  One could
argue that this is working as intended because child_memcg was promised
500 but can only get 499.  So child_memcg is oom killed rather than
being forced to operate below its promised low limit.

This has led to various internal workarounds like:
- don't charge any memory to interior tree nodes (e.g. parent_memcg);
  only charge memory to cgroup leafs.  This gets tricky when dealing
  with reparented memory inherited to parent from child during cgroup
  deletion.
- don't set low_limit on non leafs (e.g. do not set low limit on
  parent_memcg).  This constrains the cgroup layout a bit.  Some
  customers want to purchase $MEM and setup their workload with a few
  child cgroups.  A system daemon hands out $MEM by setting low_limit
  for top-level containers (e.g. parent_memcg); a sketch of this handout
  step follows after this list.  Thereafter such
  customers are able to partition their workload with sub memcg below
  child_memcg.  Example:
 parent_memcg
 \
  child_memcg
/ \
server   backup
  Thereafter customers often want some weak isolation between server and
  backup.  To avoid undesired oom kills the server/backup isolation is
  provided with a softer memory guarantee (e.g. soft_limit).  The soft
  limit acts like the low_limit until priority becomes desperate.
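
The handout step boils down to something like this (a sketch; the knob name
memory.low_limit_in_bytes is my assumption based on this series):

#include <stdio.h>

/* write one value into a memcg control file, e.g. the proposed low limit */
static int memcg_write_ull(const char *memcg, const char *knob,
                           unsigned long long val)
{
        char path[4096];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s/%s", memcg, knob);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%llu\n", val);
        return fclose(f);
}

/* e.g. memcg_write_ull("parent_memcg", "memory.low_limit_in_bytes", 500ULL << 20); */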

Re: [PATCH] ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races

2013-12-17 Thread Greg Thelen
On Tue, Dec 17 2013, Rafael Aquini wrote:

> After the locking semantics for the SysV IPC API got improved, a couple of
> IPC_RMID race windows were opened because we ended up dropping the
> 'kern_ipc_perm.deleted' check performed way down in ipc_lock().
> The spotted races got sorted out by re-introducing the old test within
> the racy critical sections.
>
> This patch introduces ipc_valid_object() to consolidate the way we cope with
> IPC_RMID races by using the same abstraction across the API implementation.
>
> Signed-off-by: Rafael Aquini 

Acked-by: Greg Thelen 
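
For reference, the helper presumably reduces to a check of the 'deleted'
flag while the object is locked (my sketch of its shape, not copied from
the patch):

#include <linux/ipc.h>
#include <linux/types.h>

/* true only while the locked IPC object has not been marked for removal */
static inline bool ipc_valid_object(struct kern_ipc_perm *perm)
{
        return !perm->deleted;
}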


[PATCH] ipc,shm: fix shm_file deletion races

2013-11-18 Thread Greg Thelen
When IPC_RMID races with other shm operations there's potential for
use-after-free of the shm object's associated file (shm_file).

Here's the race before this patch:
  TASK 1                      TASK 2
  ------                      ------
  shm_rmid()
    ipc_lock_object()
                              shmctl()
                              shp = shm_obtain_object_check()

    shm_destroy()
      shm_unlock()
      fput(shp->shm_file)
                              ipc_lock_object()
                              shmem_lock(shp->shm_file)
                              <OOPS>

The oops is caused because shm_destroy() calls fput() after dropping the
ipc_lock.  fput() clears the file's f_inode, f_path.dentry, and
f_path.mnt, which causes various NULL pointer references in task 2.  I
reliably see the oops in task 2 if with shmlock, shmu

This patch fixes the races by:
1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock().
2) modify at risk operations to check shm_file while holding
   ipc_object_lock().

Example workloads, which each trigger oops (a standalone C sketch of
workload 1 follows after the stack traces)...

Workload 1:
  while true; do
id=$(shmget 1 4096)
shm_rmid $id &
shmlock $id &
wait
  done

  The oops stack shows accessing NULL f_inode due to racing fput:
_raw_spin_lock
shmem_lock
SyS_shmctl

Workload 2:
  while true; do
id=$(shmget 1 4096)
shmat $id 4096 &
shm_rmid $id &
wait
  done

  The oops stack is similar to workload 1 due to NULL f_inode:
touch_atime
shmem_mmap
shm_mmap
mmap_region
do_mmap_pgoff
do_shmat
SyS_shmat

Workload 3:
  while true; do
id=$(shmget 1 4096)
shmlock $id
shm_rmid $id &
shmunlock $id &
wait
  done

  The oops stack shows second fput tripping on an NULL f_inode.  The
  first fput() completed via from shm_destroy(), but a racing thread did
  a get_file() and queued this fput():
locks_remove_flock
__fput
fput
task_work_run
do_notify_resume
int_signal
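
A standalone C sketch of workload 1, using raw SysV calls instead of the
internal shell helpers above:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        for (;;) {
                int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

                if (id < 0)
                        return 1;
                if (fork() == 0) {
                        /* racer: lock the segment while it is being removed */
                        shmctl(id, SHM_LOCK, NULL);
                        _exit(0);
                }
                shmctl(id, IPC_RMID, NULL); /* removal racing with SHM_LOCK */
                wait(NULL);
        }
}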

Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat")
Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
Signed-off-by: Greg Thelen 
Cc:   # 3.10.17+ 3.11.6+
---
 ipc/shm.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index d69739610fd4..0bdf21c6814e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -208,15 +208,18 @@ static void shm_open(struct vm_area_struct *vma)
  */
 static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 {
+   struct file *shm_file;
+
+   shm_file = shp->shm_file;
+   shp->shm_file = NULL;
ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
shm_rmid(ns, shp);
shm_unlock(shp);
-   if (!is_file_hugepages(shp->shm_file))
-   shmem_lock(shp->shm_file, 0, shp->mlock_user);
+   if (!is_file_hugepages(shm_file))
+   shmem_lock(shm_file, 0, shp->mlock_user);
else if (shp->mlock_user)
-   user_shm_unlock(file_inode(shp->shm_file)->i_size,
-   shp->mlock_user);
-   fput (shp->shm_file);
+   user_shm_unlock(file_inode(shm_file)->i_size, shp->mlock_user);
+   fput(shm_file);
ipc_rcu_putref(shp, shm_rcu_free);
 }
 
@@ -983,6 +986,13 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf)
}
 
shm_file = shp->shm_file;
+
+   /* check if shm_destroy() is tearing down shp */
+   if (shm_file == NULL) {
+   err = -EIDRM;
+   goto out_unlock0;
+   }
+
if (is_file_hugepages(shm_file))
goto out_unlock0;
 
@@ -1101,6 +1111,14 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
goto out_unlock;
 
ipc_lock_object(&shp->shm_perm);
+
+   /* check if shm_destroy() is tearing down shp */
+   if (shp->shm_file == NULL) {
+   ipc_unlock_object(&shp->shm_perm);
+   err = -EIDRM;
+   goto out_unlock;
+   }
+
path = shp->shm_file->f_path;
path_get(&path);
shp->shm_nattch++;
-- 
1.8.4.1



Re: [PATCH v2 1/3] percpu: add test module for various percpu operations

2013-11-07 Thread Greg Thelen
On Mon, Nov 04 2013, Andrew Morton wrote:

> On Sun, 27 Oct 2013 10:30:15 -0700 Greg Thelen  wrote:
>
>> Tests various percpu operations.
>
> Could you please take a look at the 32-bit build (this is i386):
>
> lib/percpu_test.c: In function 'percpu_test_init':
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type

I was using gcc 4.6 which apparently adds LL suffix as needed.  Though
there were some other code problems with 32 bit beyond missing suffixes.
Fixed version below tested with both gcc 4.4 and gcc 4.6 on 32 and 64
bit x86.
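
For reference, the kind of constant the warning is about (a minimal sketch,
not taken from the module):

#include <stdio.h>

int main(void)
{
        /*
         * On a 32 bit build a plain 0x100000000 does not fit in 'long' and
         * older gcc warns "integer constant is too large for 'long' type";
         * the LL suffix (or an explicit cast) keeps the constant well
         * defined on both 32 and 64 bit.
         */
        long long big = 0x100000000LL;

        printf("%lld\n", big);
        return 0;
}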

---8<---

From a95bb1ce42b4492644fa10c7c80fd9bbd7bf23b9 Mon Sep 17 00:00:00 2001
In-Reply-To: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org>
References: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org>
From: Greg Thelen 
Date: Sun, 27 Oct 2013 10:30:15 -0700
Subject: [PATCH v2] percpu: add test module for various percpu operations

Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
Changelog since v1:
- use %lld/x which allows for less casting
- fix 32 bit build by casting large constants

 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 094f3152ec2b..1891eb271adf 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu operations test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build test module which validates per-cpu
+ operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb98adf..bb016e116ba4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
 
+obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
 obj-$(CONFIG_FONT_SUPPORT) += fonts/
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
new file mode 100644
index 000000000000..0b5d14dadd1a
--- /dev/null

[PATCH v2 3/3] memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages
accounting" memcg counter errors are possible when moving charged
memory to a different memcg.  Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memcg.move_charge_at_immigrate=1, or memcg deletion.  An example
showing error after memory.force_empty:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 > tasks
  $ echo 1 > x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000           == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct.  So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg.  But the memcg.force_empty and
memory.move_charge_at_immigrate=1 cases are larger problems as the
bogus counters are visible for the (possibly long) remaining life of
the source memcg.

The problem is due to memcg use of __this_cpu_add(.., -nr_pages),
which is subtly wrong because it subtracts the unsigned int nr_pages
(either -1 or -512 for THP) from a signed long percpu counter.  When
nr_pages=-1, -nr_pages=0xffffffff.  On 64 bit machines
stat->count[idx] is signed 64 bit.  So memcg's attempt to simply
decrement a count (e.g. from 1 to 0) boils down to:
  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its
negation.  This only works once "percpu: fix this_cpu_sub() subtrahend
casting for unsigneds" is applied to fix this_cpu_sub().

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8185c..b7ace0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   __this_cpu_add(from->stat->count[idx], -nr_pages);
+   __this_cpu_sub(from->stat->count[idx], nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
 }
-- 
1.8.4.1



[PATCH v2 1/3] percpu: add test module for various percpu operations

2013-10-27 Thread Greg Thelen
Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 06344d9..9fdb452 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu operations test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build test module which validates per-cpu
+ operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..bb016e1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
 
+obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
 obj-$(CONFIG_FONT_SUPPORT) += fonts/
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
new file mode 100644
index 0000000..fcca49e
--- /dev/null
+++ b/lib/percpu_test.c
@@ -0,0 +1,138 @@
+#include <linux/module.h>
+
+/* validate @native and @pcp counter values match @expected */
+#define CHECK(native, pcp, expected)\
+   do {\
+   WARN((native) != (expected),\
+"raw %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)(native), (long)(native),\
+(long)(expected), (long)(expected));   \
+   WARN(__this_cpu_read(pcp) != (expected),\
+"pcp %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \
+(long)(expected), (long)(expected));   \
+   } while (0)
+
+static DEFINE_PER_CPU(long, long_counter);
+static DEFINE_PER_CPU(unsigned long, ulong_counter);
+
+static int __init percpu_test_init(void)
+{
+   /*
+* volatile prevents compiler from optimizing it uses, otherwise the
+* +ul_one and -ul_one below would replace with inc/dec instructions.
+*/
+   volatile unsigned int ui_one = 1;
+   long l = 0;
+   unsigned long ul = 0;
+
+   pr_info("percpu test start\n");
+
+   preempt_disable();
+
+   l += -1;
+   __this_cpu_add(long_counter, -1);
+   CHECK(l, long_counter, -1);
+
+   l += 1;
+   __this_cpu_add(long_counter, 1);
+   CHECK(l, long_counter, 0);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += 1UL;
+   __this_cpu_add(ulong_counter, 1UL);
+   CHECK(ul, ulong_counter, 1);
+
+   ul += -1UL;
+   __this_cpu_add(ulong_counter, -1UL);
+   CHECK(ul, ulong_counter, 0);
+
+   ul += -(unsigned long)1;
+   __this_cpu_add(ulong_counter, -(unsigned long)1);
+   CHECK(ul, ulong_counter, -1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= 1;
+   __this_cpu_dec(ulong_counter);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+   CHECK(ul, ulong_counter, -1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0xffffffff);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   __this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+   CHECK(l, long_counter, 0xffffffffffffffff);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += ui_one;
+   __this_cpu_add(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, 1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= ui_one;
+   __this_cpu_sub(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, -1);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+
+   ul = 3;
+   __this_cpu_w

[PATCH v2 2/3] percpu: fix this_cpu_sub() subtrahend casting for unsigneds

2013-10-27 Thread Greg Thelen
this_cpu_sub() is implemented as negation and addition.

This patch casts the adjustment to the counter type before negation to
sign extend the adjustment.  This helps in cases where the counter
type is wider than an unsigned adjustment.  An alternative to this
patch is to declare such operations unsupported, but it seemed useful
to avoid surprises.
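
A quick userspace model of the before/after behaviour (my sketch, not part
of the patch; __typeof__ stands in for the kernel's typeof):

#include <stdio.h>

#define BROKEN_SUB(counter, val) ((counter) += -(val))
#define FIXED_SUB(counter, val)  ((counter) += -(__typeof__(counter))(val))

int main(void)
{
        long counter = 0;
        unsigned int delta = 1;

        BROKEN_SUB(counter, delta);
        printf("broken: 0x%lx\n", counter); /* 0xffffffff on 64 bit */

        counter = 0;
        FIXED_SUB(counter, delta);
        printf("fixed:  0x%lx\n", counter); /* 0xffffffffffffffff == -1 */
        return 0;
}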

This patch specifically helps the following example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
  long_counter = 0 + 0xffffffff

Also apply the same cast to:
  __this_cpu_sub()
  __this_cpu_sub_return()
  this_cpu_sub_return()

All percpu_test.ko passes, especially the following cases which
previously failed:

  l -= ui_one;
  __this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);

  l -= ui_one;
  this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);
  CHECK(l, long_counter, 0xffffffffffffffff);

  ul -= ui_one;
  __this_cpu_sub(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, -1);
  CHECK(ul, ulong_counter, 0xffffffffffffffff);

  ul = this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 2);

  ul = __this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 1);

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
 arch/x86/include/asm/percpu.h | 3 ++-
 include/linux/percpu.h| 8 
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0da5200..b3e18f8 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -128,7 +128,8 @@ do {
\
 do {   \
typedef typeof(var) pao_T__;\
const int pao_ID__ = (__builtin_constant_p(val) &&  \
- ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ ((val) == 1 || (val) == -1)) ?\
+   (int)(val) : 0; \
if (0) {\
pao_T__ pao_tmp__;  \
pao_tmp__ = (val);  \
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index cc88172..c74088a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -332,7 +332,7 @@ do {
\
 #endif
 
 #ifndef this_cpu_sub
-# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val))
+# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(typeof(pcp))(val))
 #endif
 
 #ifndef this_cpu_inc
@@ -418,7 +418,7 @@ do {
\
 # define this_cpu_add_return(pcp, val) 
__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #endif
 
-#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, -(val))
+#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, -(typeof(pcp))(val))
 #define this_cpu_inc_return(pcp)   this_cpu_add_return(pcp, 1)
 #define this_cpu_dec_return(pcp)   this_cpu_add_return(pcp, -1)
 
@@ -586,7 +586,7 @@ do {
\
 #endif
 
 #ifndef __this_cpu_sub
-# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), -(val))
+# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), -(typeof(pcp))(val))
 #endif
 
 #ifndef __this_cpu_inc
@@ -668,7 +668,7 @@ do {
\
__pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
 #endif
 
-#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(val))
+#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(typeof(pcp))(val))
 #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1)
 #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1)
 
-- 
1.8.4.1



[PATCH v2 0/3] fix unsigned pcp adjustments

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting"
memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic
values because the negated nr_pages is not sign extended (counter is long,
nr_pages is unsigned int).  The memcg fix is __this_cpu_sub(counter, nr_pages).
But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was
implemented as __this_cpu_add(counter, -nr_pages) which suffers the same
problem.  Example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value 0xffffffff,
rather than 0xffffffffffffffff.  This is because this_cpu_sub(pcp, delta) boils
down to:
  long_counter = 0 + 0xffffffff

v3.12-rc6 shows that only new memcg code is affected by this problem - the new
mem_cgroup_move_account_page_stat() is the only place where an unsigned
adjustment is used.  All other callers (e.g. shrink_dcache_sb) already use a
signed adjustment, so no problems before v3.12.  Though I did not audit the
stable kernel trees, so there could be something hiding in there.

Patch 1 creates a test module for percpu operations which demonstrates the
__this_cpu_sub() problems.  This patch is independent and can be discarded if there
is no interest.

Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments.

Patch 3 uses __this_cpu_sub() in memcg.

An alternative smaller solution is for memcg to use:
  __this_cpu_add(counter, -(int)nr_pages)
admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments.  But
I felt like fixing the core services to prevent this in the future.

Changes from V1:
- more accurate patch titles, patch logs, and test module description now
  referring to per cpu operations rather than per cpu counters.
- move small test code update from patch 2 to patch 1 (where the test is
  introduced).

Greg Thelen (3):
  percpu: add test module for various percpu operations
  percpu: fix this_cpu_sub() subtrahend casting for unsigneds
  memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend
casting

 arch/x86/include/asm/percpu.h |   3 +-
 include/linux/percpu.h|   8 +--
 lib/Kconfig.debug |   9 +++
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 mm/memcontrol.c   |   2 +-
 6 files changed, 156 insertions(+), 6 deletions(-)
 create mode 100644 lib/percpu_test.c

-- 
1.8.4.1


Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
On Sun, Oct 27 2013, Greg Thelen wrote:

> this_cpu_sub() is implemented as negation and addition.
>
> This patch casts the adjustment to the counter type before negation to
> sign extend the adjustment.  This helps in cases where the counter
> type is wider than an unsigned adjustment.  An alternative to this
> patch is to declare such operations unsupported, but it seemed useful
> to avoid surprises.
>
> This patch specifically helps the following example:
>   unsigned int delta = 1
>   preempt_disable()
>   this_cpu_write(long_counter, 0)
>   this_cpu_sub(long_counter, delta)
>   preempt_enable()
>
> Before this change long_counter on a 64 bit machine ends with value
> 0xffffffff, rather than 0xffffffffffffffff.  This is because
> this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
> which is basically:
>   long_counter = 0 + 0xffffffff
>
> Also apply the same cast to:
>   __this_cpu_sub()
>   this_cpu_sub_return()
>   and __this_cpu_sub_return()
>
> All percpu_test.ko passes, especially the following cases which
> previously failed:
>
>   l -= ui_one;
>   __this_cpu_sub(long_counter, ui_one);
>   CHECK(l, long_counter, -1);
>
>   l -= ui_one;
>   this_cpu_sub(long_counter, ui_one);
>   CHECK(l, long_counter, -1);
>   CHECK(l, long_counter, 0xffffffffffffffff);
>
>   ul -= ui_one;
>   __this_cpu_sub(ulong_counter, ui_one);
>   CHECK(ul, ulong_counter, -1);
>   CHECK(ul, ulong_counter, 0xffffffffffffffff);
>
>   ul = this_cpu_sub_return(ulong_counter, ui_one);
>   CHECK(ul, ulong_counter, 2);
>
>   ul = __this_cpu_sub_return(ulong_counter, ui_one);
>   CHECK(ul, ulong_counter, 1);
>
> Signed-off-by: Greg Thelen 
> ---
>  arch/x86/include/asm/percpu.h | 3 ++-
>  include/linux/percpu.h| 8 
>  lib/percpu_test.c | 2 +-
>  3 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
> index 0da5200..b3e18f8 100644
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -128,7 +128,8 @@ do {  
> \
>  do { \
>   typedef typeof(var) pao_T__;\
>   const int pao_ID__ = (__builtin_constant_p(val) &&  \
> -   ((val) == 1 || (val) == -1)) ? (val) : 0; \
> +   ((val) == 1 || (val) == -1)) ?\
> + (int)(val) : 0; \
>   if (0) {\
>   pao_T__ pao_tmp__;  \
>   pao_tmp__ = (val);  \
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index cc88172..c74088a 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -332,7 +332,7 @@ do {  
> \
>  #endif
>  
>  #ifndef this_cpu_sub
> -# define this_cpu_sub(pcp, val)  this_cpu_add((pcp), -(val))
> +# define this_cpu_sub(pcp, val)  this_cpu_add((pcp), 
> -(typeof(pcp))(val))
>  #endif
>  
>  #ifndef this_cpu_inc
> @@ -418,7 +418,7 @@ do {  
> \
>  # define this_cpu_add_return(pcp, val)   
> __pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
>  #endif
>  
> -#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(val))
> +#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, 
> -(typeof(pcp))(val))
>  #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1)
>  #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1)
>  
> @@ -586,7 +586,7 @@ do {  
> \
>  #endif
>  
>  #ifndef __this_cpu_sub
> -# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(val))
> +# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), 
> -(typeof(pcp))(val))
>  #endif
>  
>  #ifndef __this_cpu_inc
> @@ -668,7 +668,7 @@ do {  
> \
>   __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
>  #endif
>  
> -#define __this_cpu_sub_return(pcp, val)  __this_cpu_add_return(pcp, 
> -(val))
> +#define __this_cpu_sub_return(pcp, val)  __this_cpu_add_return(pcp, 
> -(typeof(pcp))(val))
>  #define __this_cpu_inc_return(pcp)   __this_cpu_add_return(pcp, 1)
>  #define __this_cpu_d

Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
On Sun, Oct 27 2013, Tejun Heo wrote:

> On Sun, Oct 27, 2013 at 05:04:29AM -0700, Andrew Morton wrote:
>> On Sun, 27 Oct 2013 07:22:55 -0400 Tejun Heo  wrote:
>> 
>> > We probably want to cc stable for this and the next one.  How should
>> > these be routed?  I can take these through percpu tree or mm works
>> > too.  Either way, it'd be best to route them together.
>> 
>> Yes, all three look like -stable material to me.  I'll grab them later
>> in the week if you haven't ;)
>
> Tried to apply to percpu but the third one is a fix for a patch which
> was added to -mm during v3.12-rc1, so these are yours. :)

I don't object to stable for the first two non-memcg patches, but it's
probably unnecessary.  I should have made it more clear, but an audit of
v3.12-rc6 shows that only new memcg code is affected - the new
mem_cgroup_move_account_page_stat() is the only place where an unsigned
adjustment is used.  All other callers (e.g. shrink_dcache_sb) already
use a signed adjustment, so no problems before v3.12.  Though I did not
audit the stable kernel trees, so there could be something hiding in
there.

>> The names of the first two patches distress me.  They rather clearly
>> assert that the code affects percpu_counter.[ch], but that is not the case. 
>> Massaging is needed to fix that up.
>
> Yeah, something like the following would be better
>
>  percpu: add test module for various percpu operations
>  percpu: fix this_cpu_sub() subtrahend casting for unsigneds
>  memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend 
> casting

No objection to renaming.  Let me know if you want these reposted with
updated titles.


[PATCH 3/3] memcg: use __this_cpu_sub to decrement stats

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages
accounting" memcg counter errors are possible when moving charged
memory to a different memcg.  Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memcg.move_charge_at_immigrate=1, or memcg deletion.  An example
showing error after memory.force_empty:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 > tasks
  $ echo 1 > x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000           == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct.  So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg.  But the memcg.force_empty and
memory.move_charge_at_immigrate=1 cases are larger problems as the
bogus counters are visible for the (possibly long) remaining life of
the source memcg.

The problem is due to memcg use of __this_cpu_add(.., -nr_pages),
which is subtly wrong because it subtracts the unsigned int nr_pages
(either -1 or -512 for THP) from a signed long percpu counter.  When
nr_pages=-1, -nr_pages=0xffffffff.  On 64 bit machines
stat->count[idx] is signed 64 bit.  So memcg's attempt to simply
decrement a count (e.g. from 1 to 0) boils down to:
  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its
negation.  This only works with the "percpu counter: cast
this_cpu_sub() adjustment" patch which fixes this_cpu_sub().

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8185c..b7ace0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   __this_cpu_add(from->stat->count[idx], -nr_pages);
+   __this_cpu_sub(from->stat->count[idx], nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
 }
-- 
1.8.4.1



[PATCH 0/3] fix unsigned pcp adjustments

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting"
memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic
values because the negated nr_pages is not sign extended (counter is long,
nr_pages is unsigned int).  The memcg fix is __this_cpu_sub(counter, nr_pages).
But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was
implemented as __this_cpu_add(counter, -nr_pages) which suffers the same
problem.  Example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to:
  long_counter = 0 + 0xffffffff

Patch 1 creates a test module for percpu counter operations which demonstrates
the __this_cpu_sub() problems.  This patch is independent and can be discarded
if there is no interest.

Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments.

Patch 3 uses __this_cpu_sub() in memcg.

An alternative smaller solution is for memcg to use:
  __this_cpu_add(counter, -(int)nr_pages)
admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments.  But
I felt like fixing the core services to prevent this in the future.

Greg Thelen (3):
  percpu counter: test module
  percpu counter: cast this_cpu_sub() adjustment
  memcg: use __this_cpu_sub to decrement stats

 arch/x86/include/asm/percpu.h |   3 +-
 include/linux/percpu.h|   8 +--
 lib/Kconfig.debug |   9 +++
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 mm/memcontrol.c   |   2 +-
 6 files changed, 156 insertions(+), 6 deletions(-)
 create mode 100644 lib/percpu_test.c

-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
this_cpu_sub() is implemented as negation and addition.

This patch casts the adjustment to the counter type before negation to
sign extend the adjustment.  This helps in cases where the counter
type is wider than an unsigned adjustment.  An alternative to this
patch is to declare such operations unsupported, but it seemed useful
to avoid surprises.

This patch specifically helps the following example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
  long_counter = 0 + 0xffffffff
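
For illustration only, the casting fix can be modelled in userspace C
(MODEL_SUB below is a made-up stand-in for the fixed this_cpu_sub(); typeof
is the GCC extension the kernel headers already rely on):

  #include <stdio.h>

  /* negate at the counter's width so the adjustment sign extends */
  #define MODEL_SUB(pcp, val)   ((pcp) += -(typeof(pcp))(val))

  int main(void)
  {
          long long_counter = 0;
          unsigned int delta = 1;

          MODEL_SUB(long_counter, delta);
          /* prints ffffffffffffffff, i.e. -1, as intended */
          printf("%lx\n", (unsigned long)long_counter);
          return 0;
  }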

Also apply the same cast to:
  __this_cpu_sub()
  this_cpu_sub_return()
  and __this_cpu_sub_return()

All percpu_test.ko passes, especially the following cases which
previously failed:

  l -= ui_one;
  __this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);

  l -= ui_one;
  this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);
  CHECK(l, long_counter, 0xffffffffffffffff);

  ul -= ui_one;
  __this_cpu_sub(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, -1);
  CHECK(ul, ulong_counter, 0xffffffffffffffff);

  ul = this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 2);

  ul = __this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 1);

Signed-off-by: Greg Thelen 
---
 arch/x86/include/asm/percpu.h | 3 ++-
 include/linux/percpu.h| 8 
 lib/percpu_test.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0da5200..b3e18f8 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -128,7 +128,8 @@ do {
\
 do {   \
typedef typeof(var) pao_T__;\
const int pao_ID__ = (__builtin_constant_p(val) &&  \
- ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ ((val) == 1 || (val) == -1)) ?\
+   (int)(val) : 0; \
if (0) {\
pao_T__ pao_tmp__;  \
pao_tmp__ = (val);  \
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index cc88172..c74088a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -332,7 +332,7 @@ do {
\
 #endif
 
 #ifndef this_cpu_sub
-# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val))
+# define this_cpu_sub(pcp, val)this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef this_cpu_inc
@@ -418,7 +418,7 @@ do {
\
 # define this_cpu_add_return(pcp, val) 
__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #endif
 
-#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, -(val))
+#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define this_cpu_inc_return(pcp)   this_cpu_add_return(pcp, 1)
 #define this_cpu_dec_return(pcp)   this_cpu_add_return(pcp, -1)
 
@@ -586,7 +586,7 @@ do {
\
 #endif
 
 #ifndef __this_cpu_sub
-# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), -(val))
+# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef __this_cpu_inc
@@ -668,7 +668,7 @@ do {
\
__pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
 #endif
 
-#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(val))
+#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1)
 #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1)
 
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
index 1ebeb44..8ab4231 100644
--- a/lib/percpu_test.c
+++ b/lib/percpu_test.c
@@ -118,7 +118,7 @@ static int __init percpu_test_init(void)
CHECK(ul, ulong_counter, 2);
 
ul = __this_cpu_sub_return(ulong_counter, ui_one);
-   CHECK(ul, ulong_counter, 0);
+   CHECK(ul, ulong_counter, 1);
 
preempt_enable();
 
-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More major

[PATCH 1/3] percpu counter: test module

2013-10-27 Thread Greg Thelen
Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen 
---
 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 06344d9..cee589d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu counter test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build a test module which validates per-cpu
+ counter operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..bb016e1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
 
+obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
 obj-$(CONFIG_FONT_SUPPORT) += fonts/
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
new file mode 100644
index 000..1ebeb44
--- /dev/null
+++ b/lib/percpu_test.c
@@ -0,0 +1,138 @@
+#include <linux/module.h>
+
+/* validate @native and @pcp counter values match @expected */
+#define CHECK(native, pcp, expected)\
+   do {\
+   WARN((native) != (expected),\
+"raw %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)(native), (long)(native),\
+(long)(expected), (long)(expected));   \
+   WARN(__this_cpu_read(pcp) != (expected),\
+"pcp %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \
+(long)(expected), (long)(expected));   \
+   } while (0)
+
+static DEFINE_PER_CPU(long, long_counter);
+static DEFINE_PER_CPU(unsigned long, ulong_counter);
+
+static int __init percpu_test_init(void)
+{
+   /*
+* volatile prevents compiler from optimizing it uses, otherwise the
+* +ul_one and -ul_one below would replace with inc/dec instructions.
+*/
+   volatile unsigned int ui_one = 1;
+   long l = 0;
+   unsigned long ul = 0;
+
+   pr_info("percpu test start\n");
+
+   preempt_disable();
+
+   l += -1;
+   __this_cpu_add(long_counter, -1);
+   CHECK(l, long_counter, -1);
+
+   l += 1;
+   __this_cpu_add(long_counter, 1);
+   CHECK(l, long_counter, 0);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += 1UL;
+   __this_cpu_add(ulong_counter, 1UL);
+   CHECK(ul, ulong_counter, 1);
+
+   ul += -1UL;
+   __this_cpu_add(ulong_counter, -1UL);
+   CHECK(ul, ulong_counter, 0);
+
+   ul += -(unsigned long)1;
+   __this_cpu_add(ulong_counter, -(unsigned long)1);
+   CHECK(ul, ulong_counter, -1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= 1;
+   __this_cpu_dec(ulong_counter);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+   CHECK(ul, ulong_counter, -1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0xffffffff);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   __this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+   CHECK(l, long_counter, 0xffffffffffffffff);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += ui_one;
+   __this_cpu_add(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, 1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= ui_one;
+   __this_cpu_sub(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, -1);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+
+   ul = 3;
+   __this_cpu_write(ulong_counter, 3)


Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
On Sun, Oct 27 2013, Greg Thelen wrote:

 this_cpu_sub() is implemented as negation and addition.

 This patch casts the adjustment to the counter type before negation to
 sign extend the adjustment.  This helps in cases where the counter
 type is wider than an unsigned adjustment.  An alternative to this
 patch is to declare such operations unsupported, but it seemed useful
 to avoid surprises.

 This patch specifically helps the following example:
   unsigned int delta = 1
   preempt_disable()
   this_cpu_write(long_counter, 0)
   this_cpu_sub(long_counter, delta)
   preempt_enable()

 Before this change long_counter on a 64 bit machine ends with value
 0x, rather than 0x.  This is because
 this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
 which is basically:
   long_counter = 0 + 0x

 Also apply the same cast to:
   __this_cpu_sub()
   this_cpu_sub_return()
   and __this_cpu_sub_return()

 All percpu_test.ko passes, especially the following cases which
 previously failed:

   l -= ui_one;
   __this_cpu_sub(long_counter, ui_one);
   CHECK(l, long_counter, -1);

   l -= ui_one;
   this_cpu_sub(long_counter, ui_one);
   CHECK(l, long_counter, -1);
   CHECK(l, long_counter, 0x);

   ul -= ui_one;
   __this_cpu_sub(ulong_counter, ui_one);
   CHECK(ul, ulong_counter, -1);
   CHECK(ul, ulong_counter, 0x);

   ul = this_cpu_sub_return(ulong_counter, ui_one);
   CHECK(ul, ulong_counter, 2);

   ul = __this_cpu_sub_return(ulong_counter, ui_one);
   CHECK(ul, ulong_counter, 1);

 Signed-off-by: Greg Thelen gthe...@google.com
 ---
  arch/x86/include/asm/percpu.h | 3 ++-
  include/linux/percpu.h| 8 
  lib/percpu_test.c | 2 +-
  3 files changed, 7 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
 index 0da5200..b3e18f8 100644
 --- a/arch/x86/include/asm/percpu.h
 +++ b/arch/x86/include/asm/percpu.h
 @@ -128,7 +128,8 @@ do {  
 \
  do { \
   typedef typeof(var) pao_T__;\
   const int pao_ID__ = (__builtin_constant_p(val)   \
 -   ((val) == 1 || (val) == -1)) ? (val) : 0; \
 +   ((val) == 1 || (val) == -1)) ?\
 + (int)(val) : 0; \
   if (0) {\
   pao_T__ pao_tmp__;  \
   pao_tmp__ = (val);  \
 diff --git a/include/linux/percpu.h b/include/linux/percpu.h
 index cc88172..c74088a 100644
 --- a/include/linux/percpu.h
 +++ b/include/linux/percpu.h
 @@ -332,7 +332,7 @@ do {  
 \
  #endif
  
  #ifndef this_cpu_sub
 -# define this_cpu_sub(pcp, val)  this_cpu_add((pcp), -(val))
 +# define this_cpu_sub(pcp, val)  this_cpu_add((pcp), 
 -(typeof(pcp))(val))
  #endif
  
  #ifndef this_cpu_inc
 @@ -418,7 +418,7 @@ do {  
 \
  # define this_cpu_add_return(pcp, val)   
 __pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
  #endif
  
 -#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(val))
 +#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, 
 -(typeof(pcp))(val))
  #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1)
  #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1)
  
 @@ -586,7 +586,7 @@ do {  
 \
  #endif
  
  #ifndef __this_cpu_sub
 -# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(val))
 +# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), 
 -(typeof(pcp))(val))
  #endif
  
  #ifndef __this_cpu_inc
 @@ -668,7 +668,7 @@ do {  
 \
   __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
  #endif
  
 -#define __this_cpu_sub_return(pcp, val)  __this_cpu_add_return(pcp, 
 -(val))
 +#define __this_cpu_sub_return(pcp, val)  __this_cpu_add_return(pcp, 
 -(typeof(pcp))(val))
  #define __this_cpu_inc_return(pcp)   __this_cpu_add_return(pcp, 1)
  #define __this_cpu_dec_return(pcp)   __this_cpu_add_return(pcp, -1)
  
 diff --git a/lib/percpu_test.c b/lib/percpu_test.c
 index 1ebeb44..8ab4231 100644
 --- a/lib/percpu_test.c
 +++ b/lib/percpu_test.c
 @@ -118,7 +118,7 @@ static int __init percpu_test_init(void)
   CHECK(ul, ulong_counter, 2);
  
   ul = __this_cpu_sub_return(ulong_counter, ui_one);
 - CHECK(ul, ulong_counter, 0);
 + CHECK(ul, ulong_counter, 1);
  
   preempt_enable();

Oops.  This update to percpu_test.c

[PATCH v2 0/3] fix unsigned pcp adjustments

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting"
memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic
values because the negated nr_pages is not sign extended (counter is long,
nr_pages is unsigned int).  The memcg fix is __this_cpu_sub(counter, nr_pages).
But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was
implemented as __this_cpu_add(counter, -nr_pages) which suffers the same
problem.  Example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to:
  long_counter = 0 + 0xffffffff

v3.12-rc6 shows that only new memcg code is affected by this problem - the new
mem_cgroup_move_account_page_stat() is the only place where an unsigned
adjustment is used.  All other callers (e.g. shrink_dcache_sb) already use a
signed adjustment, so no problems before v3.12.  Though I did not audit the
stable kernel trees, so there could be something hiding in there.

Patch 1 creates a test module for percpu operations which demonstrates the
__this_cpu_sub() problems.  This patch is independent can be discarded if there
is no interest.

Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments.

Patch 3 uses __this_cpu_sub() in memcg.

An alternative smaller solution is for memcg to use:
  __this_cpu_add(counter, -(int)nr_pages)
admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments.  But
I felt like fixing the core services to prevent this in the future.

Changes from V1:
- more accurate patch titles, patch logs, and test module description now
  referring to per cpu operations rather than per cpu counters.
- move small test code update from patch 2 to patch 1 (where the test is
  introduced).

Greg Thelen (3):
  percpu: add test module for various percpu operations
  percpu: fix this_cpu_sub() subtrahend casting for unsigneds
  memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend
casting

 arch/x86/include/asm/percpu.h |   3 +-
 include/linux/percpu.h|   8 +--
 lib/Kconfig.debug |   9 +++
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 mm/memcontrol.c   |   2 +-
 6 files changed, 156 insertions(+), 6 deletions(-)
 create mode 100644 lib/percpu_test.c

-- 
1.8.4.1
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 2/3] percpu: fix this_cpu_sub() subtrahend casting for unsigneds

2013-10-27 Thread Greg Thelen
this_cpu_sub() is implemented as negation and addition.

This patch casts the adjustment to the counter type before negation to
sign extend the adjustment.  This helps in cases where the counter
type is wider than an unsigned adjustment.  An alternative to this
patch is to declare such operations unsupported, but it seemed useful
to avoid surprises.

This patch specifically helps the following example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
  long_counter = 0 + 0xffffffff

Also apply the same cast to:
  __this_cpu_sub()
  __this_cpu_sub_return()
  this_cpu_sub_return()

All percpu_test.ko passes, especially the following cases which
previously failed:

  l -= ui_one;
  __this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);

  l -= ui_one;
  this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);
  CHECK(l, long_counter, 0xffffffffffffffff);

  ul -= ui_one;
  __this_cpu_sub(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, -1);
  CHECK(ul, ulong_counter, 0xffffffffffffffff);

  ul = this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 2);

  ul = __this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 1);

Signed-off-by: Greg Thelen gthe...@google.com
Acked-by: Tejun Heo t...@kernel.org
---
 arch/x86/include/asm/percpu.h | 3 ++-
 include/linux/percpu.h| 8 
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0da5200..b3e18f8 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -128,7 +128,8 @@ do {
\
 do {   \
typedef typeof(var) pao_T__;\
const int pao_ID__ = (__builtin_constant_p(val) &&  \
- ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ ((val) == 1 || (val) == -1)) ?\
+   (int)(val) : 0; \
if (0) {\
pao_T__ pao_tmp__;  \
pao_tmp__ = (val);  \
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index cc88172..c74088a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -332,7 +332,7 @@ do {
\
 #endif
 
 #ifndef this_cpu_sub
-# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val))
+# define this_cpu_sub(pcp, val)this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef this_cpu_inc
@@ -418,7 +418,7 @@ do {
\
 # define this_cpu_add_return(pcp, val) 
__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #endif
 
-#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, -(val))
+#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define this_cpu_inc_return(pcp)   this_cpu_add_return(pcp, 1)
 #define this_cpu_dec_return(pcp)   this_cpu_add_return(pcp, -1)
 
@@ -586,7 +586,7 @@ do {
\
 #endif
 
 #ifndef __this_cpu_sub
-# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), -(val))
+# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef __this_cpu_inc
@@ -668,7 +668,7 @@ do {
\
__pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
 #endif
 
-#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(val))
+#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1)
 #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1)
 
-- 
1.8.4.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/3] percpu: add test module for various percpu operations

2013-10-27 Thread Greg Thelen
Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen gthe...@google.com
Acked-by: Tejun Heo t...@kernel.org
---
 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 06344d9..9fdb452 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu operations test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build a test module which validates per-cpu
+ operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..bb016e1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
 
+obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
 obj-$(CONFIG_FONT_SUPPORT) += fonts/
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
new file mode 100644
index 000..fcca49e
--- /dev/null
+++ b/lib/percpu_test.c
@@ -0,0 +1,138 @@
+#include <linux/module.h>
+
+/* validate @native and @pcp counter values match @expected */
+#define CHECK(native, pcp, expected)\
+   do {\
+   WARN((native) != (expected),\
+"raw %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)(native), (long)(native),\
+(long)(expected), (long)(expected));   \
+   WARN(__this_cpu_read(pcp) != (expected),\
+"pcp %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \
+(long)(expected), (long)(expected));   \
+   } while (0)
+
+static DEFINE_PER_CPU(long, long_counter);
+static DEFINE_PER_CPU(unsigned long, ulong_counter);
+
+static int __init percpu_test_init(void)
+{
+   /*
+* volatile prevents compiler from optimizing it uses, otherwise the
+* +ul_one and -ul_one below would replace with inc/dec instructions.
+*/
+   volatile unsigned int ui_one = 1;
+   long l = 0;
+   unsigned long ul = 0;
+
+   pr_info("percpu test start\n");
+
+   preempt_disable();
+
+   l += -1;
+   __this_cpu_add(long_counter, -1);
+   CHECK(l, long_counter, -1);
+
+   l += 1;
+   __this_cpu_add(long_counter, 1);
+   CHECK(l, long_counter, 0);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += 1UL;
+   __this_cpu_add(ulong_counter, 1UL);
+   CHECK(ul, ulong_counter, 1);
+
+   ul += -1UL;
+   __this_cpu_add(ulong_counter, -1UL);
+   CHECK(ul, ulong_counter, 0);
+
+   ul += -(unsigned long)1;
+   __this_cpu_add(ulong_counter, -(unsigned long)1);
+   CHECK(ul, ulong_counter, -1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= 1;
+   __this_cpu_dec(ulong_counter);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+   CHECK(ul, ulong_counter, -1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0xffffffff);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   __this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+   CHECK(l, long_counter, 0xffffffffffffffff);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += ui_one;
+   __this_cpu_add(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, 1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= ui_one;
+   __this_cpu_sub(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, -1);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+
+   ul = 3;
+   __this_cpu_write(ulong_counter, 3

[PATCH v2 3/3] memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages
accounting" memcg counter errors are possible when moving charged
memory to a different memcg.  Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memcg.move_charge_at_immigrate=1, or memcg deletion.  An example
showing error after memory.force_empty:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 > tasks
  $ echo 1 > x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000           == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct.  So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg.  But the memcg.force_empty and
memory.move_charge_at_immigrate=1 cases are larger problems as the
bogus counters are visible for the (possibly long) remaining life of
the source memcg.

The problem is due to memcg use of __this_cpu_add(.., -nr_pages),
which is subtly wrong because it adds the negation of the unsigned int
nr_pages (1, or 512 for THP) to a signed long percpu counter.  When
nr_pages=1, -nr_pages=0xffffffff as an unsigned int.  On 64 bit machines
stat->count[idx] is signed 64 bit.  So memcg's attempt to simply
decrement a count (e.g. from 1 to 0) boils down to:
  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its
negation.  This only works once "percpu: fix this_cpu_sub() subtrahend
casting for unsigneds" is applied to fix this_cpu_sub().

Signed-off-by: Greg Thelen gthe...@google.com
Acked-by: Tejun Heo t...@kernel.org
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8185c..b7ace0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup 
*from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   __this_cpu_add(from->stat->count[idx], -nr_pages);
+   __this_cpu_sub(from->stat->count[idx], nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
 }
-- 
1.8.4.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RIP: mem_cgroup_move_account+0xf4/0x290

2013-10-26 Thread Greg Thelen
> [ 7691.500464]  [] ? insert_kthread_work+0x40/0x40
>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 
>> 00 00 41 8b 4c 24 18
>> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 
>> 01 00 00 44 89 c9
>> f7 d9 89 cf 65 48 01 7e
>> [ 7691.528638] RIP  [] mem_cgroup_move_account+0xf4/0x290
>>
>> Add the required __this_cpu_read().
>
> Sorry for my mistake and thanks for the fix up, it looks good to me.
>
> Reviewed-by: Sha Zhengju 
>
>
> Thanks,
> Sha
>>
>> Signed-off-by: Johannes Weiner 
>> ---
>>   mm/memcontrol.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4097a78..a4864b6 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct 
>> mem_cgroup *from,
>>   {
>>  /* Update stat data for mem_cgroup */
>>  preempt_disable();
>> -WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>> +WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
>>  __this_cpu_add(from->stat->count[idx], -nr_pages);
>>  __this_cpu_add(to->stat->count[idx], nr_pages);
>>  preempt_enable();
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

I was just polishing up this following patch which I think is better
because it should avoid spurious warnings.

---8<---

>From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
From: Greg Thelen 
Date: Fri, 25 Oct 2013 21:59:57 -0700
Subject: [PATCH] memcg: remove incorrect underflow check

When a memcg is deleted mem_cgroup_reparent_charges() moves charged
memory to the parent memcg.  As of v3.11-9444-g3ea67d0 "memcg: add per
cgroup writeback pages accounting" there's a bad pointer read.  The goal
was to check for counter underflow.  The counter is a per cpu counter
and there are two problems with the code:
(1) the per cpu access function isn't used; instead a naked pointer is
used, which easily causes a panic.
(2) the check doesn't sum all cpus

Test:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ echo 3 > /proc/sys/vm/drop_caches
  $ (echo $BASHPID >> x/tasks && exec cat) &
  [1] 7154
  $ grep ^mapped x/memory.stat
  mapped_file 53248
  $ echo 7154 > tasks
  $ rmdir x
  <PANIC>

The fix is to remove the check.  It's currently dangerous and isn't
worth fixing to use something expensive, such as
percpu_counter_sum(), for each reparented page.  __this_cpu_read()
isn't enough to fix this because there are no guarantees about any
single cpu's count.  The only guarantee is that the sum of all per-cpu
counters is >= nr_pages.
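
To illustrate why a per-cpu slot cannot be checked in isolation, here is a
small userspace model (NR_CPUS and the plain array are stand-ins for the
real percpu machinery, not kernel code):

  #include <stdio.h>

  #define NR_CPUS 4

  int main(void)
  {
          long count[NR_CPUS] = { 0 };
          long sum = 0;

          count[0] += 3;  /* pages charged while running on cpu 0 */
          count[2] -= 2;  /* some of them uncharged later from cpu 2 */

          for (int i = 0; i < NR_CPUS; i++)
                  sum += count[i];

          /* count[2] is -2, yet the logical total is 1: no underflow */
          printf("cpu2=%ld total=%ld\n", count[2], sum);
          return 0;
  }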

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 34d3ca9..aa8185c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,6 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup 
*from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
__this_cpu_add(from->stat->count[idx], -nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2 v4] memcg: support hierarchical memory.numa_stats

2013-09-16 Thread Greg Thelen
From: Ying Han 

From: Ying Han 

The memory.numa_stat file was not hierarchical.  Memory charged to the
children was not shown in parent's numa_stat.

This change adds the "hierarchical_" stats to the existing stats.  The
new hierarchical stats include the sum of all children's values in
addition to the value of the memcg.
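
For illustration (a toy userspace model, not the memcg code; the node type
and field names here are made up, and the real implementation walks the
tree with for_each_mem_cgroup_tree() as in the diff below):

  #include <stdio.h>

  struct node {
          unsigned long nr_pages;        /* this group's own count */
          struct node *child, *sibling;  /* simple tree links */
  };

  /* hierarchical count = own count plus the counts of all descendants */
  static unsigned long hierarchical_nr_pages(const struct node *n)
  {
          unsigned long sum = n->nr_pages;

          for (const struct node *c = n->child; c; c = c->sibling)
                  sum += hierarchical_nr_pages(c);
          return sum;
  }

  int main(void)
  {
          struct node b = { .nr_pages = 908 };            /* a/b */
          struct node a = { .nr_pages = 0, .child = &b }; /* a   */

          /* prints "a: own=0 hierarchical=908", matching the example below */
          printf("a: own=%lu hierarchical=%lu\n",
                 a.nr_pages, hierarchical_nr_pages(&a));
          return 0;
  }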

Tested: Create cgroup a, a/b and run workload under b.  The values of
b are included in the "hierarchical_*" under a.

$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b

Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

$ cat a/b/memory.numa_stat
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

Signed-off-by: Ying Han 
Signed-off-by: Greg Thelen 
---
Changelog since v3:
- push 'iter' local variable usage closer to its usage
- documentation fixup

 Documentation/cgroups/memory.txt | 10 +++---
 mm/memcontrol.c  | 17 +
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 8af4ad1..e2bc132 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -573,15 +573,19 @@ an memcg since the pages are allowed to be allocated from 
any physical
 node.  One of the use cases is evaluating application performance by
 combining this information with the application's CPU allocation.
 
-We export "total", "file", "anon" and "unevictable" pages per-node for
-each memcg.  The ouput format of memory.numa_stat is:
+Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
+per-node page counts including "hierarchical_<counter>" which sums up all
+hierarchical children's values in addition to the memcg's own value.
+
+The ouput format of memory.numa_stat is:
 
 total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
 file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
 anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
+hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
-And we have total = file + anon + unevictable.
+The "total" count is sum of file + anon + unevictable.
 
 6. Hierarchy support
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5806eea..d02176d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5206,6 +5206,23 @@ static int memcg_numa_stat_show(struct 
cgroup_subsys_state *css,
seq_putc(m, '\n');
}
 
+   for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+   struct mem_cgroup *iter;
+
+   nr = 0;
+   for_each_mem_cgroup_tree(iter, memcg)
+   nr += mem_cgroup_nr_lru_pages(iter, stat->lru_mask);
+   seq_printf(m, "hierarchical_%s=%lu", stat->name, nr);
+   for_each_node_state(nid, N_MEMORY) {
+   nr = 0;
+   for_each_mem_cgroup_tree(iter, memcg)
+   nr += mem_cgroup_node_nr_lru_pages(
+   iter, nid, stat->lru_mask);
+   seq_printf(m, " N%d=%lu", nid, nr);
+   }
+   seq_putc(m, '\n');
+   }
+
return 0;
 }
 #endif /* CONFIG_NUMA */
-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2 v4] memcg: refactor mem_control_numa_stat_show()

2013-09-16 Thread Greg Thelen
Refactor mem_control_numa_stat_show() to use a new stats structure for
smaller and simpler code.  This consolidates nearly identical code.

     text      data       bss        dec    hex  filename
8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before
8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after

Signed-off-by: Greg Thelen 
Signed-off-by: Ying Han 
---
Changelog since v3:
- Use ARRAY_SIZE(stats) rather than array terminator.
- rebased to latest linus/master (d8efd82) to incorporate 182446d08 "cgroup:
  pass around cgroup_subsys_state instead of cgroup in file methods".

 mm/memcontrol.c | 58 +++--
 1 file changed, 23 insertions(+), 35 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5ff3ce..5806eea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5179,45 +5179,33 @@ static int mem_cgroup_move_charge_write(struct 
cgroup_subsys_state *css,
 static int memcg_numa_stat_show(struct cgroup_subsys_state *css,
struct cftype *cft, struct seq_file *m)
 {
+   struct numa_stat {
+   const char *name;
+   unsigned int lru_mask;
+   };
+
+   static const struct numa_stat stats[] = {
+   { "total", LRU_ALL },
+   { "file", LRU_ALL_FILE },
+   { "anon", LRU_ALL_ANON },
+   { "unevictable", BIT(LRU_UNEVICTABLE) },
+   };
+   const struct numa_stat *stat;
int nid;
-   unsigned long total_nr, file_nr, anon_nr, unevictable_nr;
-   unsigned long node_nr;
+   unsigned long nr;
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
-   total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
-   seq_printf(m, "total=%lu", total_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
-   seq_printf(m, " N%d=%lu", nid, node_nr);
-   }
-   seq_putc(m, '\n');
-
-   file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
-   seq_printf(m, "file=%lu", file_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
-   LRU_ALL_FILE);
-   seq_printf(m, " N%d=%lu", nid, node_nr);
-   }
-   seq_putc(m, '\n');
-
-   anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
-   seq_printf(m, "anon=%lu", anon_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
-   LRU_ALL_ANON);
-   seq_printf(m, " N%d=%lu", nid, node_nr);
-   }
-   seq_putc(m, '\n');
-
-   unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
-   seq_printf(m, "unevictable=%lu", unevictable_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
-   BIT(LRU_UNEVICTABLE));
-   seq_printf(m, " N%d=%lu", nid, node_nr);
+   for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+   nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask);
+   seq_printf(m, "%s=%lu", stat->name, nr);
+   for_each_node_state(nid, N_MEMORY) {
+   nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
+ stat->lru_mask);
+   seq_printf(m, " N%d=%lu", nid, nr);
+   }
+   seq_putc(m, '\n');
}
-   seq_putc(m, '\n');
+
return 0;
 }
 #endif /* CONFIG_NUMA */
-- 
1.8.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2 v4] memcg: support hierarchical memory.numa_stats

2013-09-16 Thread Greg Thelen
From: Ying Han ying...@google.com

From: Ying Han ying...@google.com

The memory.numa_stat file was not hierarchical.  Memory charged to the
children was not shown in parent's numa_stat.

This change adds the "hierarchical_" stats to the existing stats.  The
new hierarchical stats include the sum of all children's values in
addition to the value of the memcg.

Tested: Create cgroup a, a/b and run workload under b.  The values of
b are included in the "hierarchical_*" under a.

$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b

Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

$ cat a/b/memory.numa_stat
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

Signed-off-by: Ying Han ying...@google.com
Signed-off-by: Greg Thelen gthe...@google.com
---
Changelog since v3:
- push 'iter' local variable usage closer to its usage
- documentation fixup

 Documentation/cgroups/memory.txt | 10 +++---
 mm/memcontrol.c  | 17 +
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 8af4ad1..e2bc132 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -573,15 +573,19 @@ an memcg since the pages are allowed to be allocated from 
any physical
 node.  One of the use cases is evaluating application performance by
 combining this information with the application's CPU allocation.
 
-We export "total", "file", "anon" and "unevictable" pages per-node for
-each memcg.  The ouput format of memory.numa_stat is:
+Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
+per-node page counts including "hierarchical_<counter>" which sums up all
+hierarchical children's values in addition to the memcg's own value.
+
+The ouput format of memory.numa_stat is:
 
 total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
 file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
 anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
+hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
-And we have total = file + anon + unevictable.
+The "total" count is sum of file + anon + unevictable.
 
 6. Hierarchy support
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5806eea..d02176d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5206,6 +5206,23 @@ static int memcg_numa_stat_show(struct 
cgroup_subsys_state *css,
seq_putc(m, '\n');
}
 
+   for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+   struct mem_cgroup *iter;
+
+   nr = 0;
+   for_each_mem_cgroup_tree(iter, memcg)
+   nr += mem_cgroup_nr_lru_pages(iter, stat->lru_mask);
+   seq_printf(m, "hierarchical_%s=%lu", stat->name, nr);
+   for_each_node_state(nid, N_MEMORY) {
+   nr = 0;
+   for_each_mem_cgroup_tree(iter, memcg)
+   nr += mem_cgroup_node_nr_lru_pages(
+   iter, nid, stat->lru_mask);
+   seq_printf(m, " N%d=%lu", nid, nr);
+   }
+   seq_putc(m, '\n');
+   }
+
return 0;
 }
 #endif /* CONFIG_NUMA */
-- 
1.8.4



[PATCH 2/2 v3] memcg: support hierarchical memory.numa_stats

2013-09-05 Thread Greg Thelen
From: Ying Han 

The memory.numa_stat file was not hierarchical.  Memory charged to the
children was not shown in parent's numa_stat.

This change adds the "hierarchical_" stats to the existing stats.  The
new hierarchical stats include the sum of all children's values in
addition to the value of the memcg.

Tested: Create cgroup a, a/b and run workload under b.  The values of
b are included in the "hierarchical_*" under a.

$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b

Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=61 N0=0 N1=41 N2=20 N3=0
hierarchical_file=14 N0=0 N1=0 N2=14 N3=0
hierarchical_anon=47 N0=0 N1=41 N2=6 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

The workload memory usage:
$ cat a/b/memory.numa_stat
total=73 N0=0 N1=41 N2=32 N3=0
file=14 N0=0 N1=0 N2=14 N3=0
anon=59 N0=0 N1=41 N2=18 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=73 N0=0 N1=41 N2=32 N3=0
hierarchical_file=14 N0=0 N1=0 N2=14 N3=0
hierarchical_anon=59 N0=0 N1=41 N2=18 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

Signed-off-by: Ying Han 
Signed-off-by: Greg Thelen 
---
Changelog since v2:
- reworded Documentation/cgroup/memory.txt
- updated commit description

 Documentation/cgroups/memory.txt | 10 +++---
 mm/memcontrol.c  | 16 
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 2a33306..d6d6479 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -571,15 +571,19 @@ an memcg since the pages are allowed to be allocated from 
any physical
 node.  One of the use cases is evaluating application performance by
 combining this information with the application's CPU allocation.
 
-We export "total", "file", "anon" and "unevictable" pages per-node for
-each memcg.  The ouput format of memory.numa_stat is:
+Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
+per-node page counts including "hierarchical_<counter>" which sums of all
+hierarchical children's values in addition to the memcg's own value.
+
+The ouput format of memory.numa_stat is:
 
 total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
 file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
 anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
 unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
+hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
 
-And we have total = file + anon + unevictable.
+The "total" count is sum of file + anon + unevictable.
 
 6. Hierarchy support
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4d2b037..0e5be30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5394,6 +5394,7 @@ static int memcg_numa_stat_show(struct cgroup *cont, 
struct cftype *cft,
int nid;
unsigned long nr;
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+   struct mem_cgroup *iter;
 
for (stat = stats; stat->name; stat++) {
nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask);
@@ -5406,6 +5407,21 @@ static int memcg_numa_stat_show(struct cgroup *cont, 
struct cftype *cft,
seq_putc(m, '\n');
}
 
+   for (stat = stats; stat->name; stat++) {
+   nr = 0;
+   for_each_mem_cgroup_tree(iter, memcg)
+   nr += mem_cgroup_nr_lru_pages(iter, stat->lru_mask);
+   seq_printf(m, "hierarchical_%s=%lu", stat->name, nr);
+   for_each_node_state(nid, N_MEMORY) {
+   nr = 0;
+   for_each_mem_cgroup_tree(iter, memcg)
+   nr += mem_cgroup_node_nr_lru_pages(
+   iter, nid, stat->lru_mask);
+   seq_printf(m, " N%d=%lu", nid, nr);
+   }
+   seq_putc(m, '\n');
+   }
+
return 0;
 }
 #endif /* CONFIG_NUMA */
-- 
1.8.4



[PATCH 1/2 v3] memcg: refactor mem_control_numa_stat_show()

2013-09-05 Thread Greg Thelen
Refactor mem_control_numa_stat_show() to use a new stats structure for
smaller and simpler code.  This consolidates nearly identical code.

      text      data       bss        dec    hex filename
8,055,980 1,675,648 1,896,448 11,628,076 b16e2c vmlinux.before
8,055,276 1,675,648 1,896,448 11,627,372 b16b6c vmlinux.after

Signed-off-by: Greg Thelen 
Signed-off-by: Ying Han 
---
Changelog since v2:
- rebased to v3.11
- updated commit description

 mm/memcontrol.c | 57 +++--
 1 file changed, 23 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0878ff7..4d2b037 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5378,45 +5378,34 @@ static int mem_cgroup_move_charge_write(struct cgroup 
*cgrp,
 static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
  struct seq_file *m)
 {
+   struct numa_stat {
+   const char *name;
+   unsigned int lru_mask;
+   };
+
+   static const struct numa_stat stats[] = {
+   { "total", LRU_ALL },
+   { "file", LRU_ALL_FILE },
+   { "anon", LRU_ALL_ANON },
+   { "unevictable", BIT(LRU_UNEVICTABLE) },
+   { NULL, 0 }  /* terminator */
+   };
+   const struct numa_stat *stat;
int nid;
-   unsigned long total_nr, file_nr, anon_nr, unevictable_nr;
-   unsigned long node_nr;
+   unsigned long nr;
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
-   total_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL);
-   seq_printf(m, "total=%lu", total_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL);
-   seq_printf(m, " N%d=%lu", nid, node_nr);
-   }
-   seq_putc(m, '\n');
-
-   file_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_FILE);
-   seq_printf(m, "file=%lu", file_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
-   LRU_ALL_FILE);
-   seq_printf(m, " N%d=%lu", nid, node_nr);
-   }
-   seq_putc(m, '\n');
-
-   anon_nr = mem_cgroup_nr_lru_pages(memcg, LRU_ALL_ANON);
-   seq_printf(m, "anon=%lu", anon_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
-   LRU_ALL_ANON);
-   seq_printf(m, " N%d=%lu", nid, node_nr);
+   for (stat = stats; stat->name; stat++) {
+   nr = mem_cgroup_nr_lru_pages(memcg, stat->lru_mask);
+   seq_printf(m, "%s=%lu", stat->name, nr);
+   for_each_node_state(nid, N_MEMORY) {
+   nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
+ stat->lru_mask);
+   seq_printf(m, " N%d=%lu", nid, nr);
+   }
+   seq_putc(m, '\n');
}
-   seq_putc(m, '\n');
 
-   unevictable_nr = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_UNEVICTABLE));
-   seq_printf(m, "unevictable=%lu", unevictable_nr);
-   for_each_node_state(nid, N_MEMORY) {
-   node_nr = mem_cgroup_node_nr_lru_pages(memcg, nid,
-   BIT(LRU_UNEVICTABLE));
-   seq_printf(m, " N%d=%lu", nid, node_nr);
-   }
-   seq_putc(m, '\n');
return 0;
 }
 #endif /* CONFIG_NUMA */
-- 
1.8.4



[PATCH] memcg: fix multiple large threshold notifications

2013-08-31 Thread Greg Thelen
A memory cgroup with (1) multiple threshold notifications and (2) at
least one threshold >=2G was not reliable.  Specifically the
notifications would either not fire or would not fire in the proper
order.

The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
thresholds in sorted order.  mem_cgroup_usage_register_event() sorts
them with compare_thresholds(), which returns the difference of two 64
bit thresholds as an int.  If the difference is positive but has
bit[31] set, then sort() treats the difference as negative and breaks
sort order.

This fix compares the two arbitrary 64 bit thresholds returning the
classic -1, 0, 1 result.
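
A userspace sketch of the truncation, using the two thresholds from the test
below (a demonstration only, not kernel code):

  #include <stdio.h>

  int main(void)
  {
          unsigned long long a = 0x81001000ULL;  /* 2164264960 */
          unsigned long long b = 0x1000ULL;      /* 4096 */

          /* old compare_thresholds(): the u64 difference returned as int.
           * 0x81001000 - 0x1000 = 0x81000000 has bit 31 set, so the int
           * result is negative and sort() orders a before b. */
          int broken = (int)(a - b);

          /* fixed compare: explicit -1/0/1, safe for any 64 bit values */
          int fixed = (a > b) ? 1 : (a < b) ? -1 : 0;

          printf("broken: %d  fixed: %d\n", broken, fixed);
          return 0;
  }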

The test below sets two notifications (at 0x1000 and 0x81001000):
  cd /sys/fs/cgroup/memory
  mkdir x
  for x in 4096 2164264960; do
cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
  done
  echo $$ > x/cgroup.procs
  anon_leaker 500M

v3.11-rc7 fails to signal the 4096 event listener:
  Leaking...
  Done leaking pages.

Patched v3.11-rc7 properly notifies:
  Leaking...
  4096 listener:2013:8:31:14:13:36
  Done leaking pages.

The fixed bug is old.  It appears to date back to the introduction of
memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg:
implement memory thresholds"

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0878ff7..aa44621 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5616,7 +5616,13 @@ static int compare_thresholds(const void *a, const void 
*b)
const struct mem_cgroup_threshold *_a = a;
const struct mem_cgroup_threshold *_b = b;
 
-   return _a->threshold - _b->threshold;
+   if (_a->threshold > _b->threshold)
+   return 1;
+
+   if (_a->threshold < _b->threshold)
+   return -1;
+
+   return 0;
 }
 
 static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
-- 
1.8.4



Re: [PATCHv10 1/4] debugfs: add get/set for atomic types

2013-05-09 Thread Greg Thelen
On Wed, May 08 2013, Seth Jennings wrote:

> debugfs currently lack the ability to create attributes
> that set/get atomic_t values.
>
> This patch adds support for this through a new
> debugfs_create_atomic_t() function.
>
> Signed-off-by: Seth Jennings 
> Acked-by: Greg Kroah-Hartman 
> Acked-by: Mel Gorman 
> ---
>  fs/debugfs/file.c   | 42 ++
>  include/linux/debugfs.h |  2 ++
>  2 files changed, 44 insertions(+)
>
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index c5ca6ae..fa26d5b 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -21,6 +21,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/io.h>
>  #include <linux/slab.h>
> +#include <linux/atomic.h>
>  
>  static ssize_t default_read_file(struct file *file, char __user *buf,
>size_t count, loff_t *ppos)
> @@ -403,6 +404,47 @@ struct dentry *debugfs_create_size_t(const char *name, 
> umode_t mode,
>  }
>  EXPORT_SYMBOL_GPL(debugfs_create_size_t);
>  
> +static int debugfs_atomic_t_set(void *data, u64 val)
> +{
> + atomic_set((atomic_t *)data, val);
> + return 0;
> +}
> +static int debugfs_atomic_t_get(void *data, u64 *val)
> +{
> + *val = atomic_read((atomic_t *)data);
> + return 0;
> +}
> +DEFINE_SIMPLE_ATTRIBUTE(fops_atomic_t, debugfs_atomic_t_get,
> + debugfs_atomic_t_set, "%llu\n");
> +DEFINE_SIMPLE_ATTRIBUTE(fops_atomic_t_ro, debugfs_atomic_t_get, NULL, 
> "%llu\n");
> +DEFINE_SIMPLE_ATTRIBUTE(fops_atomic_t_wo, NULL, debugfs_atomic_t_set, 
> "%llu\n");
> +
> +/**
> + * debugfs_create_atomic_t - create a debugfs file that is used to read and
> + * write an atomic_t value
> + * @name: a pointer to a string containing the name of the file to create.
> + * @mode: the permission that the file should have
> + * @parent: a pointer to the parent dentry for this file.  This should be a
> + *  directory dentry if set.  If this parameter is %NULL, then the
> + *  file will be created in the root of the debugfs filesystem.
> + * @value: a pointer to the variable that the file should read to and write
> + * from.
> + */
> +struct dentry *debugfs_create_atomic_t(const char *name, umode_t mode,
> +  struct dentry *parent, atomic_t *value)
> +{
> + /* if there are no write bits set, make read only */
> + if (!(mode & S_IWUGO))
> + return debugfs_create_file(name, mode, parent, value,
> + &fops_atomic_t_ro);
> + /* if there are no read bits set, make write only */
> + if (!(mode & S_IRUGO))
> + return debugfs_create_file(name, mode, parent, value,
> + &fops_atomic_t_wo);
> +
> + return debugfs_create_file(name, mode, parent, value, &fops_atomic_t);
> +}
> +EXPORT_SYMBOL_GPL(debugfs_create_atomic_t);
>  
>  static ssize_t read_file_bool(struct file *file, char __user *user_buf,
> size_t count, loff_t *ppos)
> diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
> index 63f2465..d68b4ea 100644
> --- a/include/linux/debugfs.h
> +++ b/include/linux/debugfs.h
> @@ -79,6 +79,8 @@ struct dentry *debugfs_create_x64(const char *name, umode_t 
> mode,
> struct dentry *parent, u64 *value);
>  struct dentry *debugfs_create_size_t(const char *name, umode_t mode,
>struct dentry *parent, size_t *value);
> +struct dentry *debugfs_create_atomic_t(const char *name, umode_t mode,
> +  struct dentry *parent, atomic_t *value);
>  struct dentry *debugfs_create_bool(const char *name, umode_t mode,
> struct dentry *parent, u32 *value);

Looking at v3.9 I see a conflicting definition of
debugfs_create_atomic_t() in lib/fault-inject.c.  A kernel with this
patch and CONFIG_FAULT_INJECTION=y and CONFIG_FAULT_INJECTION_DEBUG_FS=y
will not build:

lib/fault-inject.c:196:23: error: static declaration of 
'debugfs_create_atomic_t' follows non-static declaration
include/linux/debugfs.h:87:16: note: previous declaration of 
'debugfs_create_atomic_t' was here
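
As an aside, a minimal hypothetical call site for the new helper could look
like the sketch below; the module, directory and counter names are invented
for illustration and error handling is omitted:

  #include <linux/atomic.h>
  #include <linux/debugfs.h>
  #include <linux/module.h>

  static atomic_t example_hits = ATOMIC_INIT(0);
  static struct dentry *example_dir;

  static int __init example_init(void)
  {
          example_dir = debugfs_create_dir("example", NULL);
          /* 0644: both read and write bits set, so the read/write fops
           * variant is installed */
          debugfs_create_atomic_t("hits", 0644, example_dir, &example_hits);
          return 0;
  }

  static void __exit example_exit(void)
  {
          debugfs_remove_recursive(example_dir);
  }

  module_init(example_init);
  module_exit(example_exit);
  MODULE_LICENSE("GPL");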


Re: [PATCH] vfs: dcache: cond_resched in shrink_dentry_list

2013-04-10 Thread Greg Thelen
On Wed, Apr 10 2013, Andrew Morton wrote:

> On Tue, 09 Apr 2013 17:37:20 -0700 Greg Thelen  wrote:
>
>> > Call cond_resched() in shrink_dcache_parent() to maintain
>> > interactivity.
>> >
>> > Before this patch:
>> >
>> > void shrink_dcache_parent(struct dentry * parent)
>> > {
>> >    while ((found = select_parent(parent, &dispose)) != 0)
>> >            shrink_dentry_list(&dispose);
>> > }
>> >
>> > select_parent() populates the dispose list with dentries which
>> > shrink_dentry_list() then deletes.  select_parent() carefully uses
>> > need_resched() to avoid doing too much work at once.  But neither
>> > shrink_dcache_parent() nor its called functions call cond_resched().
>> > So once need_resched() is set select_parent() will return single
>> > dentry dispose list which is then deleted by shrink_dentry_list().
>> > This is inefficient when there are a lot of dentry to process.  This
>> > can cause softlockup and hurts interactivity on non preemptable
>> > kernels.
>> >
>> > This change adds cond_resched() in shrink_dcache_parent().  The
>> > benefit of this is that need_resched() is quickly cleared so that
>> > future calls to select_parent() are able to efficiently return a big
>> > batch of dentry.
>> >
>> > These additional cond_resched() do not seem to impact performance, at
>> > least for the workload below.
>> >
>> > Here is a program which can cause soft lockup on a if other system
>> > activity sets need_resched().
>
> I was unable to guess what word was missing from "on a if other" ;)

Less is more ;)  Reword to:

  Here is a program which can cause soft lockup if other system activity
  sets need_resched().

>> Should this change go through Al's or Andrew's branch?
>
> I'll fight him for it.

Thanks.

> Softlockups are fairly serious, so I'll put a cc:stable in there.  Or
> were the changes which triggered this problem added after 3.9?

This also applies to stable.  I see the problem at least back to v3.3.
I did not test earlier kernels, but could if you want.
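
For reference, the shape of the change under discussion, sketched against the
loop quoted above (an illustration, not the exact committed hunk):

  void shrink_dcache_parent(struct dentry *parent)
  {
          LIST_HEAD(dispose);
          int found;

          while ((found = select_parent(parent, &dispose)) != 0) {
                  shrink_dentry_list(&dispose);
                  /* clear need_resched() so the next select_parent()
                   * call can again gather a full batch of dentries */
                  cond_resched();
          }
  }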

