Re: [PATCH] shmem, memcg: enable memcg aware shrinker

2020-06-01 Thread Greg Thelen
Kirill Tkhai  wrote:

> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>> 
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>> 
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>> echo . > $FS/$i
>>   done
>> 
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker.  But there's no such logic in shmem.  Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>> 
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers 
>> during memcg shrink_slab()")
>> Cc:  # 4.19+
>> Signed-off-by: Greg Thelen 
>> ---
>>  mm/shmem.c | 61 +++---
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, 
>> struct kstat *stat,
>>  return 0;
>>  }
>>  
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly 
>> splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +   struct inode *inode, struct page *page)
>> +{
>> +struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +spin_lock(&sbinfo->shrinklist_lock);
>> +/*
>> + * _careful to defend against unlocked access to ->shrink_list in
>> + * shmem_unused_huge_shrink()
>> + */
>> +if (list_empty_careful(&info->shrinklist)) {
>> +list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +sbinfo->shrinklist_len++;
>> +}
>> +spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +if (page && PageTransHuge(page))
>> +memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +   inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, 
>> struct iattr *attr)
>>   * to shrink under memory pressure.
>>   */
>>  if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -spin_lock(&sbinfo->shrinklist_lock);
>> -/*
>> - * _careful to defend against unlocked access to
>> - * ->shrink_list in shmem_unused_huge_shrink()
>> - */
>> -if (list_empty_careful(&info->shrinklist)) {
>> -list_add_tail(&info->shrinkl
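For background on the mechanism this thread hinges on, here is a rough
userspace sketch (illustrative only, not the kernel's shrink_slab() or
memcg_set_shrinker_bit() code) of how the per-memcg, per-node shrinker
bitmap gates shrinker calls, and why a shrinker that never sets its bit is
never invoked:

#include <stdio.h>

#define MAX_SHRINKERS 64

struct memcg_node {
	unsigned long long shrinker_map;	/* one bit per registered shrinker */
};

/* What shmem was missing: flag "shrinker id may have work for this memcg/node". */
static void set_shrinker_bit(struct memcg_node *mn, int shrinker_id)
{
	mn->shrinker_map |= 1ULL << shrinker_id;
}

/* Roughly what memcg-aware shrink_slab() does: skip shrinkers whose bit is clear. */
static void shrink_slab_memcg(struct memcg_node *mn)
{
	int i;

	for (i = 0; i < MAX_SHRINKERS; i++) {
		if (!(mn->shrinker_map & (1ULL << i)))
			continue;	/* never called if the bit was never set */
		printf("calling shrinker %d\n", i);
	}
}

int main(void)
{
	struct memcg_node mn = { 0 };

	shrink_slab_memcg(&mn);		/* nothing runs: no bits set */
	set_shrinker_bit(&mn, 3);	/* e.g. shmem flagging its sb shrinker */
	shrink_slab_memcg(&mn);		/* now shrinker 3 gets a chance to run */
	return 0;
}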

Re: [PATCH] selftests: more general make nesting support

2020-08-05 Thread Greg Thelen
On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen  wrote:
>
> selftests can be built from the toplevel kernel makefile (e.g. make
> kselftest-all) or directly (make -C tools/testing/selftests all).
>
> The toplevel kernel makefile explicitly disables implicit rules with
> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
> selftest makefiles require implicit make rules, which is why
> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
> kselftest target") reenables implicit rules by clearing MAKEFLAGS if
> MAKELEVEL=1.
>
> So far so good.  However, if the toplevel makefile is called from an
> outer makefile then MAKELEVEL will be elevated, which breaks the
> MAKELEVEL equality test.
> Example wrapped makefile error:
>   $ cat ~/Makefile
>   all:
> $(MAKE) defconfig
> $(MAKE) kselftest-all
>   $ make -sf ~/Makefile
> futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h   
> /src/tools/testing/selftests/kselftest.h ../include/futextest.h 
> ../include/atomic.h ../include/logging.h -lpthread -lrt -o 
> /src/tools/testing/selftests/futex/functional/futex_wait_timeout
>   make[4]: futex_wait_timeout.c: Command not found
>
> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
> direct side effect of "make -R".  This enables arbitrary makefile
> nesting.
>
> Signed-off-by: Greg Thelen 
> ---
>  tools/testing/selftests/Makefile | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/Makefile 
> b/tools/testing/selftests/Makefile
> index 1195bd85af38..289a2e4b3f6f 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -84,10 +84,10 @@ endif
>  # of the targets gets built.
>  FORCE_TARGETS ?=
>
> -# Clear LDFLAGS and MAKEFLAGS if called from main
> -# Makefile to avoid test build failures when test
> -# Makefile doesn't have explicit build rules.
> -ifeq (1,$(MAKELEVEL))
> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
> +# implicit rules to sub-test Makefiles which avoids build failures in test
> +# Makefile that don't have explicit build rules.
> +ifeq (,$(LINK.c))
>  override LDFLAGS =
>  override MAKEFLAGS =
>  endif
> --
> 2.28.0.rc0.142.g3c755180ce-goog

Is there any feedback on this patch?  It's not high priority but something that
will help me make more use of selftests.


Re: [PATCH] selftests: more general make nesting support

2020-08-05 Thread Greg Thelen
Shuah Khan  wrote:

> On 8/5/20 1:36 PM, Greg Thelen wrote:
>> On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen  wrote:
>>>
>>> selftests can be built from the toplevel kernel makefile (e.g. make
>>> kselftest-all) or directly (make -C tools/testing/selftests all).
>>>
>>> The toplevel kernel makefile explicitly disables implicit rules with
>>> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
>>> selftest makefiles require implicit make rules, which is why
>>> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
>>> kselftest target") reenables implicit rules by clearing MAKEFLAGS if
>>> MAKELEVEL=1.
>>>
>>> So far so good.  However, if the toplevel makefile is called from an
>>> outer makefile then MAKELEVEL will be elevated, which breaks the
>>> MAKELEVEL equality test.
>>> Example wrapped makefile error:
>>>$ cat ~/Makefile
>>>all:
>>>  $(MAKE) defconfig
>>>  $(MAKE) kselftest-all
>>>$ make -sf ~/Makefile
>>>  futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h  
>>>  /src/tools/testing/selftests/kselftest.h ../include/futextest.h 
>>> ../include/atomic.h ../include/logging.h -lpthread -lrt -o 
>>> /src/tools/testing/selftests/futex/functional/futex_wait_timeout
>>>make[4]: futex_wait_timeout.c: Command not found
>>>
>>> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
>>> direct side effect of "make -R".  This enables arbitrary makefile
>>> nesting.
>>>
>>> Signed-off-by: Greg Thelen 
>>> ---
>>>   tools/testing/selftests/Makefile | 8 
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/Makefile 
>>> b/tools/testing/selftests/Makefile
>>> index 1195bd85af38..289a2e4b3f6f 100644
>>> --- a/tools/testing/selftests/Makefile
>>> +++ b/tools/testing/selftests/Makefile
>>> @@ -84,10 +84,10 @@ endif
>>>   # of the targets gets built.
>>>   FORCE_TARGETS ?=
>>>
>>> -# Clear LDFLAGS and MAKEFLAGS if called from main
>>> -# Makefile to avoid test build failures when test
>>> -# Makefile doesn't have explicit build rules.
>>> -ifeq (1,$(MAKELEVEL))
>>> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This 
>>> provides
>>> +# implicit rules to sub-test Makefiles which avoids build failures in test
>>> +# Makefile that don't have explicit build rules.
>>> +ifeq (,$(LINK.c))
>>>   override LDFLAGS =
>>>   override MAKEFLAGS =
>>>   endif
>>> --
>>> 2.28.0.rc0.142.g3c755180ce-goog
>> 
>> Is there any feedback on this patch?  It's not high priority but something 
>> that
>> will help me make more use of selftests.
>> 
>
> No issues with the patch. I will apply it once the merge window ends.
>
> thanks,
> -- Shuah

Great.  Thank you.


Re: + revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch added to -mm tree

2020-12-07 Thread Greg Thelen
a...@linux-foundation.org wrote:

> The patch titled
>  Subject: revert "mm/filemap: add static for function 
> __add_to_page_cache_locked"
> has been added to the -mm tree.  Its filename is
>  
> revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
>
> This patch should soon appear at
> 
> https://ozlabs.org/~akpm/mmots/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
> and later at
> 
> https://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
>
> Before you just go and hit "reply", please:
>a) Consider who else should be cc'ed
>b) Prefer to cc a suitable mailing list as well
>c) Ideally: find the original patch on the mailing list and do a
>   reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing 
> your code ***
>
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
>
> --
> From: Andrew Morton 
> Subject: revert "mm/filemap: add static for function 
> __add_to_page_cache_locked"
>
> Revert 3351b16af494 ("mm/filemap: add static for function
> __add_to_page_cache_locked") due to build issues with
> CONFIG_DEBUG_INFO_BTF=y.
>
> Link: 
> https://lkml.kernel.org/r/caadnvqj6tmzbxvtrobueh6qa0h+q7yaskxrvvvxhqr3kbzd...@mail.gmail.com
> Cc: Michal Kubecek 
> Cc: Justin Forbes 
> Cc: Alex Shi 
> Cc: Souptick Joarder 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Josef Bacik 
> Signed-off-by: Andrew Morton 

Thanks.

Tested-by: Greg Thelen 

> ---
>
>  mm/filemap.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 
> a/mm/filemap.c~revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked
> +++ a/mm/filemap.c
> @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
> -static noinline int __add_to_page_cache_locked(struct page *page,
> +noinline int __add_to_page_cache_locked(struct page *page,
>   struct address_space *mapping,
>   pgoff_t offset, gfp_t gfp,
>   void **shadowp)
> _
>
> Patches currently in -mm which might be from a...@linux-foundation.org are
>
> revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
> kthread_worker-document-cpu-hotplug-handling-fix.patch
> mm.patch
> mm-remove-the-unuseful-else-after-a-return-checkpatch-fixes.patch
> mm-prevent-gup_fast-from-racing-with-cow-during-fork-checkpatch-fixes.patch
> mm-swap_state-skip-meaningless-swap-cache-readahead-when-ra_infowin-==-0-fix.patch
> mm-vmallocc-__vmalloc_area_node-avoid-32-bit-overflow.patch
> mm-cma-improve-pr_debug-log-in-cma_release-fix.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> lib-cmdline_kunit-add-a-new-test-suite-for-cmdline-api-fix.patch
> ilog2-improve-ilog2-for-constant-arguments-checkpatch-fixes.patch
> lib-test_bitmapc-add-for_each_set_clump-test-cases-checkpatch-fixes.patch
> checkpatch-fix-typo_spelling-check-for-words-with-apostrophe-fix.patch
> resource-fix-kernel-doc-markups-checkpatch-fixes.patch
> linux-next-rejects.patch
> kmap-stupid-hacks-to-make-it-compile.patch
> init-kconfig-dont-assume-scripts-lld-versionsh-is-executable.patch
> set_memory-allow-set_direct_map__noflush-for-multiple-pages-fix.patch
> arch-mm-wire-up-memfd_secret-system-call-were-relevant-fix.patch
> kernel-forkc-export-kernel_thread-to-modules.patch


Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-03-19 Thread Greg Thelen
Dave Hansen  wrote:

> From: Dave Hansen 
>
> Anonymous pages are kept on their own LRU(s).  These lists could
> theoretically always be scanned and maintained.  But, without swap,
> there is currently nothing the kernel can *do* with the results of a
> scanned, sorted LRU for anonymous pages.
>
> A check for '!total_swap_pages' currently serves as a valid check as
> to whether anonymous LRUs should be maintained.  However, another
> method will be added shortly: page demotion.
>
> Abstract out the 'total_swap_pages' checks into a helper, give it a
> logically significant name, and check for the possibility of page
> demotion.

Reviewed-by: Greg Thelen 

> Signed-off-by: Dave Hansen 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> ---
>
>  b/mm/vmscan.c |   28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged  2021-03-04 15:35:58.935806422 
> -0800
> +++ b/mm/vmscan.c 2021-03-04 15:35:58.942806422 -0800
> @@ -2517,6 +2517,26 @@ out:
>   }
>  }
>  
> +/*
> + * Anonymous LRU management is a waste if there is
> + * ultimately no way to reclaim the memory.
> + */
> +bool anon_should_be_aged(struct lruvec *lruvec)

Should this be static?

> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> + /* Aging the anon LRU is valuable if swap is present: */
> + if (total_swap_pages > 0)
> + return true;
> +
> + /* Also valuable if anon pages can be demoted: */
> + if (next_demotion_node(pgdat->node_id) >= 0)
> + return true;
> +
> + /* No way to reclaim anon pages.  Should not age anon LRUs: */
> + return false;
> +}
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
>   unsigned long nr[NR_LRU_LISTS];
> @@ -2626,7 +2646,8 @@ static void shrink_lruvec(struct lruvec
>* Even if we did not try to evict anon pages at all, we want to
>* rebalance the anon lru active/inactive ratio.
>*/
> - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> + if (anon_should_be_aged(lruvec) &&
> + inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  sc, LRU_ACTIVE_ANON);
>  }
> @@ -3455,10 +3476,11 @@ static void age_active_anon(struct pglis
>   struct mem_cgroup *memcg;
>   struct lruvec *lruvec;
>  
> - if (!total_swap_pages)
> + lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +
> + if (!anon_should_be_aged(lruvec))
>   return;
>  
> - lruvec = mem_cgroup_lruvec(NULL, pgdat);
>   if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>   return;
>  
> _


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-07 Thread Greg Thelen
Alex Shi  wrote:

> 在 2020/11/11 上午3:50, Andrew Morton 写道:
>> On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder  
>> wrote:
>> 
>>> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi  wrote:

 Otherwise it cause gcc warning:
   ^~~
 ../mm/filemap.c:830:14: warning: no previous prototype for
 ‘__add_to_page_cache_locked’ [-Wmissing-prototypes]
  noinline int __add_to_page_cache_locked(struct page *page,
   ^~
>>>
>>> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ?
>> 
>> hm, yes.
>
> When the config is enabled, compiling looks good until the pahole tool is
> used to get BTF info, but I still failed even with a correct pahole
> version (> 1.16). Sorry.

I'm seeing an issue with this patch.  My build system has pahole v1.17,
but I don't think the pahole version is key.

$ git checkout 3351b16af494  # recently added to linus/master
$ make defconfig
$ make menuconfig  # set CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF
$ make V=1
+ ./tools/bpf/resolve_btfids/resolve_btfids vmlinux
FAILED unresolved symbol __add_to_page_cache_locked

Reverting 3351b16af494 ("mm/filemap: add static for function
__add_to_page_cache_locked") fixes the issue.

I don't see the warning which motivated this patch, but maybe it
requires a particular .config or gcc version.  Perhaps adding a
__add_to_page_cache_locked() prototype would avoid it.  But I haven't
studied BTF enough to know if there's a better answer.
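As an illustration of that alternative, a standalone sketch (hypothetical
names, not the real mm/filemap.c code): keep the function global so the
symbol stays visible to resolve_btfids, and silence -Wmissing-prototypes
with a declaration rather than with 'static'.

#include <stdio.h>

/* Hypothetical stand-in for __add_to_page_cache_locked(): the forward
 * declaration satisfies -Wmissing-prototypes while the symbol remains
 * global (non-static).
 */
int add_to_cache_locked(int page);

int add_to_cache_locked(int page)
{
	return page + 1;
}

int main(void)
{
	printf("%d\n", add_to_cache_locked(41));
	return 0;
}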

 Signed-off-by: Alex Shi 
 Cc: Andrew Morton 
 Cc: linux...@kvack.org
 Cc: linux-kernel@vger.kernel.org
 ---
  mm/filemap.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/mm/filemap.c b/mm/filemap.c
 index d90614f501da..249cf489f5df 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, struct 
 page *new, gfp_t gfp_mask)
  }
  EXPORT_SYMBOL_GPL(replace_page_cache_page);

 -noinline int __add_to_page_cache_locked(struct page *page,
 +static noinline int __add_to_page_cache_locked(struct page *page,
 struct address_space *mapping,
 pgoff_t offset, gfp_t gfp,
 void **shadowp)
>> 
>> It's unclear to me whether BTF_ID() requires that the target symbol be
>> non-static.  It doesn't actually reference the symbol:
>> 
>> #define BTF_ID(prefix, name) \
>> __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
>> 
>
> The above usage make me thought BTF don't require the symbol, though
> the symbol still exist in vmlinux with 'static'.
>
> So any comments of this, Alexei? 
>
>> Alexei, can you please comment?
>> 


Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

2020-12-01 Thread Greg Thelen
Axel Rasmussen  wrote:

> On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt  wrote:
>>
>> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen  
>> wrote:
>> >
>> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
>> > is that an ongoing trace event might race with the tracepoint being
>> > disabled (and therefore the _unreg() callback being called). Consider
>> > this ordering:
>> >
>> > T1: trace event fires, get_mm_memcg_path() is called
>> > T1: get_memcg_path_buf() returns a buffer pointer
>> > T2: trace_mmap_lock_unreg() is called, buffers are freed
>> > T1: cgroup_path() is called with the now-freed buffer
>>
>> Any reason to use the cgroup_path instead of the cgroup_ino? There are
>> other examples of trace points using cgroup_ino and no need to
>> allocate buffers. Also cgroup namespace might complicate the path
>> usage.
>
> Hmm, so in general I would love to use a numeric identifier instead of a 
> string.
>
> I did some reading, and it looks like the cgroup_ino() mainly has to
> do with writeback, instead of being just a general identifier?
> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>
> There is cgroup_id() which I think is almost what I'd want, but there
> are a couple problems with it:
>
> - I don't know of a way for userspace to translate IDs -> paths, to
> make them human readable?

The id => name map can be built from user space with a tree walk.
Example:

$ find /sys/fs/cgroup/memory -type d -printf '%i %P\n'  
20387 init.scope 
31 system.slice

> - Also I think the ID implementation we use for this is "dense",
> meaning if a cgroup is removed, its ID is likely to be quickly reused.
>
>>
>> >
>> > The solution in this commit is to modify trace_mmap_lock_unreg() to
>> > first stop new buffers from being handed out, and then to wait (spin)
>> > until any existing buffer references are dropped (i.e., those trace
>> > events complete).
>> >
>> > I have a simple reproducer program which spins up two pools of threads,
>> > doing the following in a tight loop:
>> >
>> >   Pool 1:
>> >   mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>> >MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
>> >   munmap()
>> >
>> >   Pool 2:
>> >   echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>> >   echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>> >
>> > This triggers the use-after-free very quickly. With this patch, I let it
>> > run for an hour without any BUGs.
>> >
>> > While fixing this, I also noticed and fixed a css ref leak. Previously
>> > we called get_mem_cgroup_from_mm(), but we never called css_put() to
>> > release that reference. get_mm_memcg_path() now does this properly.
>> >
>> > [1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a
>> >
>> > Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock 
>> > acquisition")
>>
>> The original patch is in mm tree, so the SHA1 is not stabilized.
>> Usually Andrew squash the fixes into the original patches.
>
> Ah, I added this because it also shows up in linux-next, under the
> next-20201130 tag. I'll remove it in v2, squashing is fine. :)
>
>>
>> > Signed-off-by: Axel Rasmussen 
>> > ---
>> >  mm/mmap_lock.c | 100 +
>> >  1 file changed, 85 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
>> > index 12af8f1b8a14..be38dc58278b 100644
>> > --- a/mm/mmap_lock.c
>> > +++ b/mm/mmap_lock.c
>> > @@ -3,6 +3,7 @@
>> >  #include 
>> >
>> >  #include 
>> > +#include 
>> >  #include 
>> >  #include 
>> >  #include 
>> > @@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
>> >  #ifdef CONFIG_MEMCG
>> >
>> >  /*
>> > - * Our various events all share the same buffer (because we don't want or 
>> > need
>> > - * to allocate a set of buffers *per event type*), so we need to protect 
>> > against
>> > - * concurrent _reg() and _unreg() calls, and count how many _reg() calls 
>> > have
>> > - * been made.
>> > + * This is unfortunately complicated... _reg() and _unreg() may be called
>> > + * in parallel, separately for each of our three event types. To save 
>> > memory,
>> > + * all of the event types share the same buffers. Furthermore, trace 
>> > events
>> > + * might happen in parallel with _unreg(); we need to ensure we don't 
>> > free the
>> > + * buffers before all inflights have finished. Because these events happen
>> > + * "frequently", we also want to prevent new inflights from starting once 
>> > the
>> > + * _unreg() process begins. And, for performance reasons, we want to 
>> > avoid any
>> > + * locking in the trace event path.
>> > + *
>> > + * So:
>> > + *
>> > + * - Use a spinlock to serialize _reg() and _unreg() calls.
>> > + * - Keep track of nested _reg() calls with a lock-protected counter.
>> > + * - Define a flag indicating whether or not unregistration
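As a rough userspace model of the scheme the changelog describes (stop
handing out buffers first, then spin until in-flight references drain),
and not the actual mm/mmap_lock.c implementation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool reg_active;	/* set by _reg(), cleared by _unreg() */
static atomic_int  inflight;	/* trace events currently holding a buffer */

/* Trace-event path: take a reference only while registration is active. */
static bool get_buf(void)
{
	atomic_fetch_add(&inflight, 1);
	if (!atomic_load(&reg_active)) {
		atomic_fetch_sub(&inflight, 1);	/* raced with _unreg(); back off */
		return false;
	}
	return true;
}

static void put_buf(void)
{
	atomic_fetch_sub(&inflight, 1);
}

/* _unreg() path: block new users, then wait for existing users to finish. */
static void trace_unreg(void)
{
	atomic_store(&reg_active, false);
	while (atomic_load(&inflight) > 0)
		;	/* the kernel version also serializes _reg()/_unreg() with a spinlock */
	/* only now is it safe to free the shared buffers */
}

int main(void)
{
	atomic_store(&reg_active, true);
	if (get_buf()) {
		puts("buffer in use");
		put_buf();
	}
	trace_unreg();
	puts("buffers freed safely");
	return 0;
}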

Re: [PATCH v2 1/3] percpu: add test module for various percpu operations

2013-11-07 Thread Greg Thelen
On Mon, Nov 04 2013, Andrew Morton wrote:

> On Sun, 27 Oct 2013 10:30:15 -0700 Greg Thelen  wrote:
>
>> Tests various percpu operations.
>
> Could you please take a look at the 32-bit build (this is i386):
>
> lib/percpu_test.c: In function 'percpu_test_init':
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:61: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:70: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:89: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:97: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type
> lib/percpu_test.c:112: warning: integer constant is too large for 'long' type

I was using gcc 4.6, which apparently adds the LL suffix as needed.  There
were also some other 32-bit code problems beyond the missing suffixes.  The
fixed version below was tested with both gcc 4.4 and gcc 4.6 on 32-bit and
64-bit x86.
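As a standalone illustration of the warning class (assuming a 32-bit
target and an older gcc such as 4.4; this is separate from the fixed
module below):

#include <stdio.h>

int main(void)
{
	/* On a 32-bit build with older gcc, a wide constant without a suffix
	 * can trigger "integer constant is too large for 'long' type"; an
	 * explicit LL/ULL suffix or a cast, as used in the v2 module, avoids it.
	 */
	unsigned long long ok = 0xbadc0ffee0ddf00dULL;	/* explicit suffix: warning-free */

	printf("%llx\n", ok);	/* %llx/%lld also avoids extra casting */
	return 0;
}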

---8<---

From a95bb1ce42b4492644fa10c7c80fd9bbd7bf23b9 Mon Sep 17 00:00:00 2001
In-Reply-To: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org>
References: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org>
From: Greg Thelen 
Date: Sun, 27 Oct 2013 10:30:15 -0700
Subject: [PATCH v2] percpu: add test module for various percpu operations

Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
Changelog since v1:
- use %lld/x which allows for less casting
- fix 32 bit build by casting large constants

 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 094f3152ec2b..1891eb271adf 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu operations test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build test module which validates per-cpu
+ operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb98adf..bb016e116ba4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 

Re: [PATCH -mm v2 2/2] mm: get rid of __GFP_KMEMCG

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Vladimir Davydov  wrote:

> Currently to allocate a page that should be charged to kmemcg (e.g.
> threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
> allocated is then to be freed by free_memcg_kmem_pages. Apart from
> looking asymmetrical, this also requires intrusion to the general
> allocation path. So let's introduce separate functions that will
> alloc/free pages charged to kmemcg.
>
> The new functions are called alloc_kmem_pages and free_kmem_pages. They
> should be used when the caller actually would like to use kmalloc, but
> has to fall back to the page allocator for the allocation is large. They
> only differ from alloc_pages and free_pages in that besides allocating
> or freeing pages they also charge them to the kmem resource counter of
> the current memory cgroup.
>
> Signed-off-by: Vladimir Davydov 
> ---
>  include/linux/gfp.h |   10 ---
>  include/linux/memcontrol.h  |2 +-
>  include/linux/slab.h|   11 
>  include/linux/thread_info.h |2 --
>  include/trace/events/gfpflags.h |1 -
>  kernel/fork.c   |6 ++---
>  mm/page_alloc.c |   56 
> ---
>  mm/slab_common.c|   12 +
>  mm/slub.c   |6 ++---
>  9 files changed, 60 insertions(+), 46 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 39b81dc7d01a..d382db71e300 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -31,7 +31,6 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL  0x2u
>  #define ___GFP_THISNODE  0x4u
>  #define ___GFP_RECLAIMABLE   0x8u
> -#define ___GFP_KMEMCG0x10u
>  #define ___GFP_NOTRACK   0x20u
>  #define ___GFP_NO_KSWAPD 0x40u
>  #define ___GFP_OTHER_NODE0x80u
> @@ -91,7 +90,6 @@ struct vm_area_struct;
>  
>  #define __GFP_NO_KSWAPD  ((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
> other node */
> -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from 
> a memcg-accounted resource */
>  #define __GFP_WRITE  ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
> dirty page */
>  
>  /*
> @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
> order,
>  #define alloc_page_vma_node(gfp_mask, vma, addr, node)   \
>   alloc_pages_vma(gfp_mask, 0, vma, addr, node)
>  
> +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
> +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
> +   unsigned int order);
> +
>  extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
>  extern unsigned long get_zeroed_page(gfp_t gfp_mask);
>  
> @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int 
> order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
> -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
> +extern void __free_kmem_pages(struct page *page, unsigned int order);
> +extern void free_kmem_pages(unsigned long addr, unsigned int order);
>  
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 29068dd26c3d..13acdb5259f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup 
> **memcg, int order)
>* res_counter_charge_nofail, but we hope those allocations are rare,
>* and won't be worth the trouble.
>*/
> - if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + if (gfp & __GFP_NOFAIL)
>   return true;
>   if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>   return true;
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 3dd389aa91c7..6d6959292e00 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -358,17 +358,6 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
>  #include 
>  #endif
>  
> -static __always_inline void *
> -kmalloc_order(size_t size, gfp_t flags, unsigned int order)
> -{
> - void *ret;
> -
> - flags |= (__GFP_COMP | __GFP_KMEMCG);
> - ret = (void *) __get_free_pages(flags, order);
> - kmemleak_alloc(ret, size, 1, flags);
> - return ret;
> -}
> -

Removing this from the header file breaks builds without
CONFIG_TRACING.
Example:
% make allnoconfig && make -j4 mm/
[...]
include/linux/slab.h: In function ‘kmalloc_order_trace’:
include/linux/slab.h:367:2: error: imp
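Per the v2.1 changelog further down, the fix was to add a missing
kmalloc_order() forward declaration.  A standalone illustration of this
breakage class (simplified types, not the actual include/linux/slab.h):

#include <stdio.h>

/* The v2.1 fix effectively restores a declaration like this; without it,
 * the !CONFIG_TRACING wrapper below fails to compile because
 * kmalloc_order() is undeclared.
 */
void *kmalloc_order(unsigned long size, unsigned int flags, unsigned int order);

#ifndef CONFIG_TRACING
static inline void *kmalloc_order_trace(unsigned long size, unsigned int flags,
					unsigned int order)
{
	return kmalloc_order(size, flags, order);
}
#endif

/* Toy definition so this sketch links as a standalone program. */
void *kmalloc_order(unsigned long size, unsigned int flags, unsigned int order)
{
	(void)size; (void)flags; (void)order;
	return NULL;
}

int main(void)
{
	printf("%p\n", kmalloc_order_trace(4096, 0, 0));
	return 0;
}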

Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Davidlohr Bueso  wrote:

> On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote:
>> >> > Ah-hah, that's interesting info.
>> >> >
>> >> > Let's make the default 64GB?
>> >>
>> >> 64GB is infinity at that time, but it no longer near infinity today. I 
>> >> like
>> >> very large or total memory proportional number.
>> >
>> > So I still like 0 for unlimited. Nice, clean and much easier to look at
>> > than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really
>> > don't see any disadvantages, as opposed to some other arbitrary value.
>> > Furthermore it wouldn't break userspace: any existing sysctl would
>> > continue to work, and if not set, the user never has to worry about this
>> > tunable again.
>> >
>> > Please let me know if you all agree with this...
>> 
>> Surething. Why not. :)
>
> *sigh* actually, the plot thickens a bit with SHMALL (total size of shm
> segments system wide, in pages). Currently by default:
>
> #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16))
>
> This deals with physical memory, at least admins are recommended to set
> it to some large percentage of ram / pagesize. So I think that if we
> loose control over the default value, users can potentially DoS the
> system, or at least cause excessive swapping if not manually set, but
> then again the same goes for anon mem... so do we care?

At least when there's an egregious anon leak the oom killer has the
power to free the memory by killing until the memory is unreferenced.
This isn't true for shm or tmpfs.  So shm is more effective than anon at
crushing a machine.


Re: [PATCH -mm v2 1/2] sl[au]b: charge slabs to kmemcg explicitly

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Vladimir Davydov  wrote:

> We have only a few places where we actually want to charge kmem so
> instead of intruding into the general page allocation path with
> __GFP_KMEMCG it's better to explictly charge kmem there. All kmem
> charges will be easier to follow that way.
>
> This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG
> from memcg caches' allocflags. Instead it makes slab allocation path
> call memcg_charge_kmem directly getting memcg to charge from the cache's
> memcg params.
>
> This also eliminates any possibility of misaccounting an allocation
> going from one memcg's cache to another memcg, because now we always
> charge slabs against the memcg the cache belongs to. That's why this
> patch removes the big comment to memcg_kmem_get_cache.
>
> Signed-off-by: Vladimir Davydov 

Acked-by: Greg Thelen 


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread Greg Thelen

On Tue, Apr 01 2014, Kamezawa Hiroyuki  wrote:

>> On Tue, Apr 01 2014, Davidlohr Bueso  wrote:
>> 
>>> On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote:
>>>>>>> Ah-hah, that's interesting info.
>>>>>>>
>>>>>>> Let's make the default 64GB?
>>>>>>
>>>>>> 64GB is infinity at that time, but it no longer near infinity today. I 
>>>>>> like
>>>>>> very large or total memory proportional number.
>>>>>
>>>>> So I still like 0 for unlimited. Nice, clean and much easier to look at
>>>>> than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really
>>>>> don't see any disadvantages, as opposed to some other arbitrary value.
>>>>> Furthermore it wouldn't break userspace: any existing sysctl would
>>>>> continue to work, and if not set, the user never has to worry about this
>>>>> tunable again.
>>>>>
>>>>> Please let me know if you all agree with this...
>>>>
>>>> Surething. Why not. :)
>>>
>>> *sigh* actually, the plot thickens a bit with SHMALL (total size of shm
>>> segments system wide, in pages). Currently by default:
>>>
>>> #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16))
>>>
>>> This deals with physical memory, at least admins are recommended to set
>>> it to some large percentage of ram / pagesize. So I think that if we
>>> loose control over the default value, users can potentially DoS the
>>> system, or at least cause excessive swapping if not manually set, but
>>> then again the same goes for anon mem... so do we care?
>> 
> (2014/04/02 10:08), Greg Thelen wrote:
>> 
>> At least when there's an egregious anon leak the oom killer has the
>> power to free the memory by killing until the memory is unreferenced.
>> This isn't true for shm or tmpfs.  So shm is more effective than anon at
>> crushing a machine.
>
> Hm..sysctl.kernel.shm_rmid_forced won't work with oom-killer ?
>
> http://www.openwall.com/lists/kernel-hardening/2011/07/26/7
>
> I like to handle this kind of issue under memcg but hmm..tmpfs's limit is half
> of memory at default.

Ah, yes. I forgot about shm_rmid_forced.  Thanks.  It would give the oom
killer the ability to clean up shm (as it does with anon) when
shm_rmid_forced=1.


Re: [PATCH -mm v2.1] mm: get rid of __GFP_KMEMCG

2014-04-02 Thread Greg Thelen

On Tue, Apr 01 2014, Vladimir Davydov  wrote:

> Currently to allocate a page that should be charged to kmemcg (e.g.
> threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page
> allocated is then to be freed by free_memcg_kmem_pages. Apart from
> looking asymmetrical, this also requires intrusion to the general
> allocation path. So let's introduce separate functions that will
> alloc/free pages charged to kmemcg.
>
> The new functions are called alloc_kmem_pages and free_kmem_pages. They
> should be used when the caller actually would like to use kmalloc, but
> has to fall back to the page allocator for the allocation is large. They
> only differ from alloc_pages and free_pages in that besides allocating
> or freeing pages they also charge them to the kmem resource counter of
> the current memory cgroup.
>
> Signed-off-by: Vladimir Davydov 

One comment nit below, otherwise looks good to me.

Acked-by: Greg Thelen 

> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Glauber Costa 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> ---
> Changes in v2.1:
>  - add missing kmalloc_order forward declaration; lacking it caused
>compilation breakage with CONFIG_TRACING=n
>
>  include/linux/gfp.h |   10 ---
>  include/linux/memcontrol.h  |2 +-
>  include/linux/slab.h|   11 +---
>  include/linux/thread_info.h |2 --
>  include/trace/events/gfpflags.h |1 -
>  kernel/fork.c   |6 ++---
>  mm/page_alloc.c |   56 
> ---
>  mm/slab_common.c|   12 +
>  mm/slub.c   |6 ++---
>  9 files changed, 61 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 39b81dc7d01a..d382db71e300 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -31,7 +31,6 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL  0x2u
>  #define ___GFP_THISNODE  0x4u
>  #define ___GFP_RECLAIMABLE   0x8u
> -#define ___GFP_KMEMCG0x10u
>  #define ___GFP_NOTRACK   0x20u
>  #define ___GFP_NO_KSWAPD 0x40u
>  #define ___GFP_OTHER_NODE0x80u
> @@ -91,7 +90,6 @@ struct vm_area_struct;
>  
>  #define __GFP_NO_KSWAPD  ((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
> other node */
> -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from 
> a memcg-accounted resource */
>  #define __GFP_WRITE  ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
> dirty page */
>  
>  /*
> @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
> order,
>  #define alloc_page_vma_node(gfp_mask, vma, addr, node)   \
>   alloc_pages_vma(gfp_mask, 0, vma, addr, node)
>  
> +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order);
> +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask,
> +   unsigned int order);
> +
>  extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
>  extern unsigned long get_zeroed_page(gfp_t gfp_mask);
>  
> @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int 
> order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
> -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
> +extern void __free_kmem_pages(struct page *page, unsigned int order);
> +extern void free_kmem_pages(unsigned long addr, unsigned int order);
>  
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 29068dd26c3d..13acdb5259f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup 
> **memcg, int order)
>* res_counter_charge_nofail, but we hope those allocations are rare,
>* and won't be worth the trouble.
>*/

Just a few lines higher, in the first memcg_kmem_newpage_charge() comment,
there is a leftover reference to GFP_KMEMCG which should be removed.


[PATCH] dm bufio: fully initialize shrinker

2014-07-31 Thread Greg Thelen
1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to
struct shrinker assuming that all shrinkers were zero filled.  The dm
bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data
in flags.  So far the only defined flags bit is SHRINKER_NUMA_AWARE.
But there are proposed patches which add other bits to shrinker.flags
(e.g. memcg awareness).

Rather than simply initializing the shrinker, this patch uses kzalloc()
when allocating the dm_bufio_client to ensure that the embedded shrinker
and any other similar structures are zeroed.

This fixes theoretical over aggressive shrinking of dm bufio objects.
If the uninitialized dm_bufio_client.shrinker.flags contains
SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for
each numa node rather than just once.  This has been broken since 3.12.

Signed-off-by: Greg Thelen 
---
 drivers/md/dm-bufio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 4e84095833db..d724459860d9 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1541,7 +1541,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct 
block_device *bdev, unsign
BUG_ON(block_size < 1 << SECTOR_SHIFT ||
   (block_size & (block_size - 1)));
 
-   c = kmalloc(sizeof(*c), GFP_KERNEL);
+   c = kzalloc(sizeof(*c), GFP_KERNEL);
if (!c) {
r = -ENOMEM;
goto bad_client;
-- 
2.0.0.526.g5318336
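For illustration only (this is not drivers/md/dm-bufio.c): a standalone
sketch of the two options weighed above, zeroing just the embedded
shrinker by hand versus handing back a fully zeroed allocation the way
kzalloc() does.

#include <stdlib.h>
#include <string.h>

struct shrinker_like {
	unsigned long flags;		/* e.g. a NUMA-aware bit */
};

struct bufio_client_like {
	int block_size;
	struct shrinker_like shrinker;	/* embedded; must not hold stale heap data */
};

/* Option the patch passes over: zero the shrinker field by hand. */
static struct bufio_client_like *client_create_field_init(void)
{
	struct bufio_client_like *c = malloc(sizeof(*c));

	if (c) {
		memset(&c->shrinker, 0, sizeof(c->shrinker));
		c->block_size = 4096;
	}
	return c;
}

/* Option the patch takes: a zeroed allocation (kzalloc()-style), so the
 * shrinker and any similar embedded structures start out cleared.
 */
static struct bufio_client_like *client_create_zeroed(void)
{
	struct bufio_client_like *c = calloc(1, sizeof(*c));

	if (c)
		c->block_size = 4096;
	return c;
}

int main(void)
{
	free(client_create_field_init());
	free(client_create_zeroed());
	return 0;
}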



Re: [RFC] memory cgroup: weak points of kmem accounting design

2014-09-17 Thread Greg Thelen

On Tue, Sep 16 2014, Vladimir Davydov wrote:

> Hi Suleiman,
>
> On Mon, Sep 15, 2014 at 12:13:33PM -0700, Suleiman Souhlal wrote:
>> On Mon, Sep 15, 2014 at 3:44 AM, Vladimir Davydov
>>  wrote:
>> > Hi,
>> >
>> > I'd like to discuss downsides of the kmem accounting part of the memory
>> > cgroup controller and a possible way to fix them. I'd really appreciate
>> > if you could share your thoughts on it.
>> >
>> > The idea lying behind the kmem accounting design is to provide each
>> > memory cgroup with its private copy of every kmem_cache and list_lru
>> > it's going to use. This is implemented by bundling these structures with
>> > arrays storing per-memcg copies. The arrays are referenced by css id.
>> > When a process in a cgroup tries to allocate an object from a kmem cache
>> > we first find out which cgroup the process resides in, then look up the
>> > cache copy corresponding to the cgroup, and finally allocate a new
>> > object from the private cache. Similarly, on addition/deletion of an
>> > object from a list_lru, we first obtain the kmem cache the object was
>> > allocated from, then look up the memory cgroup which the cache belongs
>> > to, and finally add/remove the object from the private copy of the
>> > list_lru corresponding to the cgroup.
>> >
>> > Though simple it looks from the first glance, it has a number of serious
>> > weaknesses:
>> >
>> >  - Providing each memory cgroup with its own kmem cache increases
>> >external fragmentation.
>> 
>> I haven't seen any evidence of this being a problem (but that doesn't
>> mean it doesn't exist).
>
> Actually, it's rather speculative. For example, if we have say a hundred
> of extra objects per cache (fragmented or in per-cpu stocks) of size 256
> bytes, then for one cache the overhead would be 25K, which is
> negligible. Now if there are thousand cgroups using the cache, we have
> to pay 25M, which is noticeable. Anyway, to estimate this exactly, one
> needs to run a typical workload inside a cgroup.
>
>>
>> >  - SLAB isn't ready to deal with thousands of caches: its algorithm
>> >walks over all system caches and shrinks them periodically, which may
>> >be really costly if we have thousands active memory cgroups.
>> 
>> This could be throttled.
>
> It could be, but then we'd have more objects in per-cpu stocks, which
> means more memory overhead.
>
>> 
>> >
>> >  - Caches may now be created/destroyed frequently and from various
>> >places: on system cache destruction, on cgroup offline, from a work
>> >struct scheduled by kmalloc. Synchronizing them properly is really
>> >difficult. I've fixed some places, but it's still desperately buggy.
>> 
>> Agreed.
>> 
>> >  - It's hard to determine when we should destroy a cache that belongs to
>> >a dead memory cgroup. The point is both SLAB and SLUB implementations
>> >always keep some pages in stock for performance reasons, so just
>> >scheduling cache destruction work from kfree once the last slab page
>> >is freed isn't enough - it will normally never happen for SLUB and
>> >may take really long for SLAB. Of course, we can forbid SL[AU]B
>> >algorithm to stock pages in dead caches, but it looks ugly and has
>> >negative impact on performance (I did this, but finally decided to
>> >revert). Another approach could be scanning dead caches periodically
>> >or on memory pressure, but that would be ugly too.
>> 
>> Not sure about slub, but for SLAB doesn't cache_reap take care of that?
>
> It is, but it takes some time. If we decide to throttle it, then it'll
> take even longer. Anyway, SLUB has nothing like that, therefore we'd
> have to handle different algorithms in different ways, which I
> particularly dislike.
>
>> 
>> >
>> >  - The arrays for storing per-memcg copies can get really large,
>> >especially if we finally decide to leave dead memory cgroups hanging
>> >until memory pressure reaps objects assigned to them and let them
>> >free. How can we deal with an array of, say, 20K elements? Simply
>> >allocating them with kmal^W vmalloc will result in memory wastes. It
>> >will be particularly funny if the user wants to provide each cgroup
>> >with a separate mount point: each super block will have a list_lru
>> >for every memory cgroup, but only one of them will be really used.
>> >That said we need a kind of dynamic reclaimable arrays. Radix trees
>> >would fit, but they are way slower than plain arrays, which is a
>> >no-go, because we want to look up on each kmalloc, list_lru_add/del,
>> >which are fast paths.
>> 
>> The initial design we had was to have an array indexed by "cache id"
>> in struct memcg, instead of the current array indexed by "css id" in
>> struct kmem_cache.
>> The initial design doesn't have the problem you're describing here, as
>> far as I can tell.
>
> It is indexed by "cache id", not "css id", but it doesn't matter
> actually. Suppose, when a cgroup is taken offli

Re: [patch] mm: memcontrol: support transparent huge pages under pressure

2014-09-22 Thread Greg Thelen

On Fri, Sep 19 2014, Johannes Weiner wrote:

> In a memcg with even just moderate cache pressure, success rates for
> transparent huge page allocations drop to zero, wasting a lot of
> effort that the allocator puts into assembling these pages.
>
> The reason for this is that the memcg reclaim code was never designed
> for higher-order charges.  It reclaims in small batches until there is
> room for at least one page.  Huge pages charges only succeed when
> these batches add up over a series of huge faults, which is unlikely
> under any significant load involving order-0 allocations in the group.
>
> Remove that loop on the memcg side in favor of passing the actual
> reclaim goal to direct reclaim, which is already set up and optimized
> to meet higher-order goals efficiently.
>
> This brings memcg's THP policy in line with the system policy: if the
> allocator painstakingly assembles a hugepage, memcg will at least make
> an honest effort to charge it.  As a result, transparent hugepage
> allocation rates amid cache activity are drastically improved:
>
>   vanilla patched
> pgalloc 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
> pgfault  491370.60 (  +0.00%)225477.40 ( -54.11%)
> pgmajfault2.00 (  +0.00%) 1.80 (  -6.67%)
> thp_fault_alloc   0.00 (  +0.00%)   531.60 (+100.00%)
> thp_fault_fallback  749.00 (  +0.00%)   217.40 ( -70.88%)
>
> [ Note: this may in turn increase memory consumption from internal
>   fragmentation, which is an inherent risk of transparent hugepages.
>   Some setups may have to adjust the memcg limits accordingly to
>   accomodate this - or, if the machine is already packed to capacity,
>   disable the transparent huge page feature. ]

We're using an earlier version of this patch, so I approve of the
general direction.  But I have some feedback.

The memsw aspect of this change seems somewhat separate.  Can it be
split into a different patch?

The memsw aspect of this patch seems to change behavior.  Is this
intended?  If so, a mention of it in the commit log would assuage the
reader.  I'll explain...  Assume a machine with swap enabled and
res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
is that memsw.usage represents sum(ram_usage, swap_usage).  So when
memsw_is_minimum=true, then both swap_usage=0 and
memsw.usage==res.usage.  In this condition, if res usage is at limit
then there's no point in swapping because memsw.usage is already
maximal.  Prior to this patch I think the kernel did the right thing,
but not afterwards.

Before this patch:
  if res.usage == res.limit, try_charge() indirectly calls
  try_to_free_mem_cgroup_pages(noswap=true)

After this patch:
  if res.usage == res.limit, try_charge() calls
  try_to_free_mem_cgroup_pages(may_swap=true)

Notice the inverted swap-is-allowed value.

I haven't had time to look at your other outstanding memcg patches.
These comments were made with this patch in isolation.
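To make the inversion concrete, a toy model of the two behaviors described
above (not the real try_charge() or try_to_free_mem_cgroup_pages() code):

#include <stdbool.h>
#include <stdio.h>

struct counter { unsigned long usage, limit; };

struct memcg_like {
	struct counter res;	/* ram */
	struct counter memsw;	/* ram + swap */
};

/* Before the patch, as described above: when res.limit == memsw.limit,
 * swapping cannot lower memsw.usage, so reclaim runs with noswap=true.
 */
static bool may_swap_before(const struct memcg_like *m)
{
	bool memsw_is_minimum = (m->res.limit == m->memsw.limit);

	return !memsw_is_minimum;
}

/* After the patch, taken in isolation: reclaim is asked to swap even in
 * that case, which is the inverted value being pointed out.
 */
static bool may_swap_after(const struct memcg_like *m)
{
	(void)m;
	return true;
}

int main(void)
{
	struct memcg_like m = {
		.res   = { .usage = 1000, .limit = 1000 },
		.memsw = { .usage = 1000, .limit = 1000 },
	};

	printf("before: may_swap=%d, after: may_swap=%d\n",
	       may_swap_before(&m), may_swap_after(&m));
	return 0;
}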

> Signed-off-by: Johannes Weiner 
> ---
>  include/linux/swap.h |  6 ++--
>  mm/memcontrol.c  | 86 
> +++-
>  mm/vmscan.c  |  7 +++--
>  3 files changed, 25 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ea4f926e6b9b..37a585beef5c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct 
> page *page,
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> -   gfp_t gfp_mask, bool noswap);
> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> +   unsigned long nr_pages,
> +   gfp_t gfp_mask,
> +   bool may_swap);
>  extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>   gfp_t gfp_mask, bool noswap,
>   struct zone *zone,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9431024e490c..e2def11f1ec1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -315,9 +315,6 @@ struct mem_cgroup {
>   /* OOM-Killer disable */
>   int oom_kill_disable;
>  
> - /* set when res.limit == memsw.limit */
> - boolmemsw_is_minimum;
> -
>   /* protect arrays of thresholds */
>   struct mutex thresholds_lock;
>  
> @@ -481,14 +478,6 @@ enum res_type {
>  #define OOM_CONTROL  (0)
>  
>  /*
> - * Reclaim f

Re: [patch] mm: memcontrol: support transparent huge pages under pressure

2014-09-23 Thread Greg Thelen

On Tue, Sep 23 2014, Johannes Weiner wrote:

> On Mon, Sep 22, 2014 at 10:52:50PM -0700, Greg Thelen wrote:
>> 
>> On Fri, Sep 19 2014, Johannes Weiner wrote:
>> 
>> > In a memcg with even just moderate cache pressure, success rates for
>> > transparent huge page allocations drop to zero, wasting a lot of
>> > effort that the allocator puts into assembling these pages.
>> >
>> > The reason for this is that the memcg reclaim code was never designed
>> > for higher-order charges.  It reclaims in small batches until there is
>> > room for at least one page.  Huge pages charges only succeed when
>> > these batches add up over a series of huge faults, which is unlikely
>> > under any significant load involving order-0 allocations in the group.
>> >
>> > Remove that loop on the memcg side in favor of passing the actual
>> > reclaim goal to direct reclaim, which is already set up and optimized
>> > to meet higher-order goals efficiently.
>> >
>> > This brings memcg's THP policy in line with the system policy: if the
>> > allocator painstakingly assembles a hugepage, memcg will at least make
>> > an honest effort to charge it.  As a result, transparent hugepage
>> > allocation rates amid cache activity are drastically improved:
>> >
>> >   vanilla patched
>> > pgalloc 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
>> > pgfault  491370.60 (  +0.00%)225477.40 ( -54.11%)
>> > pgmajfault2.00 (  +0.00%) 1.80 (  -6.67%)
>> > thp_fault_alloc   0.00 (  +0.00%)   531.60 (+100.00%)
>> > thp_fault_fallback  749.00 (  +0.00%)   217.40 ( -70.88%)
>> >
>> > [ Note: this may in turn increase memory consumption from internal
>> >   fragmentation, which is an inherent risk of transparent hugepages.
>> >   Some setups may have to adjust the memcg limits accordingly to
>> >   accomodate this - or, if the machine is already packed to capacity,
>> >   disable the transparent huge page feature. ]
>> 
>> We're using an earlier version of this patch, so I approve of the
>> general direction.  But I have some feedback.
>> 
>> The memsw aspect of this change seems somewhat separate.  Can it be
>> split into a different patch?
>> 
>> The memsw aspect of this patch seems to change behavior.  Is this
>> intended?  If so, a mention of it in the commit log would assuage the
>> reader.  I'll explain...  Assume a machine with swap enabled and
>> res.limit==memsw.limit, thus memsw_is_minimum is true.  My understanding
>> is that memsw.usage represents sum(ram_usage, swap_usage).  So when
>> memsw_is_minimum=true, then both swap_usage=0 and
>> memsw.usage==res.usage.  In this condition, if res usage is at limit
>> then there's no point in swapping because memsw.usage is already
>> maximal.  Prior to this patch I think the kernel did the right thing,
>> but not afterwards.
>> 
>> Before this patch:
>>   if res.usage == res.limit, try_charge() indirectly calls
>>   try_to_free_mem_cgroup_pages(noswap=true)
>> 
>> After this patch:
>>   if res.usage == res.limit, try_charge() calls
>>   try_to_free_mem_cgroup_pages(may_swap=true)
>> 
>> Notice the inverted swap-is-allowed value.
>
> For some reason I had myself convinced that this is dead code due to a
> change in callsites a long time ago, but you are right that currently
> try_charge() relies on it, thanks for pointing it out.
>
> However, memsw is always equal to or bigger than the memory limit - so
> instead of keeping a separate state variable to track when memory
> failure implies memsw failure, couldn't we just charge memsw first?
>
> How about the following?  But yeah, I'd split this into a separate
> patch now.

Looks good to me.  Thanks.

Acked-by: Greg Thelen 

> ---
>  mm/memcontrol.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..7c9a8971d0f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2497,16 +2497,17 @@ retry:
>   goto done;
>  
>   size = batch * PAGE_SIZE;
> - if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> - if (!do_swap_account)
> + if (!do_swap_account ||
> + !res_counter_charge(&memcg->memsw, size, &fail_res)) {
> + if (!res_counter_charge(&
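
To make the proposed ordering concrete, here is a stand-alone user-space model
(not the kernel code above; counter_t, charge(), uncharge() and
try_charge_model() are invented for illustration).  The point is that the
reclaim hint falls out of which counter rejects the charge: a memsw failure
means swapping cannot help, while a memory-only failure means it can, so no
separate memsw_is_minimum state is needed.

  #include <stdbool.h>
  #include <stdio.h>

  typedef struct { long usage, limit; } counter_t;

  static bool charge(counter_t *c, long nr)
  {
          if (c->usage + nr > c->limit)
                  return false;           /* counter full, charge rejected */
          c->usage += nr;
          return true;
  }

  static void uncharge(counter_t *c, long nr)
  {
          c->usage -= nr;
  }

  /* 0 on success; on failure *may_swap tells reclaim whether swap can help */
  static int try_charge_model(counter_t *mem, counter_t *memsw,
                              bool do_swap_account, long nr, bool *may_swap)
  {
          if (do_swap_account && !charge(memsw, nr)) {
                  *may_swap = false;      /* mem+swap full: swap cannot help */
                  return -1;
          }
          if (!charge(mem, nr)) {
                  if (do_swap_account)
                          uncharge(memsw, nr);
                  *may_swap = true;       /* memsw had room: swap frees memory */
                  return -1;
          }
          return 0;
  }

  int main(void)
  {
          counter_t mem, memsw;
          bool may_swap;

          /* limit == memsw.limit, i.e. the old memsw_is_minimum situation */
          mem = (counter_t){ .usage = 500, .limit = 500 };
          memsw = (counter_t){ .usage = 500, .limit = 500 };
          if (try_charge_model(&mem, &memsw, true, 1, &may_swap))
                  printf("equal limits:   may_swap=%d\n", may_swap);   /* 0 */

          /* memsw has headroom, only the memory counter is full */
          mem = (counter_t){ .usage = 500, .limit = 500 };
          memsw = (counter_t){ .usage = 500, .limit = 1000 };
          if (try_charge_model(&mem, &memsw, true, 1, &may_swap))
                  printf("memsw headroom: may_swap=%d\n", may_swap);   /* 1 */

          return 0;
  }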

Re: [PATCH v2 0/4] memcg: Low-limit reclaim

2014-04-29 Thread Greg Thelen

On Mon, Apr 28 2014, Roman Gushchin  wrote:

> 28.04.2014, 16:27, "Michal Hocko" :
>> The series is based on top of the current mmotm tree. Once the series
>> gets accepted I will post a patch which will mark the soft limit as
>> deprecated with a note that it will be eventually dropped. Let me know
>> if you would prefer to have such a patch a part of the series.
>>
>> Thoughts?
>
>
> Looks good to me.
>
> The only question is: are there any ideas how the hierarchy support
> will be used in this case in practice?
> Will someone set low limit for non-leaf cgroups? Why?
>
> Thanks,
> Roman

I imagine that a hosting service may want to give X MB to a top level
memcg (/a) with sub-jobs (/a/b, /a/c) which may(not) have their own
low-limits.

Examples:

case_1) only set low limit on /a.  /a/b and /a/c may overcommit /a's
memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes).

case_2) low limits on all memcgs, but without overcommitting low limits
(b.low_limit_in_bytes + c.low_limit_in_bytes <=
a.low_limit_in_bytes).


Re: [PATCH] memcg: deprecate memory.force_empty knob

2014-05-16 Thread Greg Thelen

On Tue, May 13 2014, Michal Hocko  wrote:

> force_empty has been introduced primarily to drop memory before it gets
> reparented on the group removal. This alone doesn't sound fully
> justified because reparented pages which are not in use can be reclaimed
> also later when there is a memory pressure on the parent level.
>
> Mark the knob CFTYPE_INSANE which tells the cgroup core that it
> shouldn't create the knob with the experimental sane_behavior. Other
> users will get informed about the deprecation and asked to tell us more
> because I do not expect most users will use sane_behavior cgroups mode
> very soon.
> Anyway I expect that most users will be simply cgroup remove handlers
> which do that since ever without having any good reason for it.
>
> If somebody really cares because reparented pages, which would be
> dropped otherwise, push out more important ones then we should fix the
> reparenting code and put pages to the tail.

I should mention a case where I've needed to use memory.force_empty: to
synchronously flush stats from child to parent.  Without force_empty
memory.stat is temporarily inconsistent until async css_offline
reparents charges.  Here is an example on v3.14 showing that
parent/memory.stat contents are in flux immediately after rmdir of
parent/child.

$ cat /test
#!/bin/bash

# Create parent and child.  Add some non-reclaimable anon rss to child,
# then move running task to parent.
mkdir p p/c
(echo $BASHPID > p/c/cgroup.procs && exec sleep 1d) &
pid=$!
sleep 1
echo $pid > p/cgroup.procs 

grep 'rss ' {p,p/c}/memory.stat
if [[ $1 == force ]]; then
  echo 1 > p/c/memory.force_empty
fi
rmdir p/c

echo 'For a small time the p/c memory has not been reparented to p.'
grep 'rss ' {p,p/c}/memory.stat

sleep 1
echo 'After waiting all memory has been reparented'
grep 'rss ' {p,p/c}/memory.stat

kill $pid
rmdir p


-- First, demonstrate that just rmdir, without memory.force_empty,
   temporarily hides reparented child memory stats.

$ /test
p/memory.stat:rss 0
p/memory.stat:total_rss 69632
p/c/memory.stat:rss 69632
p/c/memory.stat:total_rss 69632
For a small time the p/c memory has not been reparented to p.
p/memory.stat:rss 0
p/memory.stat:total_rss 0
grep: p/c/memory.stat: No such file or directory
After waiting all memory has been reparented
p/memory.stat:rss 69632
p/memory.stat:total_rss 69632
grep: p/c/memory.stat: No such file or directory
/test: Terminated  ( echo $BASHPID > p/c/cgroup.procs && exec sleep 1d )

-- Demonstrate that using memory.force_empty before rmdir, behaves more
   sensibly.  Stats for reparented child memory are not hidden.

$ /test force
p/memory.stat:rss 0
p/memory.stat:total_rss 69632
p/c/memory.stat:rss 69632
p/c/memory.stat:total_rss 69632
For a small time the p/c memory has not been reparented to p.
p/memory.stat:rss 69632
p/memory.stat:total_rss 69632
grep: p/c/memory.stat: No such file or directory
After waiting all memory has been reparented
p/memory.stat:rss 69632
p/memory.stat:total_rss 69632
grep: p/c/memory.stat: No such file or directory
/test: Terminated  ( echo $BASHPID > p/c/cgroup.procs && exec sleep 1d )


Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg

2014-03-26 Thread Greg Thelen

On Wed, Mar 26 2014, Vladimir Davydov  wrote:

> We don't track any random page allocation, so we shouldn't track kmalloc
> that falls back to the page allocator.

This seems like a change which will lead to confusing (and arguably
improper) kernel behavior.  I prefer the behavior prior to this patch.

Before this change both of the following allocations are charged to
memcg (assuming kmem accounting is enabled):
 a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
 b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)

After this change only 'a' is charged; 'b' goes directly to page
allocator which no longer does accounting.
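
For a concrete picture of the before/after behavior, a tiny user-space model
(KMALLOC_MAX_CACHE_SIZE is the real SLUB constant, but the 8KiB value and the
two charged_*() helpers below are assumptions made only for illustration):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  /* SLUB's 2-page kmalloc cache limit on x86_64 (assumed value) */
  #define KMALLOC_MAX_CACHE_SIZE (8UL * 1024)

  /* before the patch: every kmalloc() is charged to the memcg */
  static bool charged_before(size_t size) { (void)size; return true; }

  /* after the patch: only slab-cache-backed sizes are charged; larger
     requests fall back to the page allocator and escape accounting */
  static bool charged_after(size_t size)
  {
          return size <= KMALLOC_MAX_CACHE_SIZE;
  }

  int main(void)
  {
          size_t a = KMALLOC_MAX_CACHE_SIZE;
          size_t b = KMALLOC_MAX_CACHE_SIZE + 1;

          printf("a (%zu): before=%d after=%d\n", a, charged_before(a),
                 charged_after(a));
          printf("b (%zu): before=%d after=%d\n", b, charged_before(b),
                 charged_after(b));
          return 0;
  }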

> Signed-off-by: Vladimir Davydov 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Glauber Costa 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> ---
>  include/linux/slab.h |2 +-
>  mm/memcontrol.c  |   27 +--
>  mm/slub.c|4 ++--
>  3 files changed, 4 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 3dd389aa91c7..8a928ff71d93 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -363,7 +363,7 @@ kmalloc_order(size_t size, gfp_t flags, unsigned int 
> order)
>  {
>   void *ret;
>  
> - flags |= (__GFP_COMP | __GFP_KMEMCG);
> + flags |= __GFP_COMP;
>   ret = (void *) __get_free_pages(flags, order);
>   kmemleak_alloc(ret, size, 1, flags);
>   return ret;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b4b6aef562fa..81a162d01d4d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3528,35 +3528,10 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct 
> mem_cgroup **_memcg, int order)
>  
>   *_memcg = NULL;
>  
> - /*
> -  * Disabling accounting is only relevant for some specific memcg
> -  * internal allocations. Therefore we would initially not have such
> -  * check here, since direct calls to the page allocator that are marked
> -  * with GFP_KMEMCG only happen outside memcg core. We are mostly
> -  * concerned with cache allocations, and by having this test at
> -  * memcg_kmem_get_cache, we are already able to relay the allocation to
> -  * the root cache and bypass the memcg cache altogether.
> -  *
> -  * There is one exception, though: the SLUB allocator does not create
> -  * large order caches, but rather service large kmallocs directly from
> -  * the page allocator. Therefore, the following sequence when backed by
> -  * the SLUB allocator:
> -  *
> -  *  memcg_stop_kmem_account();
> -  *  kmalloc()
> -  *  memcg_resume_kmem_account();
> -  *
> -  * would effectively ignore the fact that we should skip accounting,
> -  * since it will drive us directly to this function without passing
> -  * through the cache selector memcg_kmem_get_cache. Such large
> -  * allocations are extremely rare but can happen, for instance, for the
> -  * cache arrays. We bring this test here.
> -  */
> - if (!current->mm || current->memcg_kmem_skip_account)
> + if (!current->mm)
>   return true;
>  
>   memcg = get_mem_cgroup_from_mm(current->mm);
> -
>   if (!memcg_can_account_kmem(memcg)) {
>   css_put(&memcg->css);
>   return true;
> diff --git a/mm/slub.c b/mm/slub.c
> index 5e234f1f8853..c2e58a787443 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3325,7 +3325,7 @@ static void *kmalloc_large_node(size_t size, gfp_t 
> flags, int node)
>   struct page *page;
>   void *ptr = NULL;
>  
> - flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG;
> + flags |= __GFP_COMP | __GFP_NOTRACK;
>   page = alloc_pages_node(node, flags, get_order(size));
>   if (page)
>   ptr = page_address(page);
> @@ -3395,7 +3395,7 @@ void kfree(const void *x)
>   if (unlikely(!PageSlab(page))) {
>   BUG_ON(!PageCompound(page));
>   kfree_hook(x);
> - __free_memcg_kmem_pages(page, compound_order(page));
> + __free_pages(page, compound_order(page));
>   return;
>   }
>   slab_free(page->slab_cache, page, object, _RET_IP_);
> -- 
> 1.7.10.4



[PATCH] memcg: remove lookup_cgroup_page() prototype

2014-06-19 Thread Greg Thelen
6b208e3f6e35 ("mm: memcg: remove unused node/section info from
pc->flags") deleted the lookup_cgroup_page() function but left a
prototype for it.

Kill the vestigial prototype.

Signed-off-by: Greg Thelen 
---
 include/linux/page_cgroup.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..0ff470de3c12 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -42,7 +42,6 @@ static inline void __init page_cgroup_init(void)
 #endif
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
-struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
 #define TESTPCGFLAG(uname, lname)  \
 static inline int PageCgroup##uname(struct page_cgroup *pc)\
-- 
2.0.0.526.g5318336



Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups

2014-07-09 Thread Greg Thelen
On Wed, Jul 9, 2014 at 9:36 AM, Vladimir Davydov  wrote:
> Hi Tim,
>
> On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote:
>> How is this different from RLIMIT_AS?  You specifically mentioned it
>> earlier but you don't explain how this is different.
>
> The main difference is that RLIMIT_AS is per process while this
> controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for
> a group of unrelated or cooperating through shmem processes.
>
> Also RLIMIT_AS accounts for total VM usage (including file mappings),
> while this only charges private writable and shared mappings, whose
> faulted-in pages always occupy mem+swap and therefore cannot be just
> synced and dropped like file pages. In other words, this controller
> works exactly as the global overcommit control.
>
>> From my perspective, this is pointless.  There's plenty of perfectly
>> correct software that mmaps files without concern for VSIZE, because
>> they never fault most of those pages in.
>
> But there's also software that correctly handles ENOMEM returned by
> mmap. For example, mongodb keeps growing its buffers until mmap fails.
> Therefore, if there's no overcommit control, it will be OOM-killed
> sooner or later, which may be pretty annoying. And we did have customers
> complaining about that.

Is mongodb's buffer growth causing the oom kills?

If yes, I wonder if apps like mongodb that want ENOMEM should (1) use
MAP_POPULATE, and (2) whether we should change vm_mmap_pgoff() to
propagate mm_populate() ENOMEM failures back to mmap()?
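
For illustration, a minimal user-space sketch of (1), assuming the semantic
proposed in (2) where a populate failure is reported by mmap(); grow_buffer()
and the 64MB step size are invented for this example, and on today's kernels
the populate failure would be silently ignored:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  /* fault the buffer in up front so a failure is visible at mmap() time */
  static void *grow_buffer(size_t len)
  {
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
          if (p == MAP_FAILED) {
                  fprintf(stderr, "mmap(%zu): %s\n", len, strerror(errno));
                  return NULL;    /* stop growing instead of risking OOM kill */
          }
          return p;
  }

  int main(void)
  {
          size_t step = 64UL << 20;       /* grow in 64MB steps */

          /* keep growing until ENOMEM shows up or we give up */
          for (int i = 0; i < 1024 && grow_buffer(step); i++)
                  ;
          return 0;
  }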

>> From my observations it is not generally possible to predict an
>> average VSIZE limit that would satisfy your concerns *and* not kill
>> lots of valid apps.
>
> Yes, it's difficult. Actually, we can only guess. Nevertheless, we
> predict and set the VSIZE limit system-wide by default.
>
>> It sounds like what you want is to limit or even disable swap usage.
>
> I want to avoid OOM kill if it's possible to return ENOMEM. OOM can be
> painful. It can kill lots of innocent processes. Of course, the user can
> protect some processes by setting oom_score_adj, but this is difficult
> and requires time and expertise, so an average user won't do that.
>
>> Given your example, your hypothetical user would probably be better of
>> getting an OOM kill early so she can fix her job spec to request more
>> memory.
>
> In my example the user won't get OOM kill *early*...


Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim

2014-06-09 Thread Greg Thelen

On Fri, Jun 06 2014, Michal Hocko  wrote:

> Some users (e.g. Google) would like to have stronger semantic than low
> limit offers currently. The fallback mode is not desirable and they
> prefer hitting OOM killer rather than ignoring low limit for protected
> groups. There are other possible usecases which can benefit from hard
> guarantees. I can imagine workloads where setting low_limit to the same
> value as hard_limit to prevent from any reclaim at all makes a lot of
> sense because reclaim is much more disrupting than restart of the load.
>
> This patch adds a new per memcg memory.reclaim_strategy knob which
> tells what to do in a situation when memory reclaim cannot do any
> progress because all groups in the reclaimed hierarchy are within their
> low_limit. There are two options available:
>   - low_limit_best_effort - the current mode when reclaim falls
> back to the even reclaim of all groups in the reclaimed
> hierarchy
>   - low_limit_guarantee - groups within low_limit are never
> reclaimed and OOM killer is triggered instead. OOM message
> will mention the fact that the OOM was triggered due to
> low_limit reclaim protection.

To (a) be consistent with existing hard and soft limits APIs and (b)
allow use of both best effort and guarantee memory limits, I wonder if
it's best to offer three per memcg limits, rather than two limits (hard,
low_limit) and a related reclaim_strategy knob.  The three limits I'm
thinking about are:

1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
   change needed here.  This is an upper bound on a memcg hierarchy's
   memory consumption (assuming use_hierarchy=1).

2) best_effort_limit (aka desired working set).  This allows an
   application or administrator to provide a hint to the kernel about
   desired working set size.  Before oom'ing the kernel is allowed to
   reclaim below this limit.  I think the current soft_limit_in_bytes
   claims to provide this.  If we prefer to deprecate
   soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
   hopefully better named) API seems reasonable.

3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
   would prefer to be oom killed rather than operate below this
   threshold.  Default value is zero to preserve compatibility with
   existing apps.

Logically hard_limit >= best_effort_limit >= low_limit_guarantee.
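
To illustrate one possible reading of how reclaim would treat these three
limits, a small user-space sketch (struct memcg_limits, reclaimable_bytes()
and the "desperate" flag are invented names, not a proposed implementation):

  #include <stdbool.h>
  #include <stdio.h>

  struct memcg_limits {
          long hard_limit;          /* upper bound enforced at charge time */
          long best_effort_limit;   /* desired working set size */
          long low_limit_guarantee; /* floor: prefer OOM over going below it */
  };

  /* how much external reclaim may take from this memcg right now */
  static long reclaimable_bytes(const struct memcg_limits *l, long usage,
                                bool desperate)
  {
          long floor = desperate ? l->low_limit_guarantee
                                 : l->best_effort_limit;

          if (floor < l->low_limit_guarantee)
                  floor = l->low_limit_guarantee;
          return usage > floor ? usage - floor : 0;
  }

  int main(void)
  {
          /* hard_limit >= best_effort_limit >= low_limit_guarantee */
          struct memcg_limits l = { 1000, 700, 300 };

          printf("normal:    %ld\n", reclaimable_bytes(&l, 900, false)); /* 200 */
          printf("desperate: %ld\n", reclaimable_bytes(&l, 900, true));  /* 600 */
          printf("floor hit: %ld\n", reclaimable_bytes(&l, 250, true));  /* 0, OOM */
          return 0;
  }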


Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim

2014-06-10 Thread Greg Thelen

On Tue, Jun 10 2014, Johannes Weiner  wrote:

> On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote:
>> 
>> On Fri, Jun 06 2014, Michal Hocko  wrote:
>> 
>> > Some users (e.g. Google) would like to have stronger semantic than low
>> > limit offers currently. The fallback mode is not desirable and they
>> > prefer hitting OOM killer rather than ignoring low limit for protected
>> > groups. There are other possible usecases which can benefit from hard
>> > guarantees. I can imagine workloads where setting low_limit to the same
>> > value as hard_limit to prevent from any reclaim at all makes a lot of
>> > sense because reclaim is much more disrupting than restart of the load.
>> >
>> > This patch adds a new per memcg memory.reclaim_strategy knob which
>> > tells what to do in a situation when memory reclaim cannot do any
>> > progress because all groups in the reclaimed hierarchy are within their
>> > low_limit. There are two options available:
>> >- low_limit_best_effort - the current mode when reclaim falls
>> >  back to the even reclaim of all groups in the reclaimed
>> >  hierarchy
>> >- low_limit_guarantee - groups within low_limit are never
>> >  reclaimed and OOM killer is triggered instead. OOM message
>> >  will mention the fact that the OOM was triggered due to
>> >  low_limit reclaim protection.
>> 
>> To (a) be consistent with existing hard and soft limits APIs and (b)
>> allow use of both best effort and guarantee memory limits, I wonder if
>> it's best to offer three per memcg limits, rather than two limits (hard,
>> low_limit) and a related reclaim_strategy knob.  The three limits I'm
>> thinking about are:
>> 
>> 1) hard_limit (aka the existing limit_in_bytes cgroupfs file).  No
>>change needed here.  This is an upper bound on a memcg hierarchy's
>>memory consumption (assuming use_hierarchy=1).
>
> This creates internal pressure.  Outside reclaim is not affected by
> it, but internal charges can not exceed this limit.  This is set to
> hard limit the maximum memory consumption of a group (max).
>
>> 2) best_effort_limit (aka desired working set).  This allow an
>>application or administrator to provide a hint to the kernel about
>>desired working set size.  Before oom'ing the kernel is allowed to
>>reclaim below this limit.  I think the current soft_limit_in_bytes
>>claims to provide this.  If we prefer to deprecate
>>soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a
>>hopefully better named) API seems reasonable.
>
> This controls how external pressure applies to the group.
>
> But it's conceivable that we'd like to have the equivalent of such a
> soft limit for *internal* pressure.  Set below the hard limit, this
> internal soft limit would have charges trigger direct reclaim in the
> memcg but allow them to continue to the hard limit.  This would create
> a situation wherein the allocating tasks are not killed, but throttled
> under reclaim, which gives the administrator a window to detect the
> situation with vmpressure and possibly intervene.  Because as it
> stands, once the current hard limit is hit things can go down pretty
> fast and the window for reacting to vmpressure readings is often too
> small.  This would offer a more gradual deterioration.  It would be
> set to the upper end of the working set size range (high).
>
> I think for many users such an internal soft limit would actually be
> preferred over the current hard limit, as they'd rather have some
> reclaim throttling than an OOM kill when the group reaches its upper
> bound.  The current hard limit would be reserved for more advanced or
> paid cases, where the admin would rather see a memcg get OOM killed
> than exceed a certain size.
>
> Then, as you proposed, we'd have the soft limit for external pressure,
> where the kernel only reclaims groups within that limit in order to
> avoid OOM kills.  It would be set to the estimated lower end of the
> working set size range (low).
>
>> 3) low_limit_guarantee which is a lower bound of memory usage.  A memcg
>>would prefer to be oom killed rather than operate below this
>>threshold.  Default value is zero to preserve compatibility with
>>existing apps.
>
> And this would be the external pressure hard limit, which would be set
> to the absolute minimum requirement of the group (min).
>
> Either because it would be hopelessly thrashing without it, or because
> this guaranteed memory is actually paid for.  Again, I

Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg

2014-03-27 Thread Greg Thelen
On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov
 wrote:
> Hi Greg,
>
> On 03/27/2014 08:31 AM, Greg Thelen wrote:
>> On Wed, Mar 26 2014, Vladimir Davydov  wrote:
>>
>>> We don't track any random page allocation, so we shouldn't track kmalloc
>>> that falls back to the page allocator.
>> This seems like a change which will lead to confusing (and arguably
>> improper) kernel behavior.  I prefer the behavior prior to this patch.
>>
>> Before this change both of the following allocations are charged to
>> memcg (assuming kmem accounting is enabled):
>>  a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
>>  b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)
>>
>> After this change only 'a' is charged; 'b' goes directly to page
>> allocator which no longer does accounting.
>
> Why do we need to charge 'b' in the first place? Can the userspace
> trigger such allocations massively? If there can only be one or two such
> allocations from a cgroup, is there any point in charging them?

Off the top of my head I don't know of any >8KiB kmalloc()s so I can't
say if they're directly triggerable by user space en masse.  But we
recently ran into some order:3 allocations in networking.  The
networking allocations used a non-generic kmem_cache (rather than
kmalloc which started this discussion).  For details, see ed98df3361f0
("net: use __GFP_NORETRY for high order allocations").  I can't say if
such allocations exist in device drivers, but given the networking
example, it's conceivable that they may (or will) exist.

With slab this isn't a problem because slab has kmalloc kmem_caches for
all supported allocation sizes.  However, slub shows this issue for
any kmalloc() allocations larger than 8KiB (at least on x86_64).  It
seems like a strange direction to take kmem accounting to say that
kmalloc allocations are kmem limited, but only if they are either less
than a threshold size or done with slab.  Simply increasing the size
of a data structure doesn't seem like it should automatically cause
the memory to become exempt from kmem limits.

> In fact, do we actually need to charge every random kmem allocation? I
> guess not. For instance, filesystems often allocate data shared among
> all the FS users. It's wrong to charge such allocations to a particular
> memcg, IMO. That said the next step is going to be adding a per kmem
> cache flag specifying if allocations from this cache should be charged
> so that accounting will work only for those caches that are marked so
> explicitly.

It's a question of what direction to approach kmem slab accounting
from: either opt-out (as the code currently is), or opt-in (with per
kmem_cache flags as you suggest).  I agree that some structures end up
being shared (e.g. filesystem block bit map structures).  In an
opt-out system these are charged to a memcg initially and remain
charged there until the memcg is deleted at which point the shared
objects are reparented to a shared location.  While this isn't
perfect, it's unclear if it's better or worse than analyzing each
class of allocation and deciding if they should be opt'd-in.  One
could (though I'm not) make the case that even dentries are easily
shareable between containers and thus shouldn't be accounted to a
single memcg.  But given user space's ability to DoS a machine with
dentries, they should be accounted.

> There is one more argument for removing kmalloc_large accounting - we
> don't have an easy way to track such allocations, which prevents us from
> reparenting kmemcg charges on css offline. Of course, we could link
> kmalloc_large pages in some sort of per-memcg list which would allow us
> to find them on css offline, but I don't think such a complication is
> justified.

I assume that reparenting of such non kmem_cache allocations (e.g.
large kmalloc) is difficult because such pages refer to the memcg,
which we're trying to delete, and the memcg has no index of such pages.
If such zombie memcgs are undesirable, then an alternative to indexing
the pages is to define a kmem context object which such large pages
point to.  The kmem context would be reparented without needing to
adjust the individual large pages.  But there are plenty of options.
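
A user-space sketch of that last idea (struct kmem_context and the helpers
below are hypothetical; nothing like them exists in the kernel): large pages
would pin a small context object instead of the memcg itself, so reparenting
becomes a single pointer swap on the context rather than a walk of every page.

  #include <stdio.h>

  struct memcg { const char *name; };

  struct kmem_context {
          struct memcg *memcg;    /* where the charges currently live */
          long nr_pages;          /* outstanding large-kmalloc pages */
  };

  struct large_page { struct kmem_context *ctx; };

  /* charge a large allocation: the page records the context, not the memcg */
  static void charge_large(struct large_page *page, struct kmem_context *ctx,
                           long nr)
  {
          page->ctx = ctx;
          ctx->nr_pages += nr;
  }

  /* css_offline-style reparenting: no per-page walk is needed */
  static void reparent_context(struct kmem_context *ctx, struct memcg *parent)
  {
          ctx->memcg = parent;
  }

  int main(void)
  {
          struct memcg child = { "child" }, parent = { "parent" };
          struct kmem_context ctx = { .memcg = &child };
          struct large_page page;

          charge_large(&page, &ctx, 4);
          reparent_context(&ctx, &parent);
          printf("page now charged to %s (%ld pages)\n",
                 page.ctx->memcg->name, page.ctx->nr_pages);
          return 0;
  }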


Re: [PATCH v2 0/4] memcg: Low-limit reclaim

2014-05-28 Thread Greg Thelen

On Wed, May 28 2014, Johannes Weiner  wrote:

> On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote:
>> On Wed 28-05-14 09:49:05, Johannes Weiner wrote:
>> > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote:
>> > > Hi Andrew, Johannes,
>> > > 
>> > > On Mon 28-04-14 14:26:41, Michal Hocko wrote:
>> > > > This patchset introduces such low limit that is functionally similar
>> > > > to a minimum guarantee. Memcgs which are under their lowlimit are not
>> > > > considered eligible for the reclaim (both global and hardlimit) unless
>> > > > all groups under the reclaimed hierarchy are below the low limit when
>> > > > all of them are considered eligible.
>> > > > 
>> > > > The previous version of the patchset posted as a RFC
>> > > > (http://marc.info/?l=linux-mm&m=138677140628677&w=2) suggested a
>> > > > hard guarantee without any fallback. More discussions led me to
>> > > > reconsidering the default behavior and come up a more relaxed one. The
>> > > > hard requirement can be added later based on a use case which really
>> > > > requires. It would be controlled by memory.reclaim_flags knob which
>> > > > would specify whether to OOM or fallback (default) when all groups are
>> > > > bellow low limit.
>> > > 
>> > > It seems that we are not in a full agreement about the default behavior
>> > > yet. Johannes seems to be more for hard guarantee while I would like to
>> > > see the weaker approach first and move to the stronger model later.
>> > > Johannes, is this absolutely no-go for you? Do you think it is seriously
>> > > handicapping the semantic of the new knob?
>> > 
>> > Well we certainly can't start OOMing where we previously didn't,
>> > that's called a regression and automatically limits our options.
>> > 
>> > Any unexpected OOMs will be much more acceptable from a new feature
>> > than from configuration that previously "worked" and then stopped.
>> 
>> Yes and we are not talking about regressions, are we?
>> 
>> > > My main motivation for the weaker model is that it is hard to see all
>> > > the corner case right now and once we hit them I would like to see a
>> > > graceful fallback rather than fatal action like OOM killer. Besides that
>> > > the usaceses I am mostly interested in are OK with fallback when the
>> > > alternative would be OOM killer. I also feel that introducing a knob
>> > > with a weaker semantic which can be made stronger later is a sensible
>> > > way to go.
>> > 
>> > We can't make it stronger, but we can make it weaker. 
>> 
>> Why cannot we make it stronger by a knob/configuration option?
>
> Why can't we make it weaker by a knob?  Why should we design the
> default for unforeseeable cornercases rather than make the default
> make sense for existing cases and give cornercases a fallback once
> they show up?

My 2c...  The following works for my use cases:
1) introduce memory.low_limit_in_bytes (default=0 thus no default change
   from older kernels)
2) interested users will set low_limit_in_bytes to non-zero value.
   Memory protected by low limit should be as migratable/reclaimable as
   mlock memory.  If a zone full of mlock memory causes oom kills, then
   so should the low limit.

If we find corner cases where low_limit_in_bytes is too strict, then we
could discuss a new knob to relax it.  But I think we should start with
a strict low-limit.  If the oom killer gets tied in knots due to low
limit, then I'd like to explore fixing the oom killer before relaxing
low limit.

Disclaimer: new use cases will certainly appear with various
requirements.  But an oom-killing low_limit_in_bytes seems like a
generic opt-in feature, so I think it's worthwhile.


Re: [patch v3 2/6] mm, compaction: return failed migration target pages back to freelist

2014-05-07 Thread Greg Thelen

On Wed, May 07 2014, Andrew Morton  wrote:

> On Tue, 6 May 2014 19:22:43 -0700 (PDT) David Rientjes  
> wrote:
>
>> Memory compaction works by having a "freeing scanner" scan from one end of a 
>> zone which isolates pages as migration targets while another "migrating 
>> scanner" 
>> scans from the other end of the same zone which isolates pages for migration.
>> 
>> When page migration fails for an isolated page, the target page is returned 
>> to 
>> the system rather than the freelist built by the freeing scanner.  This may 
>> require the freeing scanner to continue scanning memory after suitable 
>> migration 
>> targets have already been returned to the system needlessly.
>> 
>> This patch returns destination pages to the freeing scanner freelist when 
>> page 
>> migration fails.  This prevents unnecessary work done by the freeing scanner 
>> but 
>> also encourages memory to be as compacted as possible at the end of the zone.
>> 
>> Reported-by: Greg Thelen 
>
> What did Greg actually report?  IOW, what if any observable problem is
> being fixed here?

I detected the problem at runtime seeing that ext4 metadata pages (esp
the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
constantly visited by compaction calls of migrate_pages().  These pages
had a non-zero b_count which caused fallback_migrate_page() ->
try_to_release_page() -> try_to_free_buffers() to fail.


Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-29 Thread Greg Thelen
On Wed, Dec 11 2013, Michal Hocko wrote:

> Hi,
> previous discussions have shown that soft limits cannot be reformed
> (http://lwn.net/Articles/555249/). This series introduces an alternative
> approach to protecting memory allocated to processes executing within
> a memory cgroup controller. It is based on a new tunable that was
> discussed with Johannes and Tejun held during the last kernel summit.
>
> This patchset introduces such low limit that is functionally similar to a
> minimum guarantee. Memcgs which are under their lowlimit are not considered
> eligible for the reclaim (both global and hardlimit). The default value of
> the limit is 0 so all groups are eligible by default and an interested
> party has to explicitly set the limit.
>
> The primary use case is to protect an amount of memory allocated to a
> workload without it being reclaimed by an unrelated activity. In some
> cases this requirement can be fulfilled by mlock but it is not suitable
> for many loads and generally requires application awareness. Such
> application awareness can be complex. It effectively forbids the
> use of memory overcommit as the application must explicitly manage
> memory residency.
> With low limits, such workloads can be placed in a memcg with a low
> limit that protects the estimated working set.
>
> Another use case might be unreclaimable groups. Some loads might be so
> sensitive to reclaim that it is better to kill and start it again (or
> since checkpoint) rather than trash. This would be trivial with low
> limit set to unlimited and the OOM killer will handle the situation as
> required (e.g. kill and restart).
>
> The hierarchical behavior of the lowlimit is described in the first
> patch. It is followed by a direct reclaim fix which is necessary to
> handle situation when a no group is eligible because all groups are
> below low limit. This is not a big deal for hardlimit reclaim because
> we simply retry the reclaim few times and then trigger memcg OOM killer
> path. It would blow up in the global case when we would loop without
> doing any progress or trigger OOM killer. I would consider configuration
> leading to this state invalid but we should handle that gracefully.
>
> The third patch finally allows setting the lowlimit.
>
> The last patch tries expedites OOM if it is clear that no group is
> eligible for reclaim. It basically breaks out of loops in the direct
> reclaim and lets kswapd sleep because it wouldn't do any progress anyway.
>
> Thoughts?
>
> Short log says:
> Michal Hocko (4):
>   memcg, mm: introduce lowlimit reclaim
>   mm, memcg: allow OOM if no memcg is eligible during direct reclaim
>   memcg: Allow setting low_limit
>   mm, memcg: expedite OOM if no memcg is reclaimable
>
> And a diffstat
>  include/linux/memcontrol.h  | 14 +++
>  include/linux/res_counter.h | 40 ++
>  kernel/res_counter.c|  2 ++
>  mm/memcontrol.c | 60 
> -
>  mm/vmscan.c | 59 +---
>  5 files changed, 170 insertions(+), 5 deletions(-)

The series looks useful.  We (Google) have been using something similar.
In practice such a low_limit (or memory guarantee) doesn't nest very
well.

Example:
  - parent_memcg: limit 500, low_limit 500, usage 500
1 privately charged non-reclaimable page (e.g. mlock, slab)
  - child_memcg: limit 500, low_limit 500, usage 499

If a streaming file cache workload (e.g. sha1sum) starts gobbling up
page cache it will lead to an oom kill instead of reclaiming.  One could
argue that this is working as intended because child_memcg was promised
500 but can only get 499.  So child_memcg is oom killed rather than
being forced to operate below its promised low limit.

This has led to various internal workarounds like:
- don't charge any memory to interior tree nodes (e.g. parent_memcg);
  only charge memory to cgroup leafs.  This gets tricky when dealing
  with reparented memory inherited to parent from child during cgroup
  deletion.
- don't set low_limit on non leafs (e.g. do not set low limit on
  parent_memcg).  This constrains the cgroup layout a bit.  Some
  customers want to purchase $MEM and setup their workload with a few
  child cgroups.  A system daemon hands out $MEM by setting low_limit
  for top-level containers (e.g. parent_memcg).  Thereafter such
  customers are able to partition their workload with sub memcg below
  child_memcg.  Example:
 parent_memcg
 \
  child_memcg
/ \
server   backup
  Thereafter customers often want some weak isolation between server and
  backup.  To avoid undesired oom kills the server/backup isolation is
  provided with a softer memory guarantee (e.g. soft_limit).  The soft
  limit acts like the low_limit until priority becomes desperate.

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-01-30 Thread Greg Thelen
On Thu, Jan 30 2014, Michal Hocko wrote:

> On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> [...]
>> The series looks useful.  We (Google) have been using something similar.
>> In practice such a low_limit (or memory guarantee), doesn't nest very
>> well.
>> 
>> Example:
>>   - parent_memcg: limit 500, low_limit 500, usage 500
>> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>>   - child_memcg: limit 500, low_limit 500, usage 499
>
> I am not sure this is a good example. Your setup basically say that no
> single page should be reclaimed. I can imagine this might be useful in
> some cases and I would like to allow it but it sounds too extreme (e.g.
> a load which would start trashing heavily once the reclaim starts and it
> makes more sense to start it again rather than crowl - think about some
> mathematical simulation which might diverge).

Pages will still be reclaimed when usage_in_bytes exceeds
limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
reclaim my memory due to external pressure, but internal pressure is
different.

>> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> page cache it will lead to an oom kill instead of reclaiming. 
>
> Does it make any sense to protect all of such memory although it is
> easily reclaimable?

I think protection makes sense in this case.  If I know my workload
needs 500 to operate well, then I reserve 500 using low_limit.  My app
doesn't want to run with less than its reservation.

>> One could argue that this is working as intended because child_memcg
>> was promised 500 but can only get 499.  So child_memcg is oom killed
>> rather than being forced to operate below its promised low limit.
>> 
>> This has led to various internal workarounds like:
>> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>>   only charge memory to cgroup leafs.  This gets tricky when dealing
>>   with reparented memory inherited to parent from child during cgroup
>>   deletion.
>
> Do those need any protection at all?

Interior tree nodes don't need protection from their children.  But
children and interior nodes need protection from siblings and parents.

>> - don't set low_limit on non leafs (e.g. do not set low limit on
>>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>>   customers want to purchase $MEM and setup their workload with a few
>>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>>   for top-level containers (e.g. parent_memcg).  Thereafter such
>>   customers are able to partition their workload with sub memcg below
>>   child_memcg.  Example:
>>  parent_memcg
>>  \
>>   child_memcg
>> / \
>> server   backup
>
> I think that the low_limit makes sense where you actually want to
> protect something from reclaim. And backup sounds like a bad fit for
> that.

The backup job would presumably have a small low_limit, but it may still
have a minimum working set required to make useful forward progress.

Example:
  parent_memcg
  \
   child_memcg limit 500, low_limit 500, usage 500
 / \
 |   backup   limit 10, low_limit 10, usage 10
 |
  server limit 490, low_limit 490, usage 490

One could argue that problems appear when
server.low_limit+backup.low_limit=child_memcg.limit.  So the safer
configuration is to leave some padding:
  server.low_limit + backup.low_limit + padding = child_memcg.limit
but this just defers the problem.  As memory is reparented into the
parent, the padding must grow.

>>   Thereafter customers often want some weak isolation between server and
>>   backup.  To avoid undesired oom kills the server/backup isolation is
>>   provided with a softer memory guarantee (e.g. soft_limit).  The soft
>>   limit acts like the low_limit until priority becomes desperate.
>
> Johannes was already suggesting that the low_limit should allow for a
> weaker semantic as well. I am not very much inclined to that but I can
> leave with a knob which would say oom_on_lowlimit (on by default but
> allowed to be set to 0). We would fallback to the full reclaim if
> no groups turn out to be reclaimable.

I like the strong semantic of your low_limit at least at level:1 cgroups
(direct children of root).  But I have also encountered situations where
a strict guarantee is too strict and a mere preference is desirable.
Perhaps the best plan is to continue with the proposed strict low_limit
and eventually provide an additional mechanism which provides weaker
guarantees (e.g. soft_limit or something else if soft_limit cannot be
altered).  These two would offer good support for a va

Re: [RFC 0/4] memcg: Low-limit reclaim

2014-02-03 Thread Greg Thelen
On Mon, Feb 03 2014, Michal Hocko wrote:

> On Thu 30-01-14 16:28:27, Greg Thelen wrote:
>> On Thu, Jan 30 2014, Michal Hocko wrote:
>> 
>> > On Wed 29-01-14 11:08:46, Greg Thelen wrote:
>> > [...]
>> >> The series looks useful.  We (Google) have been using something similar.
>> >> In practice such a low_limit (or memory guarantee), doesn't nest very
>> >> well.
>> >> 
>> >> Example:
>> >>   - parent_memcg: limit 500, low_limit 500, usage 500
>> >> 1 privately charged non-reclaimable page (e.g. mlock, slab)
>> >>   - child_memcg: limit 500, low_limit 500, usage 499
>> >
>> > I am not sure this is a good example. Your setup basically say that no
>> > single page should be reclaimed. I can imagine this might be useful in
>> > some cases and I would like to allow it but it sounds too extreme (e.g.
>> > a load which would start trashing heavily once the reclaim starts and it
>> > makes more sense to start it again rather than crowl - think about some
>> > mathematical simulation which might diverge).
>> 
>> Pages will still be reclaimed when usage_in_bytes exceeds
>> limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
>> reclaim my memory due to external pressure, but internal pressure is
>> different.
>
> That sounds strange and very confusing to me. What if the internal
> pressure comes from children memcgs? Lowlimit is intended for protecting
> a group from reclaim and it shouldn't matter whether the reclaim is a
> result of the internal or external pressure.
>
>> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> >> page cache it will lead to an oom kill instead of reclaiming. 
>> >
>> > Does it make any sense to protect all of such memory although it is
>> > easily reclaimable?
>> 
>> I think protection makes sense in this case.  If I know my workload
>> needs 500 to operate well, then I reserve 500 using low_limit.  My app
>> doesn't want to run with less than its reservation.
>> 
>> >> One could argue that this is working as intended because child_memcg
>> >> was promised 500 but can only get 499.  So child_memcg is oom killed
>> >> rather than being forced to operate below its promised low limit.
>> >> 
>> >> This has led to various internal workarounds like:
>> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>> >>   only charge memory to cgroup leafs.  This gets tricky when dealing
>> >>   with reparented memory inherited to parent from child during cgroup
>> >>   deletion.
>> >
>> > Do those need any protection at all?
>> 
>> Interior tree nodes don't need protection from their children.  But
>> children and interior nodes need protection from siblings and parents.
>
> Why? They contains only reparented pages in the above case. Those would
> be #1 candidate for reclaim in most cases, no?

I think we're on the same page.  My example interior node has reparented
pages and is a #1 candidate for reclaim induced from charges against
parent_memcg, but not a candidate for reclaim due to global memory
pressure induced by a sibling of parent_memcg.

>> >> - don't set low_limit on non leafs (e.g. do not set low limit on
>> >>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>> >>   customers want to purchase $MEM and setup their workload with a few
>> >>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>> >>   for top-level containers (e.g. parent_memcg).  Thereafter such
>> >>   customers are able to partition their workload with sub memcg below
>> >>   child_memcg.  Example:
>> >>  parent_memcg
>> >>  \
>> >>   child_memcg
>> >> / \
>> >> server   backup
>> >
>> > I think that the low_limit makes sense where you actually want to
>> > protect something from reclaim. And backup sounds like a bad fit for
>> > that.
>> 
>> The backup job would presumably have a small low_limit, but it may still
>> have a minimum working set required to make useful forward progress.
>> 
>> Example:
>>   parent_memcg
>>   \
>>child_memcg limit 500, low_limit 500, usage 500
>>  / \
>>  |   backup   limit 10, low_limit 10, usage 10
>>  |
>>   server limit 490, low_limit 490,

Re: RIP: mem_cgroup_move_account+0xf4/0x290

2013-10-26 Thread Greg Thelen
>> [ 7691.500464]  [] ? insert_kthread_work+0x40/0x40
>> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 
>> 00 00 41 8b 4c 24 18
>> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 
>> 01 00 00 44 89 c9
>> f7 d9 89 cf 65 48 01 7e
>> [ 7691.528638] RIP  [] mem_cgroup_move_account+0xf4/0x290
>>
>> Add the required __this_cpu_read().
>
> Sorry for my mistake and thanks for the fix up, it looks good to me.
>
> Reviewed-by: Sha Zhengju 
>
>
> Thanks,
> Sha
>>
>> Signed-off-by: Johannes Weiner 
>> ---
>>   mm/memcontrol.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4097a78..a4864b6 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct 
>> mem_cgroup *from,
>>   {
>>  /* Update stat data for mem_cgroup */
>>  preempt_disable();
>> -WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
>> +WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages);
>>  __this_cpu_add(from->stat->count[idx], -nr_pages);
>>  __this_cpu_add(to->stat->count[idx], nr_pages);
>>      preempt_enable();
>
>

I was just polishing up the following patch, which I think is better
because it should avoid spurious warnings.

---8<---

>From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001
From: Greg Thelen 
Date: Fri, 25 Oct 2013 21:59:57 -0700
Subject: [PATCH] memcg: remove incorrect underflow check

When a memcg is deleted mem_cgroup_reparent_charges() moves charged
memory to the parent memcg.  As of v3.11-9444-g3ea67d0 "memcg: add per
cgroup writeback pages accounting" there's a bad pointer read.  The goal
was to check for counter underflow.  The counter is a per cpu counter
and there are two problems with the code:
(1) the per cpu access function isn't used; instead a naked pointer is
used, which easily causes a panic.
(2) the check doesn't sum over all cpus

Test:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ echo 3 > /proc/sys/vm/drop_caches
  $ (echo $BASHPID >> x/tasks && exec cat) &
  [1] 7154
  $ grep ^mapped x/memory.stat
  mapped_file 53248
  $ echo 7154 > tasks
  $ rmdir x
  

The fix is to remove the check.  It's currently dangerous and isn't
worth fixing to use something expensive, such as
percpu_counter_sum(), for each reparented page.  __this_cpu_read()
isn't enough to fix this because there are no guarantees about the
current cpu's count.  The only guarantee is that the sum of all
per-cpu counters is >= nr_pages.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 34d3ca9..aa8185c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,6 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup 
*from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   WARN_ON_ONCE(from->stat->count[idx] < nr_pages);
__this_cpu_add(from->stat->count[idx], -nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
-- 
1.8.4.1
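
To illustrate the changelog's point that only the cross-cpu sum is meaningful,
here is a small user-space model (NR_CPUS, counter[] and counter_sum() are
made up for the example): a charge and its later removal can land on different
cpus, so one per-cpu slot may legitimately go negative while the sum stays
correct, and a warning keyed to a single slot fires spuriously.

  #include <stdio.h>

  #define NR_CPUS 2

  static long counter[NR_CPUS];

  static long counter_sum(void)
  {
          long sum = 0;

          for (int cpu = 0; cpu < NR_CPUS; cpu++)
                  sum += counter[cpu];
          return sum;
  }

  int main(void)
  {
          counter[0] += 16;       /* stat incremented while running on cpu 0 */
          counter[1] -= 16;       /* moved/uncharged later while on cpu 1 */

          /* a WARN on counter[1] < 16 would fire here... */
          printf("cpu1 slot = %ld (looks like underflow)\n", counter[1]);
          /* ...but the only real invariant is on the sum */
          printf("sum across cpus = %ld (>= 0 as required)\n", counter_sum());
          return 0;
  }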



[PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
this_cpu_sub() is implemented as negation and addition.

This patch casts the adjustment to the counter type before negation to
sign extend the adjustment.  This helps in cases where the counter
type is wider than an unsigned adjustment.  An alternative to this
patch is to declare such operations unsupported, but it seemed useful
to avoid surprises.

This patch specifically helps the following example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
  long_counter = 0 + 0xffffffff

Also apply the same cast to:
  __this_cpu_sub()
  this_cpu_sub_return()
  and __this_cpu_sub_return()

All percpu_test.ko passes, especially the following cases which
previously failed:

  l -= ui_one;
  __this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);

  l -= ui_one;
  this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);
  CHECK(l, long_counter, 0xffffffffffffffff);

  ul -= ui_one;
  __this_cpu_sub(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, -1);
  CHECK(ul, ulong_counter, 0xffffffffffffffff);

  ul = this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 2);

  ul = __this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 1);

Signed-off-by: Greg Thelen 
---
 arch/x86/include/asm/percpu.h | 3 ++-
 include/linux/percpu.h| 8 
 lib/percpu_test.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0da5200..b3e18f8 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -128,7 +128,8 @@ do {
\
 do {   \
typedef typeof(var) pao_T__;\
const int pao_ID__ = (__builtin_constant_p(val) &&  \
- ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ ((val) == 1 || (val) == -1)) ?\
+   (int)(val) : 0; \
if (0) {\
pao_T__ pao_tmp__;  \
pao_tmp__ = (val);  \
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index cc88172..c74088a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -332,7 +332,7 @@ do {
\
 #endif
 
 #ifndef this_cpu_sub
-# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val))
+# define this_cpu_sub(pcp, val)this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef this_cpu_inc
@@ -418,7 +418,7 @@ do {
\
 # define this_cpu_add_return(pcp, val) 
__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #endif
 
-#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, -(val))
+#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define this_cpu_inc_return(pcp)   this_cpu_add_return(pcp, 1)
 #define this_cpu_dec_return(pcp)   this_cpu_add_return(pcp, -1)
 
@@ -586,7 +586,7 @@ do {
\
 #endif
 
 #ifndef __this_cpu_sub
-# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), -(val))
+# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef __this_cpu_inc
@@ -668,7 +668,7 @@ do {
\
__pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
 #endif
 
-#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(val))
+#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1)
 #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1)
 
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
index 1ebeb44..8ab4231 100644
--- a/lib/percpu_test.c
+++ b/lib/percpu_test.c
@@ -118,7 +118,7 @@ static int __init percpu_test_init(void)
CHECK(ul, ulong_counter, 2);
 
ul = __this_cpu_sub_return(ulong_counter, ui_one);
-   CHECK(ul, ulong_counter, 0);
+   CHECK(ul, ulong_counter, 1);
 
preempt_enable();
 
-- 
1.8.4.1


[PATCH 1/3] percpu counter: test module

2013-10-27 Thread Greg Thelen
Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen 
---
 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 06344d9..cee589d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu counter test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build test modules with validates per-cpu
+ counter operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..bb016e1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
 
+obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
 obj-$(CONFIG_FONT_SUPPORT) += fonts/
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
new file mode 100644
index 000..1ebeb44
--- /dev/null
+++ b/lib/percpu_test.c
@@ -0,0 +1,138 @@
+#include 
+
+/* validate @native and @pcp counter values match @expected */
+#define CHECK(native, pcp, expected)\
+   do {\
+   WARN((native) != (expected),\
+"raw %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)(native), (long)(native),\
+(long)(expected), (long)(expected));   \
+   WARN(__this_cpu_read(pcp) != (expected),\
+"pcp %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \
+(long)(expected), (long)(expected));   \
+   } while (0)
+
+static DEFINE_PER_CPU(long, long_counter);
+static DEFINE_PER_CPU(unsigned long, ulong_counter);
+
+static int __init percpu_test_init(void)
+{
+   /*
+* volatile prevents compiler from optimizing it uses, otherwise the
+* +ul_one and -ul_one below would replace with inc/dec instructions.
+*/
+   volatile unsigned int ui_one = 1;
+   long l = 0;
+   unsigned long ul = 0;
+
+   pr_info("percpu test start\n");
+
+   preempt_disable();
+
+   l += -1;
+   __this_cpu_add(long_counter, -1);
+   CHECK(l, long_counter, -1);
+
+   l += 1;
+   __this_cpu_add(long_counter, 1);
+   CHECK(l, long_counter, 0);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += 1UL;
+   __this_cpu_add(ulong_counter, 1UL);
+   CHECK(ul, ulong_counter, 1);
+
+   ul += -1UL;
+   __this_cpu_add(ulong_counter, -1UL);
+   CHECK(ul, ulong_counter, 0);
+
+   ul += -(unsigned long)1;
+   __this_cpu_add(ulong_counter, -(unsigned long)1);
+   CHECK(ul, ulong_counter, -1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= 1;
+   __this_cpu_dec(ulong_counter);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+   CHECK(ul, ulong_counter, -1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0xffffffff);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   __this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+   CHECK(l, long_counter, 0xffffffffffffffff);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += ui_one;
+   __this_cpu_add(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, 1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= ui_one;
+   __this_cpu_sub(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, -1);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+
+   ul = 3;
+   __this_cpu_write(ulong_counter, 3);

[PATCH 0/3] fix unsigned pcp adjustments

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting"
memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic
values because the negated nr_pages is not sign extended (counter is long,
nr_pages is unsigned int).  The memcg fix is __this_cpu_sub(counter, nr_pages).
But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was
implemented as __this_cpu_add(counter, -nr_pages) which suffers the same
problem.  Example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value 0xffffffff,
rather than 0xffffffffffffffff.  This is because this_cpu_sub(pcp, delta) boils
down to:
  long_counter = 0 + 0xffffffff

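The sign extension issue is easy to reproduce outside the kernel.  Below is a
minimal standalone sketch of the same arithmetic (illustrative only, not part
of this series; it assumes a 64 bit long):

  #include <stdio.h>

  int main(void)
  {
          long counter = 0;
          unsigned int delta = 1;

          /* -delta is still an unsigned int (0xffffffff); it is zero
           * extended, not sign extended, when added to the long. */
          counter += -delta;
          printf("0x%lx\n", counter);     /* prints 0xffffffff */

          /* casting to the counter type first sign extends the adjustment */
          counter = 0;
          counter += -(long)delta;
          printf("%ld\n", counter);       /* prints -1 */

          return 0;
  }
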
Patch 1 creates a test module for percpu counters operations which demonstrates
the __this_cpu_sub() problems.  This patch is independent can be discarded if
there is no interest.

Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments.

Patch 3 uses __this_cpu_sub() in memcg.

An alternative smaller solution is for memcg to use:
  __this_cpu_add(counter, -(int)nr_pages)
admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments.  But
I felt like fixing the core services to prevent this in the future.

Greg Thelen (3):
  percpu counter: test module
  percpu counter: cast this_cpu_sub() adjustment
  memcg: use __this_cpu_sub to decrement stats

 arch/x86/include/asm/percpu.h |   3 +-
 include/linux/percpu.h|   8 +--
 lib/Kconfig.debug |   9 +++
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 mm/memcontrol.c   |   2 +-
 6 files changed, 156 insertions(+), 6 deletions(-)
 create mode 100644 lib/percpu_test.c

-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] memcg: use __this_cpu_sub to decrement stats

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages
accounting" memcg counter errors are possible when moving charged
memory to a different memcg.  Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memory.move_charge_at_immigrate=1, or memcg deletion.  An example
showing error after memory.force_empty:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 > tasks
  $ echo 1 > x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000            == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct.  So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg.  But the memory.force_empty and
memory.move_charge_at_immigrate=1 cases are larger problems as the
bogus counters are visible for the (possibly long) remaining life of
the source memcg.

The problem is due to memcg use of __this_cpu_add(.., -nr_pages),
which is subtly wrong because it subtracts the unsigned int nr_pages
(either -1 or -512 for THP) from a signed long percpu counter.  When
nr_pages=-1, -nr_pages=0xffffffff.  On 64 bit machines
stat->count[idx] is signed 64 bit.  So memcg's attempt to simply
decrement a count (e.g. from 1 to 0) boils down to:
  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its
negation.  This only works with the "percpu counter: cast
this_cpu_sub() adjustment" patch which fixes this_cpu_sub().
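
For reference, the casting fix in that patch boils down to the following
change to the __this_cpu_sub() family (simplified from its diff; shown here
only to illustrate why the subtraction then sign extends):

  /* old definition: __this_cpu_add((pcp), -(val)) zero extends an unsigned
   * val into a wider pcp; casting val to the counter type before negation
   * makes the adjustment sign extend. */
  #define __this_cpu_sub(pcp, val)	__this_cpu_add((pcp), -(typeof(pcp))(val))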

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8185c..b7ace0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup 
*from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   __this_cpu_add(from->stat->count[idx], -nr_pages);
+   __this_cpu_sub(from->stat->count[idx], nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
 }
-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
On Sun, Oct 27 2013, Tejun Heo wrote:

> On Sun, Oct 27, 2013 at 05:04:29AM -0700, Andrew Morton wrote:
>> On Sun, 27 Oct 2013 07:22:55 -0400 Tejun Heo  wrote:
>> 
>> > We probably want to cc stable for this and the next one.  How should
>> > these be routed?  I can take these through percpu tree or mm works
>> > too.  Either way, it'd be best to route them together.
>> 
>> Yes, all three look like -stable material to me.  I'll grab them later
>> in the week if you haven't ;)
>
> Tried to apply to percpu but the third one is a fix for a patch which
> was added to -mm during v3.12-rc1, so these are yours. :)

I don't object to stable for the first two non-memcg patches, but it's
probably unnecessary.  I should have made it more clear, but an audit of
v3.12-rc6 shows that only new memcg code is affected - the new
mem_cgroup_move_account_page_stat() is the only place where an unsigned
adjustment is used.  All other callers (e.g. shrink_dcache_sb) already
use a signed adjustment, so no problems before v3.12.  Though I did not
audit the stable kernel trees, so there could be something hiding in
there.

>> The names of the first two patches distress me.  They rather clearly
>> assert that the code affects percpu_counter.[ch], but that is not the case. 
>> Massaging is needed to fix that up.
>
> Yeah, something like the following would be better
>
>  percpu: add test module for various percpu operations
>  percpu: fix this_cpu_sub() subtrahend casting for unsigneds
>  memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend 
> casting

No objection to renaming.  Let me know if you want these reposted with
updated titles.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment

2013-10-27 Thread Greg Thelen
On Sun, Oct 27 2013, Greg Thelen wrote:

> this_cpu_sub() is implemented as negation and addition.
>
> This patch casts the adjustment to the counter type before negation to
> sign extend the adjustment.  This helps in cases where the counter
> type is wider than an unsigned adjustment.  An alternative to this
> patch is to declare such operations unsupported, but it seemed useful
> to avoid surprises.
>
> This patch specifically helps the following example:
>   unsigned int delta = 1
>   preempt_disable()
>   this_cpu_write(long_counter, 0)
>   this_cpu_sub(long_counter, delta)
>   preempt_enable()
>
> Before this change long_counter on a 64 bit machine ends with value
> 0xffffffff, rather than 0xffffffffffffffff.  This is because
> this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
> which is basically:
>   long_counter = 0 + 0xffffffff
>
> Also apply the same cast to:
>   __this_cpu_sub()
>   this_cpu_sub_return()
>   and __this_cpu_sub_return()
>
> All percpu_test.ko passes, especially the following cases which
> previously failed:
>
>   l -= ui_one;
>   __this_cpu_sub(long_counter, ui_one);
>   CHECK(l, long_counter, -1);
>
>   l -= ui_one;
>   this_cpu_sub(long_counter, ui_one);
>   CHECK(l, long_counter, -1);
>   CHECK(l, long_counter, 0xffffffffffffffff);
>
>   ul -= ui_one;
>   __this_cpu_sub(ulong_counter, ui_one);
>   CHECK(ul, ulong_counter, -1);
>   CHECK(ul, ulong_counter, 0xffffffffffffffff);
>
>   ul = this_cpu_sub_return(ulong_counter, ui_one);
>   CHECK(ul, ulong_counter, 2);
>
>   ul = __this_cpu_sub_return(ulong_counter, ui_one);
>   CHECK(ul, ulong_counter, 1);
>
> Signed-off-by: Greg Thelen 
> ---
>  arch/x86/include/asm/percpu.h | 3 ++-
>  include/linux/percpu.h| 8 
>  lib/percpu_test.c | 2 +-
>  3 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
> index 0da5200..b3e18f8 100644
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -128,7 +128,8 @@ do {  
> \
>  do { \
>   typedef typeof(var) pao_T__;\
>   const int pao_ID__ = (__builtin_constant_p(val) &&  \
> -   ((val) == 1 || (val) == -1)) ? (val) : 0; \
> +   ((val) == 1 || (val) == -1)) ?\
> + (int)(val) : 0; \
>   if (0) {\
>   pao_T__ pao_tmp__;  \
>   pao_tmp__ = (val);  \
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index cc88172..c74088a 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -332,7 +332,7 @@ do {  
> \
>  #endif
>  
>  #ifndef this_cpu_sub
> -# define this_cpu_sub(pcp, val)  this_cpu_add((pcp), -(val))
> +# define this_cpu_sub(pcp, val)  this_cpu_add((pcp), 
> -(typeof(pcp))(val))
>  #endif
>  
>  #ifndef this_cpu_inc
> @@ -418,7 +418,7 @@ do {  
> \
>  # define this_cpu_add_return(pcp, val)   
> __pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
>  #endif
>  
> -#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(val))
> +#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, 
> -(typeof(pcp))(val))
>  #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1)
>  #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1)
>  
> @@ -586,7 +586,7 @@ do {  
> \
>  #endif
>  
>  #ifndef __this_cpu_sub
> -# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(val))
> +# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), 
> -(typeof(pcp))(val))
>  #endif
>  
>  #ifndef __this_cpu_inc
> @@ -668,7 +668,7 @@ do {  
> \
>   __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
>  #endif
>  
> -#define __this_cpu_sub_return(pcp, val)  __this_cpu_add_return(pcp, 
> -(val))
> +#define __this_cpu_sub_return(pcp, val)  __this_cpu_add_return(pcp, 
> -(typeof(pcp))(val))
>  #define __this_cpu_inc_return(pcp)   __this_cpu_add_return(pcp, 1)
>  #define __this_cpu_dec_return(pcp)   __this_cpu_add_return(pcp, -1)

[PATCH v2 0/3] fix unsigned pcp adjustments

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting"
memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic
values because the negated nr_pages is not sign extended (counter is long,
nr_pages is unsigned int).  The memcg fix is __this_cpu_sub(counter, nr_pages).
But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was
implemented as __this_cpu_add(counter, -nr_pages) which suffers the same
problem.  Example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value 0xffffffff,
rather than 0xffffffffffffffff.  This is because this_cpu_sub(pcp, delta) boils
down to:
  long_counter = 0 + 0xffffffff

v3.12-rc6 shows that only new memcg code is affected by this problem - the new
mem_cgroup_move_account_page_stat() is the only place where an unsigned
adjustment is used.  All other callers (e.g. shrink_dcache_sb) already use a
signed adjustment, so no problems before v3.12.  Though I did not audit the
stable kernel trees, so there could be something hiding in there.

Patch 1 creates a test module for percpu operations which demonstrates the
__this_cpu_sub() problems.  This patch is independent and can be discarded if there
is no interest.

Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments.

Patch 3 uses __this_cpu_sub() in memcg.

An alternative smaller solution is for memcg to use:
  __this_cpu_add(counter, -(int)nr_pages)
admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments.  But
I felt like fixing the core services to prevent this in the future.

Changes from V1:
- more accurate patch titles, patch logs, and test module description now
  referring to per cpu operations rather than per cpu counters.
- move small test code update from patch 2 to patch 1 (where the test is
  introduced).

Greg Thelen (3):
  percpu: add test module for various percpu operations
  percpu: fix this_cpu_sub() subtrahend casting for unsigneds
  memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend
casting

 arch/x86/include/asm/percpu.h |   3 +-
 include/linux/percpu.h|   8 +--
 lib/Kconfig.debug |   9 +++
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 mm/memcontrol.c   |   2 +-
 6 files changed, 156 insertions(+), 6 deletions(-)
 create mode 100644 lib/percpu_test.c

-- 
1.8.4.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 2/3] percpu: fix this_cpu_sub() subtrahend casting for unsigneds

2013-10-27 Thread Greg Thelen
this_cpu_sub() is implemented as negation and addition.

This patch casts the adjustment to the counter type before negation to
sign extend the adjustment.  This helps in cases where the counter
type is wider than an unsigned adjustment.  An alternative to this
patch is to declare such operations unsupported, but it seemed useful
to avoid surprises.

This patch specifically helps the following example:
  unsigned int delta = 1
  preempt_disable()
  this_cpu_write(long_counter, 0)
  this_cpu_sub(long_counter, delta)
  preempt_enable()

Before this change long_counter on a 64 bit machine ends with value
0xffffffff, rather than 0xffffffffffffffff.  This is because
this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta),
which is basically:
  long_counter = 0 + 0xffffffff

Also apply the same cast to:
  __this_cpu_sub()
  __this_cpu_sub_return()
  this_cpu_sub_return()

All percpu_test.ko passes, especially the following cases which
previously failed:

  l -= ui_one;
  __this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);

  l -= ui_one;
  this_cpu_sub(long_counter, ui_one);
  CHECK(l, long_counter, -1);
  CHECK(l, long_counter, 0xffffffffffffffff);

  ul -= ui_one;
  __this_cpu_sub(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, -1);
  CHECK(ul, ulong_counter, 0xffffffffffffffff);

  ul = this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 2);

  ul = __this_cpu_sub_return(ulong_counter, ui_one);
  CHECK(ul, ulong_counter, 1);

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
 arch/x86/include/asm/percpu.h | 3 ++-
 include/linux/percpu.h| 8 
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0da5200..b3e18f8 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -128,7 +128,8 @@ do {
\
 do {   \
typedef typeof(var) pao_T__;\
const int pao_ID__ = (__builtin_constant_p(val) &&  \
- ((val) == 1 || (val) == -1)) ? (val) : 0; \
+ ((val) == 1 || (val) == -1)) ?\
+   (int)(val) : 0; \
if (0) {\
pao_T__ pao_tmp__;  \
pao_tmp__ = (val);  \
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index cc88172..c74088a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -332,7 +332,7 @@ do {
\
 #endif
 
 #ifndef this_cpu_sub
-# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val))
+# define this_cpu_sub(pcp, val)this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef this_cpu_inc
@@ -418,7 +418,7 @@ do {
\
 # define this_cpu_add_return(pcp, val) 
__pcpu_size_call_return2(this_cpu_add_return_, pcp, val)
 #endif
 
-#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, -(val))
+#define this_cpu_sub_return(pcp, val)  this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define this_cpu_inc_return(pcp)   this_cpu_add_return(pcp, 1)
 #define this_cpu_dec_return(pcp)   this_cpu_add_return(pcp, -1)
 
@@ -586,7 +586,7 @@ do {
\
 #endif
 
 #ifndef __this_cpu_sub
-# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), -(val))
+# define __this_cpu_sub(pcp, val)  __this_cpu_add((pcp), 
-(typeof(pcp))(val))
 #endif
 
 #ifndef __this_cpu_inc
@@ -668,7 +668,7 @@ do {
\
__pcpu_size_call_return2(__this_cpu_add_return_, pcp, val)
 #endif
 
-#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(val))
+#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, 
-(typeof(pcp))(val))
 #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1)
 #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1)
 
-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/3] percpu: add test module for various percpu operations

2013-10-27 Thread Greg Thelen
Tests various percpu operations.

Enable with CONFIG_PERCPU_TEST=m.

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
 lib/Kconfig.debug |   9 
 lib/Makefile  |   2 +
 lib/percpu_test.c | 138 ++
 3 files changed, 149 insertions(+)
 create mode 100644 lib/percpu_test.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 06344d9..9fdb452 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST
help
  A benchmark measuring the performance of the interval tree library
 
+config PERCPU_TEST
+   tristate "Per cpu operations test"
+   depends on m && DEBUG_KERNEL
+   help
+ Enable this option to build a test module which validates per-cpu
+ operations.
+
+ If unsure, say N.
+
 config ATOMIC64_SELFTEST
bool "Perform an atomic64_t self-test at boot"
help
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..bb016e1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
 interval_tree_test-objs := interval_tree_test_main.o interval_tree.o
 
+obj-$(CONFIG_PERCPU_TEST) += percpu_test.o
+
 obj-$(CONFIG_ASN1) += asn1_decoder.o
 
 obj-$(CONFIG_FONT_SUPPORT) += fonts/
diff --git a/lib/percpu_test.c b/lib/percpu_test.c
new file mode 100644
index 000..fcca49e
--- /dev/null
+++ b/lib/percpu_test.c
@@ -0,0 +1,138 @@
+#include 
+
+/* validate @native and @pcp counter values match @expected */
+#define CHECK(native, pcp, expected)\
+   do {\
+   WARN((native) != (expected),\
+"raw %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)(native), (long)(native),\
+(long)(expected), (long)(expected));   \
+   WARN(__this_cpu_read(pcp) != (expected),\
+"pcp %ld (0x%lx) != expected %ld (0x%lx)", \
+(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \
+(long)(expected), (long)(expected));   \
+   } while (0)
+
+static DEFINE_PER_CPU(long, long_counter);
+static DEFINE_PER_CPU(unsigned long, ulong_counter);
+
+static int __init percpu_test_init(void)
+{
+   /*
+* volatile prevents the compiler from optimizing its uses, otherwise the
+* +ui_one and -ui_one below would be replaced with inc/dec instructions.
+*/
+   volatile unsigned int ui_one = 1;
+   long l = 0;
+   unsigned long ul = 0;
+
+   pr_info("percpu test start\n");
+
+   preempt_disable();
+
+   l += -1;
+   __this_cpu_add(long_counter, -1);
+   CHECK(l, long_counter, -1);
+
+   l += 1;
+   __this_cpu_add(long_counter, 1);
+   CHECK(l, long_counter, 0);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += 1UL;
+   __this_cpu_add(ulong_counter, 1UL);
+   CHECK(ul, ulong_counter, 1);
+
+   ul += -1UL;
+   __this_cpu_add(ulong_counter, -1UL);
+   CHECK(ul, ulong_counter, 0);
+
+   ul += -(unsigned long)1;
+   __this_cpu_add(ulong_counter, -(unsigned long)1);
+   CHECK(ul, ulong_counter, -1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= 1;
+   __this_cpu_dec(ulong_counter);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+   CHECK(ul, ulong_counter, -1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0xffffffff);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   __this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l += ui_one;
+   __this_cpu_add(long_counter, ui_one);
+   CHECK(l, long_counter, 1);
+
+   l += -ui_one;
+   __this_cpu_add(long_counter, -ui_one);
+   CHECK(l, long_counter, 0x100000000);
+
+   l = 0;
+   __this_cpu_write(long_counter, 0);
+
+   l -= ui_one;
+   this_cpu_sub(long_counter, ui_one);
+   CHECK(l, long_counter, -1);
+   CHECK(l, long_counter, 0xffffffffffffffff);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul += ui_one;
+   __this_cpu_add(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, 1);
+
+   ul = 0;
+   __this_cpu_write(ulong_counter, 0);
+
+   ul -= ui_one;
+   __this_cpu_sub(ulong_counter, ui_one);
+   CHECK(ul, ulong_counter, -1);
+   CHECK(ul, ulong_counter, 0xffffffffffffffff);
+
+   ul = 3;
+   __this_cpu_write(ulong_counter, 3);

[PATCH v2 3/3] memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting

2013-10-27 Thread Greg Thelen
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages
accounting" memcg counter errors are possible when moving charged
memory to a different memcg.  Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memory.move_charge_at_immigrate=1, or memcg deletion.  An example
showing error after memory.force_empty:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 > tasks
  $ echo 1 > x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000            == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct.  So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg.  But the memory.force_empty and
memory.move_charge_at_immigrate=1 cases are larger problems as the
bogus counters are visible for the (possibly long) remaining life of
the source memcg.

The problem is due to memcg use of __this_cpu_add(.., -nr_pages),
which is subtly wrong because it subtracts the unsigned int nr_pages
(either -1 or -512 for THP) from a signed long percpu counter.  When
nr_pages=-1, -nr_pages=0xffffffff.  On 64 bit machines
stat->count[idx] is signed 64 bit.  So memcg's attempt to simply
decrement a count (e.g. from 1 to 0) boils down to:
  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its
negation.  This only works once "percpu: fix this_cpu_sub() subtrahend
casting for unsigneds" is applied to fix this_cpu_sub().

Signed-off-by: Greg Thelen 
Acked-by: Tejun Heo 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8185c..b7ace0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup 
*from,
 {
/* Update stat data for mem_cgroup */
preempt_disable();
-   __this_cpu_add(from->stat->count[idx], -nr_pages);
+   __this_cpu_sub(from->stat->count[idx], nr_pages);
__this_cpu_add(to->stat->count[idx], nr_pages);
preempt_enable();
 }
-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races

2013-12-17 Thread Greg Thelen
On Tue, Dec 17 2013, Rafael Aquini wrote:

> After the locking semantics for the SysV IPC API got improved, a couple of
> IPC_RMID race windows were opened because we ended up dropping the
> 'kern_ipc_perm.deleted' check performed way down in ipc_lock().
> The spotted races got sorted out by re-introducing the old test within
> the racy critical sections.
>
> This patch introduces ipc_valid_object() to consolidate the way we cope with
> IPC_RMID races by using the same abstraction across the API implementation.
>
> Signed-off-by: Rafael Aquini 

Acked-by: Greg Thelen 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ipc,shm: fix shm_file deletion races

2013-11-18 Thread Greg Thelen
When IPC_RMID races with other shm operations there's potential for
use-after-free of the shm object's associated file (shm_file).

Here's the race before this patch:
  TASK 1 TASK 2
  -- --
  shm_rmid()
ipc_lock_object()
 shmctl()
 shp = shm_obtain_object_check()

shm_destroy()
  shm_unlock()
  fput(shp->shm_file)
 ipc_lock_object()
 shmem_lock(shp->shm_file)
 

The oops is caused because shm_destroy() calls fput() after dropping the
ipc_lock.  fput() clears the file's f_inode, f_path.dentry, and
f_path.mnt, which causes various NULL pointer references in task 2.  I
reliably see the oops in task 2 when shmlock, shmunlock, or shmat races
with IPC_RMID, as in the example workloads below.

This patch fixes the races by:
1) setting shm_file=NULL in shm_destroy() while holding ipc_object_lock().
2) modifying at-risk operations to check shm_file while holding
   ipc_object_lock().

Example workloads, which each trigger oops...

Workload 1:
  while true; do
id=$(shmget 1 4096)
shm_rmid $id &
shmlock $id &
wait
  done

  The oops stack shows accessing NULL f_inode due to racing fput:
_raw_spin_lock
shmem_lock
SyS_shmctl

Workload 2:
  while true; do
id=$(shmget 1 4096)
shmat $id 4096 &
shm_rmid $id &
wait
  done

  The oops stack is similar to workload 1 due to NULL f_inode:
touch_atime
shmem_mmap
shm_mmap
mmap_region
do_mmap_pgoff
do_shmat
SyS_shmat

Workload 3:
  while true; do
id=$(shmget 1 4096)
shmlock $id
shm_rmid $id &
shmunlock $id &
wait
  done

  The oops stack shows the second fput tripping on a NULL f_inode.  The
  first fput() completed from shm_destroy(), but a racing thread did
  a get_file() and queued this fput():
locks_remove_flock
__fput
fput
task_work_run
do_notify_resume
int_signal
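
The shmget, shm_rmid, shmlock, shmunlock and shmat commands above appear to be
small local wrappers around the corresponding SysV IPC calls.  As an example,
a minimal sketch of a "shmget <key> <size>" wrapper (assumed here; not the
exact tool used) is:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>

  /* create (or look up) a segment and print its id for the other helpers */
  int main(int argc, char **argv)
  {
          key_t key;
          size_t size;
          int id;

          if (argc != 3) {
                  fprintf(stderr, "usage: %s <key> <size>\n", argv[0]);
                  return 1;
          }
          key = atoi(argv[1]);
          size = strtoul(argv[2], NULL, 0);
          id = shmget(key, size, IPC_CREAT | 0600);
          if (id < 0) {
                  perror("shmget");
                  return 1;
          }
          printf("%d\n", id);
          return 0;
  }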

Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat")
Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl")
Signed-off-by: Greg Thelen 
Cc:   # 3.10.17+ 3.11.6+
---
 ipc/shm.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index d69739610fd4..0bdf21c6814e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -208,15 +208,18 @@ static void shm_open(struct vm_area_struct *vma)
  */
 static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 {
+   struct file *shm_file;
+
+   shm_file = shp->shm_file;
+   shp->shm_file = NULL;
ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
shm_rmid(ns, shp);
shm_unlock(shp);
-   if (!is_file_hugepages(shp->shm_file))
-   shmem_lock(shp->shm_file, 0, shp->mlock_user);
+   if (!is_file_hugepages(shm_file))
+   shmem_lock(shm_file, 0, shp->mlock_user);
else if (shp->mlock_user)
-   user_shm_unlock(file_inode(shp->shm_file)->i_size,
-   shp->mlock_user);
-   fput (shp->shm_file);
+   user_shm_unlock(file_inode(shm_file)->i_size, shp->mlock_user);
+   fput(shm_file);
ipc_rcu_putref(shp, shm_rcu_free);
 }
 
@@ -983,6 +986,13 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct 
shmid_ds __user *, buf)
}
 
shm_file = shp->shm_file;
+
+   /* check if shm_destroy() is tearing down shp */
+   if (shm_file == NULL) {
+   err = -EIDRM;
+   goto out_unlock0;
+   }
+
if (is_file_hugepages(shm_file))
goto out_unlock0;
 
@@ -1101,6 +1111,14 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
goto out_unlock;
 
ipc_lock_object(&shp->shm_perm);
+
+   /* check if shm_destroy() is tearing down shp */
+   if (shp->shm_file == NULL) {
+   ipc_unlock_object(&shp->shm_perm);
+   err = -EIDRM;
+   goto out_unlock;
+   }
+
path = shp->shm_file->f_path;
path_get(&path);
shp->shm_nattch++;
-- 
1.8.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests

2014-08-07 Thread Greg Thelen

On Thu, Aug 07 2014, Johannes Weiner wrote:

> On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote:
>> On Mon 04-08-14 17:14:54, Johannes Weiner wrote:
>> > Instead of passing the request size to direct reclaim, memcg just
>> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the
>> > charge can succeed.  That potentially wastes scan progress when huge
>> > page allocations require multiple invocations, which always have to
>> > restart from the default scan priority.
>> > 
>> > Pass the request size as a reclaim target to direct reclaim and leave
>> > it to that code to reach the goal.
>> 
>> THP charge then will ask for 512 pages to be (direct) reclaimed. That
>> is _a lot_ and I would expect long stalls to achieve this target. I
>> would also expect quick priority drop down and potential over-reclaim
>> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages
>> would need to drop down below DEF_PRIORITY-2 to have a chance to scan
>> that many pages). All that done for a charge which can fallback to a
>> single page charge.
>> 
>> The current code is quite hostile to THP when we are close to the limit
>> but solving this by introducing long stalls instead doesn't sound like a
>> proper approach to me.
>
> THP latencies are actually the same when comparing high limit nr_pages
> reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although
> system time is reduced with the high limit.
>
> High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but
> it doesn't actually contain the workload - with 1G high and a 4G load,
> the consumption at the end of the run is 3.7G.
>
> So what I'm proposing works and is of equal quality from a THP POV.
> This change is complicated enough when we stick to the facts, let's
> not make up things based on gut feeling.

I think that high order non THP page allocations also benefit from this.
Such allocations don't have a small page fallback.

This may be in flux, but linux-next shows me that:
* mem_cgroup_reclaim()
  frees at least SWAP_CLUSTER_MAX (32) pages.
* try_charge() calls mem_cgroup_reclaim() indefinitely for
  costly (3) or smaller orders assuming that something is reclaimed on
  each iteration.
* try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for
  larger-than-costly orders.

So for larger-than-costly allocations, try_charge() should be able to
reclaim 160 (5*32) pages which satisfies an order:7 allocation.  But for
order:8+ allocations try_charge() and mem_cgroup_reclaim() are too eager
to give up without something like this.  So I think this patch is a step
in the right direction.

Coincidentally, we've recently been experimenting with something
like this.  Though we didn't modify the interface between
mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we
looped within mem_cgroup_reclaim() until nr_pages of margin were found.
But I have no objection the proposed plumbing of nr_pages all the way
into try_to_free_mem_cgroup_pages().
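
Roughly, the loop we experimented with looks like the sketch below.  This is
hypothetical and simplified (the function names follow the linux-next code
referenced above, but the signatures and details are illustrative, not a
posted patch):

  /*
   * Hypothetical sketch: rather than passing nr_pages down to
   * try_to_free_mem_cgroup_pages(), keep invoking reclaim until the memcg
   * has at least nr_pages of free margin or nothing more can be reclaimed.
   */
  static bool mem_cgroup_reclaim_margin(struct mem_cgroup *memcg,
                                        gfp_t gfp_mask,
                                        unsigned long nr_pages)
  {
          int retries = MEM_CGROUP_RECLAIM_RETRIES;

          while (retries--) {
                  if (mem_cgroup_margin(memcg) >= nr_pages)
                          return true;
                  if (!try_to_free_mem_cgroup_pages(memcg, gfp_mask, false))
                          break;  /* nothing reclaimed; give up */
          }
          return mem_cgroup_margin(memcg) >= nr_pages;
  }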
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

2015-08-05 Thread Greg Thelen

mho...@kernel.org wrote:

> From: Michal Hocko 
>
> Journal transaction might fail prematurely because the frozen_buffer
> is allocated by GFP_NOFS request:
> [   72.440013] do_get_write_access: OOM for frozen_buffer
> [   72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: 
> Out of memory in __ext4_journal_get_write_access
> [   72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: 
> Out of memory
> (...snipped)
> [   72.495559] do_get_write_access: OOM for frozen_buffer
> [   72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: 
> Out of memory in __ext4_journal_get_write_access
> [   72.496839] do_get_write_access: OOM for frozen_buffer
> [   72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: 
> Out of memory in __ext4_journal_get_write_access
> [   72.505766] Aborting journal on device sda1-8.
> [   72.505851] EXT4-fs (sda1): Remounting filesystem read-only
>
> This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
> allocations upon OOM" because small GPF_NOFS allocations never failed.
> This allocation seems essential for the journal and GFP_NOFS is too
> restrictive to the memory allocator so let's use __GFP_NOFAIL here to
> emulate the previous behavior.
>
> jbd code has the very same issue so let's do the same there as well.
>
> Signed-off-by: Michal Hocko 
> ---
>  fs/jbd/transaction.c  | 11 +--
>  fs/jbd2/transaction.c | 14 +++---
>  2 files changed, 4 insertions(+), 21 deletions(-)
>
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..bf7474deda2f 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct 
> journal_head *jh,
>   jbd_unlock_bh_state(bh);
>   frozen_buffer =
>   jbd_alloc(jh2bh(jh)->b_size,
> -  GFP_NOFS);
> - if (!frozen_buffer) {
> - printk(KERN_ERR
> -"%s: OOM for frozen_buffer\n",
> -__func__);
> - JBUFFER_TRACE(jh, "oom!");
> - error = -ENOMEM;
> - jbd_lock_bh_state(bh);
> - goto done;
> - }
> +  GFP_NOFS|__GFP_NOFAIL);
>   goto repeat;
>   }
>   jh->b_frozen_data = frozen_buffer;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index ff2f2e6ad311..bff071e21553 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct 
> journal_head *jh,
>   jbd_unlock_bh_state(bh);
>   frozen_buffer =
>   jbd2_alloc(jh2bh(jh)->b_size,
> -  GFP_NOFS);
> - if (!frozen_buffer) {
> - printk(KERN_ERR
> -"%s: OOM for frozen_buffer\n",
> -__func__);
> - JBUFFER_TRACE(jh, "oom!");
> - error = -ENOMEM;
> - jbd_lock_bh_state(bh);
> - goto done;
> - }
> +  GFP_NOFS|__GFP_NOFAIL);
>   goto repeat;
>   }
>   jh->b_frozen_data = frozen_buffer;
> @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, 
> struct buffer_head *bh)
>  
>  repeat:
>   if (!jh->b_committed_data) {
> - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd2_alloc(jh2bh(jh)->b_size,
> + GFP_NOFS|__GFP_NOFAIL);
>   if (!committed_data) {
>   printk(KERN_ERR "%s: No memory for committed data\n",
>   __func__);

Is this "if (!committed_data) {" check now dead code?

I also see other similar suspected dead sites in the rest of the series.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] memcg: fix stale mem_cgroup_force_empty() comment

2016-05-06 Thread Greg Thelen
commit f61c42a7d911 ("memcg: remove tasks/children test from
mem_cgroup_force_empty()") removed memory reparenting from the function.

Fix the function's comment.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe787f5c41bd..19fd76168a05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2636,8 +2636,7 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
 }
 
 /*
- * Reclaims as many pages from the given memcg as possible and moves
- * the rest to the parent.
+ * Reclaims as many pages from the given memcg as possible.
  *
  * Caller is responsible for holding css reference for memcg.
  */
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd

2014-11-17 Thread Greg Thelen
On Fri, Oct 31 2014, Junjie Mao wrote:

> When choosing a random address, the current implementation does not take into
> account the reversed space for .bss and .brk sections. Thus the relocated 
> kernel
> may overlap other components in memory. Here is an example of the overlap 
> from a
> x86_64 kernel in qemu (the ranges of physical addresses are presented):
>
>  Physical Address
>
> 0x0fe00000                  --+--------------------+  <-- randomized base
>                            /  |  relocated kernel  |
>                  vmlinux.bin  | (from vmlinux.bin) |
> 0x1336d000    (an ELF file)   +--------------------+--
>                            \  |                    |  \
> 0x1376d870                  --+--------------------+   |
>                               |    relocs table    |   |
> 0x13c1c2a8                    +--------------------+   .bss and .brk
>                               |                    |   |
> 0x13ce6000                    +--------------------+   |
>                               |                    |  /
> 0x13f77000                    |       initrd       |--
>                               |                    |
> 0x13fef374                    +--------------------+
>
> The initrd image will then be overwritten by the memset during early
> initialization:
>
> [1.655204] Unpacking initramfs...
> [1.662831] Initramfs unpacking failed: junk in compressed archive
>
> This patch prevents the above situation by requiring a larger space when 
> looking
> for a random kernel base, so that existing logic can effectively avoids the
> overlap.
>
> Fixes: 82fa9637a2 ("x86, kaslr: Select random position from e820 maps")
> Reported-by: Fengguang Wu 
> Signed-off-by: Junjie Mao 
> [kees: switched to perl to avoid hex translation pain in mawk vs gawk]
> [kees: calculated overlap without relocs table]
> Signed-off-by: Kees Cook 
> Cc: sta...@vger.kernel.org
> ---
> This version updates the commit log only.
>
> Kees, please help review the documentation. Thanks!
>
> Best Regards
> Junjie Mao
[...]
> diff --git a/arch/x86/tools/calc_run_size.pl b/arch/x86/tools/calc_run_size.pl
> new file mode 100644
> index ..0b0b124d3ece
> --- /dev/null
> +++ b/arch/x86/tools/calc_run_size.pl
> @@ -0,0 +1,30 @@
> +#!/usr/bin/perl
> +#
> +# Calculate the amount of space needed to run the kernel, including room for
> +# the .bss and .brk sections.
> +#
> +# Usage:
> +# objdump -h a.out | perl calc_run_size.pl
> +use strict;
> +
> +my $mem_size = 0;
> +my $file_offset = 0;
> +
> +my $sections=" *[0-9]+ \.(?:bss|brk) +";
> +while (<>) {
> + if (/^$sections([0-9a-f]+) +(?:[0-9a-f]+ +){2}([0-9a-f]+)/) {
> + my $size = hex($1);
> + my $offset = hex($2);
> + $mem_size += $size;
> + if ($file_offset == 0) {
> + $file_offset = $offset;
> + } elsif ($file_offset != $offset) {
> + die ".bss and .brk lack common file offset\n";
> + }
> + }
> +}
> +
> +if ($file_offset == 0) {
> + die "Never found .bss or .brk file offset\n";
> +}
> +printf("%d\n", $mem_size + $file_offset);

Given that bss and brk are nobits (i.e. only ALLOC) sections, does
file_offset make sense as a load address?  This fails with gold:

$ git checkout v3.18-rc5
$ make  # with gold
[...]
.bss and .brk lack common file offset
.bss and .brk lack common file offset
.bss and .brk lack common file offset
.bss and .brk lack common file offset
  MKPIGGY arch/x86/boot/compressed/piggy.S
Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size
make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1
make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
make: *** [bzImage] Error 2

In ld.bfd brk/bss file_offsets match, but they differ with ld.gold:

$ objdump -h vmlinux.ld
[...]
  0 .text 00818bb3  ffffffff81000000  0000000001000000  00200000  2**12
  CONTENTS, ALLOC, LOAD, READONLY, CODE
[...]
 26 .bss  000e0000  ffffffff81fe8000  0000000001fe8000  013e8000  2**12
  ALLOC
 27 .brk  00026000  ffffffff820c8000  00000000020c8000  013e8000  2**0
  ALLOC

$ objdump -h vmlinux.ld | perl arch/x86/tools/calc_run_size.pl
21946368
# aka 0x14ee000


$ objdump -h vmlinux.gold
[...]
  0 .text 00818bb3  ffffffff81000000  0000000001000000  00001000  2**12
  CONTENTS, ALLOC, LOAD, READONLY, CODE
[...]
 26 .bss  000e0000  ffffffff81feb000  0000000001feb000  00e90000  2**12
  ALLOC
 27 .brk  00026000  ffffffff820cb000  00000000020cb000  00f70000  2**0
  ALLOC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd

2014-11-17 Thread Greg Thelen

On Mon, Nov 17 2014, Greg Thelen wrote:
[...]
> Given that bss and brk are nobits (i.e. only ALLOC) sections, does
> file_offset make sense as a load address.  This fails with gold:
>
> $ git checkout v3.18-rc5
> $ make  # with gold
> [...]
> .bss and .brk lack common file offset
> .bss and .brk lack common file offset
> .bss and .brk lack common file offset
> .bss and .brk lack common file offset
>   MKPIGGY arch/x86/boot/compressed/piggy.S
> Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size
> make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1
> make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2
> make: *** [bzImage] Error 2
[...]

I just saw http://www.spinics.net/lists/kernel/msg1869328.html which
fixes things for me.  Sorry for the noise.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v2] mm, hugetlb: abort __get_user_pages if current has been oom killed

2015-03-09 Thread Greg Thelen

On Mon, Mar 09 2015, David Rientjes wrote:

> If __get_user_pages() is faulting a significant number of hugetlb pages,
> usually as the result of mmap(MAP_LOCKED), it can potentially allocate a
> very large amount of memory.
>
> If the process has been oom killed, this will cause a lot of memory to
> be overcharged to its memcg since it has access to memory reserves or
> could potentially deplete all system memory reserves.

s/memcg/hugetlb_cgroup/ but I don't think hugetlb has any
fatal_signal_pending() based overcharging.  I have no objection to the patch,
but this doesn't seem like a cgroup thing, so the commit log could
stand a tweak.

> In the same way that commit 4779280d1ea4 ("mm: make get_user_pages() 
> interruptible") aborted for pending SIGKILLs when faulting non-hugetlb
> memory, based on the premise of commit 462e00cc7151 ("oom: stop
> allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now
> terminate when the process has been oom killed.
>
> Cc: Naoya Horiguchi 
> Cc: Davidlohr Bueso 
> Cc: "Kirill A. Shutemov" 
> Signed-off-by: David Rientjes 
> ---
>  v2: check signal inside follow_huegtlb_page() loop per Kirill
>
>  mm/hugetlb.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3276,6 +3276,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   struct page *page;
>  
>   /*
> +  * If we have a pending SIGKILL, don't keep faulting pages and
> +  * potentially allocating memory.
> +  */
> + if (unlikely(fatal_signal_pending(current))) {
> + remainder = 0;
> + break;
> + }
> +
> + /*
>* Some archs (sparc64, sh*) have multiple pte_ts to
>* each hugepage.  We have to make sure we get the
>* first, for the page indexing below to work.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch v3] mm, hugetlb: abort __get_user_pages if current has been oom killed

2015-03-09 Thread Greg Thelen

On Mon, Mar 09 2015, David Rientjes wrote:

> If __get_user_pages() is faulting a significant number of hugetlb pages,
> usually as the result of mmap(MAP_LOCKED), it can potentially allocate a
> very large amount of memory.
>
> If the process has been oom killed, this will cause a lot of memory to
> potentially deplete memory reserves.
>
> In the same way that commit 4779280d1ea4 ("mm: make get_user_pages() 
> interruptible") aborted for pending SIGKILLs when faulting non-hugetlb
> memory, based on the premise of commit 462e00cc7151 ("oom: stop
> allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now
> terminate when the process has been oom killed.
>
> Cc: Greg Thelen 
> Cc: Naoya Horiguchi 
> Cc: Davidlohr Bueso 
> Acked-by: "Kirill A. Shutemov" 
> Signed-off-by: David Rientjes 

Looks good.

Acked-by: "Greg Thelen" 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-06 Thread Greg Thelen
On Fri, Feb 6, 2015 at 6:17 AM, Tejun Heo  wrote:
> Hello, Greg.
>
> On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote:
>> So this is a system which charges all cgroups using a shared inode
>> (recharge on read) for all resident pages of that shared inode.  There's
>> only one copy of the page in memory on just one LRU, but the page may be
>> charged to multiple containers' (shared_)usage.
>
> Yeap.
>
>> Perhaps I missed it, but what happens when a child's limit is
>> insufficient to accept all pages shared by its siblings?  Example
>> starting with 2M cached of a shared file:
>>
>>   A
>>   +-B(usage=2M lim=3M hosted_usage=2M)
>> +-C  (usage=0  lim=2M shared_usage=2M)
>> +-D  (usage=0  lim=2M shared_usage=2M)
>> \-E  (usage=0  lim=1M shared_usage=0)
>>
>> If E faults in a new 4K page within the shared file, then E is a sharing
>> participant so it'd be charged the 2M+4K, which pushes E over its
>> limit.
>
> OOM?  It shouldn't be participating in sharing of an inode if it can't
> match others' protection on the inode, I think.  What we're doing now
> w/ page based charging is kinda unfair because in the situations like
> above the one under pressure can end up siphoning off of the larger
> cgroups' protection if they actually use overlapping areas; however,
> for disjoint areas, per-page charging would behave correctly.
>
> So, this part comes down to the same question - whether multiple
> cgroups accessing disjoint areas of a single inode is an important
> enough use case.  If we say yes to that, we better make writeback
> support that too.

If cgroups are about isolation then writing to shared files should be
rare, so I'm willing to say that we don't need to handle shared
writers well.  Shared readers seem like a more valuable use cases
(thin provisioning).  I'm getting overwhelmed with the thought
exercise of automatically moving inodes to common ancestors and back
charging the sharers for shared_usage.  I haven't wrapped my head
around how these shared data pages will get protected.  It seems like
they'd no longer be protected by child min watermarks.

So I know this thread opened with the claim "both memcg and blkcg must
be looking at the same picture.  Deviating them is highly likely to
lead to long-term issues forcing us to look at this again anyway, only
with far more baggage."  But I'm still wondering if the following is
simpler:
(1) leave memcg as a per page controller.
(2) maintain a per inode i_memcg which is set to the common dirtying
ancestor.  If not shared then it'll point to the memcg that the page
was charged to.
(3) when memcg dirtying page pressure is seen, walk up the cgroup tree
writing dirty inodes, this will write shared inodes using blkcg
priority of the respective levels.
(4) background limit wb_check_background_flush() and time based
wb_check_old_data_flush() can feel free to attack shared inodes to
hopefully restore them to non-shared state.
For non-shared inodes, this should behave the same.  For shared inodes
it should only affect those in the hierarchy which is sharing.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-04 Thread Greg Thelen

On Wed, Feb 04 2015, Tejun Heo wrote:

> Hello,
>
> On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
>> If a machine has several top level memcg trying to get some form of
>> isolation (using low, min, soft limit) then a shared libc will be
>> moved to the root memcg where it's not protected from global memory
>> pressure.  At least with the current per page accounting such shared
>> pages often land into some protected memcg.
>
> Yes, it becomes interesting with the low limit as the pressure
> direction is reversed but at the same time overcommitting low limits
> doesn't lead to a sane setup to begin with as it's asking for global
> OOMs anyway, which means that things like libc would end up competing
> at least fairly with other pages for global pressure and should stay
> in memory under most circumstances, which may or may not be
> sufficient.

I agree.  Clarification... I don't plan to overcommit low or min limits.
On machines without overcommited min limits the existing system offers
some protection for shared libs from global reclaim.  Pushing them to
root doesn't.

> Hmm need to think more about it but this only becomes a problem
> with the root cgroup because it doesn't have min setting which is
> expected to be inclusive of all descendants, right?  Maybe the right
> thing to do here is treating the inodes which get pushed to the root
> as a special case and we can implement a mechanism where the root is
> effectively borrowing from the mins of its children which doesn't have
> to be completely correct - e.g. just charge it against all children
> repeatedly and if any has min protection, put it under min protection.
> IOW, make it the baseload for all of them.

I think the linux-next low (and the TBD min) limits also have the
problem for more than just the root memcg.  I'm thinking of a 2M file
shared between C and D below.  The file will be charged to common parent
B.

A
+-B(usage=2M lim=3M min=2M)
  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
  \-E  (usage=0  lim=2M min=0)

The problem arises if A/B/E allocates more than 1M of private
reclaimable file data.  This pushes A/B into reclaim which will reclaim
both the shared file from A/B and private file from A/B/E.  In contrast,
the current per-page memcg would've protected the shared file in either
C or D leaving A/B reclaim to only attack A/B/E.

Pinning the shared file to either C or D, using TBD policy such as mount
option, would solve this for tightly shared files.  But for wide fanout
file (libc) the admin would need to assign a global bucket and this
would be a pain to size due to various job requirements.

>> If two cgroups collude they can use more memory than their limit and
>> oom the entire machine.  Admittedly the current per-page system isn't
>> perfect because deleting a memcg which contains mlocked memory
>> (referenced by a remote memcg) moves the mlocked memory to root
>> resulting in the same issue.  But I'd argue this is more likely with
>
> Hmmm... why does it do that?  Can you point me to where it's
> happening?

My mistake, I was thinking of older kernels which reparent memory.
Though I can't say v3.19-rc7 handles this collusion any better.  Instead
of reparenting the mlocked memory, it's left in an invisible (offline)
memcg.  Unlike older kernels the memory doesn't appear in
root/memory.stat[unevictable]; instead it is buried in
root/memory.stat[total_unevictable] which includes mlocked memory in
visible (online) and invisible (offline) children.

>> the RFC because it doesn't involve the cgroup deletion/reparenting.  A
>
> One approach could be expanding on the forementioned scheme and make
> all sharing cgroups to get charged for the shared inodes they're
> using, which should render such collusions entirely pointless.
> e.g. let's say we start with the following.
>
>   A   (usage=48M)
>   +-B (usage=16M)
>   \-C (usage=32M)
>
> And let's say, C starts accessing an inode which is 8M and currently
> associated with B.
>
>   A   (usage=48M, hosted= 8M)
>   +-B (usage= 8M, shared= 8M)
>   \-C (usage=32M, shared= 8M)
>
> The only extra charging that we'd be doing is charing C with extra
> 8M.  Let's say another cgroup D gets created and uses 4M.
>
>   A   (usage=56M, hosted= 8M)
>   +-B (usage= 8M, shared= 8M)
>   +-C (usage=32M, shared= 8M)
>   \-D (usage= 8M)
>
> and it also accesses the inode.
>
>   A   (usage=56M, hosted= 8M)
>   +-B (usage= 8M, shared= 8M)
>   +-C (usage=32M, shared= 8M)
>   \-D (usage= 8M, shared= 8M)
>
> We&#

Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-05 Thread Greg Thelen

On Thu, Feb 05 2015, Tejun Heo wrote:

> Hello, Greg.
>
> On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote:
>> I think the linux-next low (and the TBD min) limits also have the
>> problem for more than just the root memcg.  I'm thinking of a 2M file
>> shared between C and D below.  The file will be charged to common parent
>> B.
>> 
>>  A
>>  +-B(usage=2M lim=3M min=2M)
>>+-C  (usage=0  lim=2M min=1M shared_usage=2M)
>>+-D  (usage=0  lim=2M min=1M shared_usage=2M)
>>\-E  (usage=0  lim=2M min=0)
>> 
>> The problem arises if A/B/E allocates more than 1M of private
>> reclaimable file data.  This pushes A/B into reclaim which will reclaim
>> both the shared file from A/B and private file from A/B/E.  In contrast,
>> the current per-page memcg would've protected the shared file in either
>> C or D leaving A/B reclaim to only attack A/B/E.
>> 
>> Pinning the shared file to either C or D, using TBD policy such as mount
>> option, would solve this for tightly shared files.  But for wide fanout
>> file (libc) the admin would need to assign a global bucket and this
>> would be a pain to size due to various job requirements.
>
> Shouldn't we be able to handle it the same way as I proposed for
> handling sharing?  The above would look like
>
>   A
>   +-B(usage=2M lim=3M min=2M hosted_usage=2M)
> +-C  (usage=0  lim=2M min=1M shared_usage=2M)
> +-D  (usage=0  lim=2M min=1M shared_usage=2M)
> \-E  (usage=0  lim=2M min=0)
>
> Now, we don't wanna use B's min verbatim on the hosted inodes shared
> by children but we're unconditionally charging the shared amount to
> all sharing children, which means that we're eating into the min
> settings of all participating children, so, we should be able to use
> sum of all sharing children's min-covered amount as the inode's min,
> which of course is to be contained inside the min of the parent.
>
> Above, we're charging 2M to C and D, each of which has 1M min which is
> being consumed by the shared charge (the shared part won't get
> reclaimed from the internal pressure of children, so we're really
> taking that part away from it).  Summing them up, the shared inode
> would have 2M protection which is honored as long as B as a whole is
> under its 3M limit.  This is similar to creating a dedicated child for
> each shared resource for low limits.  The downside is that we end up
> guarding the shared inodes more than non-shared ones, but, after all,
> we're charging it to everybody who's using it.
>
> Would something like this work?

Maybe, but I want to understand more about how pressure works in the
child.  As C (or D) allocates non shared memory does it perform reclaim
to ensure that its (C.usage + C.shared_usage < C.lim).  Given C's
shared_usage is linked into B.LRU it wouldn't be naturally reclaimable
by C.  Are you thinking that charge failures on cgroups with non zero
shared_usage would, as needed, induce reclaim of parent's hosted_usage?


Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-05 Thread Greg Thelen

On Thu, Feb 05 2015, Tejun Heo wrote:

> Hey,
>
> On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote:
>> >A
>> >+-B(usage=2M lim=3M min=2M hosted_usage=2M)
>> >  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
>> >  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
>> >  \-E  (usage=0  lim=2M min=0)
> ...
>> Maybe, but I want to understand more about how pressure works in the
>> child.  As C (or D) allocates non shared memory does it perform reclaim
>> to ensure that its (C.usage + C.shared_usage < C.lim).  Given C's
>
> Yes.
>
>> shared_usage is linked into B.LRU it wouldn't be naturally reclaimable
>> by C.  Are you thinking that charge failures on cgroups with non zero
>> shared_usage would, as needed, induce reclaim of parent's hosted_usage?
>
> Hmmm I'm not really sure but why not?  If we properly account for
> the low protection when pushing inodes to the parent, I don't think
> it'd break anything.  IOW, allow the amount beyond the sum of low
> limits to be reclaimed when one of the sharers is under pressure.
>
> Thanks.

I'm not saying that it'd break anything.  I think it's required that
children perform reclaim on shared data hosted in the parent.  The child
is limited by shared_usage, so it needs the ability to reclaim it.  So I
think we're in agreement.  The child will reclaim the parent's
hosted_usage when the child is charged for shared_usage.  Ideally the
only parental memory reclaimed in this situation would be shared.  But I
think (though I can't claim to have followed the new memcg philosophy
discussions) that internal nodes in the cgroup tree (i.e. parents) do not
have any resources charged directly to them.  All resources are charged
to leaf cgroups, which linger until the resources are uncharged.  Thus
the LRUs of the parent will only contain hosted (shared) memory.  This
thankfully focuses parental reclaim on shared pages.  Child pressure
will, unfortunately, reclaim shared pages used by any container.  But if
shared pages were charged to all sharing containers, then it will help
relieve pressure in the caller.

So this is a system which charges all cgroups using a shared inode
(recharge on read) for all resident pages of that shared inode.  There's
only one copy of the page in memory on just one LRU, but the page may be
charged to multiple containers' (shared_)usage.

Perhaps I missed it, but what happens when a child's limit is
insufficient to accept all pages shared by its siblings?  Example
starting with 2M cached of a shared file:

A
+-B(usage=2M lim=3M hosted_usage=2M)
  +-C  (usage=0  lim=2M shared_usage=2M)
  +-D  (usage=0  lim=2M shared_usage=2M)
  \-E  (usage=0  lim=1M shared_usage=0)

If E faults in a new 4K page within the shared file, then E is a sharing
participant, so it'd be charged the 2M+4K, which pushes E over its
limit.


Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-03 Thread Greg Thelen
On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo  wrote:
> Hey,
>
> On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote:
>
>> Keeping shared inodes in common ancestor is reasonable.
>> We could schedule asynchronous moving when somebody opens or mmaps
>> inode from outside of its current cgroup. But it's not clear when
>> inode should be moved into opposite direction: when inode should
>> become private and how detect if it's no longer shared.
>>
>> For example each inode could keep yet another pointer to memcg where
>> it will track subtree of cgroups where it was accessed in past 5
>> minutes or so. And sometimes that informations goes into moving thread.
>>
>> Actually I don't see other options except that time-based estimation:
>> tracking all cgroups for each inode is too expensive, moving pages
>> from one lru to another is expensive too. So, moving inodes back and
>> forth at each access from the outside world is not an option.
>> That should be rare operation which runs in background or in reclaimer.
>
> Right, what strategy to use for migration is up for debate, even for
> moving to the common ancestor.  e.g. should we do that on the first
> access?  In the other direction, it gets more interesting.  Let's say
> if we decide to move back an inode to a descendant, what if that
> triggers OOM condition?  Do we still go through it and cause OOM in
> the target?  Do we even want automatic moving in this direction?
>
> For explicit cases, userland can do FADV_DONTNEED, I suppose.
>
> Thanks.
>
> --
> tejun

I don't have any killer objections; most of my worries are isolation concerns.

If a machine has several top level memcgs trying to get some form of
isolation (using low, min, soft limit) then a shared libc will be
moved to the root memcg where it's not protected from global memory
pressure.  At least with the current per-page accounting such shared
pages often land in some protected memcg.

If two cgroups collude they can use more memory than their limit and
oom the entire machine.  Admittedly the current per-page system isn't
perfect because deleting a memcg which contains mlocked memory
(referenced by a remote memcg) moves the mlocked memory to root
resulting in the same issue.  But I'd argue this is more likely with
the RFC because it doesn't involve the cgroup deletion/reparenting.  A
possible tweak to shore up the current system is to move such mlocked
pages to the memcg of the surviving locker.  When the machine is oom
it's often nice to examine memcg state to determine which container is
using the memory.  Tracking down who's contributing to a shared
container is non-trivial.

I actually have a set of patches which add a memcg=M mount option to
memory backed file systems.  I was planning on proposing them,
regardless of this RFC, and this discussion makes them even more
appealing.  If we go in this direction, then we'd need a similar
notion for disk based filesystems.  As Konstantin suggested, it'd be
really nice to specify charge policy on a per file, or directory, or
bind mount basis.  This allows shared files to be deterministically
charged to a known container.  We'd need to flesh out the policies:
e.g. if two bind mounts each specify different charge targets for the
same inode, I guess we just pick one.  Though the nature of this
catch-all shared container is strange.  Presumably a machine manager
would need to create it as an unlimited container (or at least as big
as the sum of all shared files) so that any app which decides it wants
to mlock all shared files has a way to do so without ooming the shared
container.  In the current per-page approach it's possible to lock
shared libs.  But the machine manager would need to decide how much
system ram to set aside for this catch-all shared container.

When there's large incidental sharing, things get sticky.  A periodic
filesystem scanner (e.g. virus scanner, or grep foo -r /) in a small
container would pull all pages to the root memcg where they are exposed
to root pressure, which breaks isolation.  This is concerning.  Perhaps
such accesses could be decorated with (O_NO_MOVEMEM).
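
For example (O_NO_MOVEMEM is purely imaginary here; the define below is
just a placeholder so the sketch compiles):

  #include <fcntl.h>

  #ifndef O_NO_MOVEMEM
  #define O_NO_MOVEMEM 0  /* imaginary "don't move page ownership" flag */
  #endif

  /* How a scanner might opt out of pulling page ownership to its memcg. */
  static int open_for_scan(const char *path)
  {
          return open(path, O_RDONLY | O_NO_MOVEMEM);
  }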

So this RFC change will introduce a significant change to user space
machine managers and perturb isolation.  Is the resulting system
better?  It's not clear; it's the devil known vs the devil unknown.
Maybe it'd be easier if the memcgs I'm talking about were not allowed to
share page cache (aka copy-on-read) even for files which are jointly
visible.  That would provide today's interface while avoiding the
problematic sharing.


Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-11 Thread Greg Thelen
On Tue, Feb 10, 2015 at 6:19 PM, Tejun Heo  wrote:
> Hello, again.
>
> On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote:
>> If we can argue that memcg and blkcg having different views is
>> meaningful and characterize and justify the behaviors stemming from
>> the deviation, sure, that'd be fine, but I don't think we have that as
>> of now.
>
> If we assume that memcg and blkcg having different views is something
> which represents an acceptable compromise considering the use cases
> and implementation convenience - IOW, if we assume that read-sharing
> is something which can happen regularly while write sharing is a
> corner case and that while not completely correct the existing
> self-corrective behavior from tracking ownership per-page at the point
> of instantiation is good enough (as a memcg under pressure is likely
> to give up shared pages to be re-instantiated by another sharer w/
> more budget), we need to do the impedance matching between memcg and
> blkcg at the writeback layer.
>
> The main issue there is that the last chain of IO pressure propagation
> is realized by making individual dirtying tasks to converge on a
> common target dirty ratio point which naturally depending on those
> tasks seeing the same picture in terms of the current write bandwidth
> and available memory and how much of it is dirty.  Tasks dirtying
> pages belonging to the same memcg while some of them are mostly being
> written out by a different blkcg would wreck the mechanism.  It won't
> be difficult for one subset to make the other to consider themselves
> under severe IO pressure when there actually isn't one in that group
> possibly stalling and starving those tasks unduly.  At more basic
> level, it's just wrong for one group to be writing out significant
> amount for another.
>
> These issues can persist indefinitely if we follow the same
> instantiator-owns rule for inode writebacks.  Even if we reset the
> ownership when an inode becomes clean, it wouldn't work as it can be
> dirtied over and over again while under writeback, and when things
> like this happen, the behavior may become extremely difficult to
> understand or characterize.  We don't have visibility into how
> individual pages of an inode get distributed across multiple cgroups,
> who's currently responsible for writing back a specific inode or how
> dirty ratio mechanism is behaving in the face of the unexpected
> combination of parameters.
>
> Even if we assume that write sharing is a fringe case, we need
> something better than first-whatever rule when choosing which blkcg is
> responsible for writing a shared inode out.  There needs to be a
> constant corrective pressure so that incidental and temporary sharings
> don't end up screwing up the mechanism for an extended period of time.
>
> Greg mentioned choosing the closest ancestor of the sharers, which
> basically pushes inode sharing policy implementation down to writeback
> from memcg.  This could work but we end up with the same collusion
> problem as when this is used for memcg and it's even more difficult to
> solve this at writeback layer - we'd have to communicate the shared
> state all the way down to block layer and then implement a mechanism
> there to take corrective measures and even after that we're likely to
> end up with prolonged state where dirty ratio propagation is
> essentially broken as the dirtier and writer would be seeing different
> pictures.
>
> So, based on the assumption that write sharings are mostly incidental
> and temporary (ie. we're basically declaring that we don't support
> persistent write sharing), how about something like the following?
>
> 1. memcg contiues per-page tracking.
>
> 2. Each inode is associated with a single blkcg at a given time and
>written out by that blkcg.
>
> 3. While writing back, if the number of pages from foreign memcg's is
>higher than certain ratio of total written pages, the inode is
>marked as disowned and the writeback instance is optionally
>terminated early.  e.g. if the ratio of foreign pages is over 50%
>after writing out the number of pages matching 5s worth of write
>bandwidth for the bdi, mark the inode as disowned.
>
> 4. On the following dirtying of the inode, the inode is associated
>with the matching blkcg of the dirtied page.  Note that this could
>be the next cycle as the inode could already have been marked dirty
>by the time the above condition triggered.  In that case, the
>following writeback would be terminated early too.
>
> This should provide sufficient corrective pressure so that incidental
> and temporary sharing of an inode doesn't become a persistent issue
> while keeping the complexity necessary for implementing such pressure
> fairly minimal and self-contained.  Also, the changes necessary for
> individual filesystems would be minimal.
>
> I think this should work well enough as long as the forementioned
> assumptions are true - IOW, if we maintain tha

Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-02-11 Thread Greg Thelen
On Wed, Feb 11, 2015 at 12:33 PM, Tejun Heo  wrote:
[...]
>> page count to throttle based on blkcg's bandwidth.  Note: memcg
>> doesn't yet have dirty page counts, but several of us have made
>> attempts at adding the counters.  And it shouldn't be hard to get them
>> merged.
>
> Can you please post those?

Will do.  Rebasing and testing needed, so it won't be today.


Re: [RFC] Making memcg track ownership per address_space or anon_vma

2015-01-29 Thread Greg Thelen

On Thu, Jan 29 2015, Tejun Heo wrote:

> Hello,
>
> Since the cgroup writeback patchset[1] have been posted, several
> people brought up concerns about the complexity of allowing an inode
> to be dirtied against multiple cgroups is necessary for the purpose of
> writeback and it is true that a significant amount of complexity (note
> that bdi still needs to be split, so it's still not trivial) can be
> removed if we assume that an inode always belongs to one cgroup for
> the purpose of writeback.
>
> However, as mentioned before, this issue is directly linked to whether
> memcg needs to track the memory ownership per-page.  If there are
> valid use cases where the pages of an inode must be tracked to be
> owned by different cgroups, cgroup writeback must be able to handle
> that situation properly.  If there are no such cases, the cgroup
> writeback support can be simplified but again we should put memcg on
> the same cadence and enforce per-inode (or per-anon_vma) ownership
> from the beginning.  The conclusion can be either way - per-page or
> per-inode - but both memcg and blkcg must be looking at the same
> picture.  Deviating them is highly likely to lead to long-term issues
> forcing us to look at this again anyway, only with far more baggage.
>
> One thing to note is that the per-page tracking which is currently
> employed by memcg seems to have been born more out of convenience
> rather than requirements for any actual use cases.  Per-page ownership
> makes sense iff pages of an inode have to be associated with different
> cgroups - IOW, when an inode is accessed by multiple cgroups; however,
> currently, memcg assigns a page to its instantiating memcg and leaves
> it at that till the page is released.  This means that if a page is
> instantiated by one cgroup and then subsequently accessed only by a
> different cgroup, whether the page's charge gets moved to the cgroup
> which is actively using it is purely incidental.  If the page gets
> reclaimed and released at some point, it'll be moved.  If not, it
> won't.
>
> AFAICS, the only case where the current per-page accounting works
> properly is when disjoint sections of an inode are used by different
> cgroups and the whole thing hinges on whether this use case justifies
> all the added overhead including page->mem_cgroup pointer and the
> extra complexity in the writeback layer.  FWIW, I'm doubtful.
> Johannes, Michal, Greg, what do you guys think?
>
> If the above use case - a huge file being actively accessed disjointly
> by multiple cgroups - isn't significant enough and there aren't other
> use cases that I missed which can benefit from the per-page tracking
> that's currently implemented, it'd be logical to switch to per-inode
> (or per-anon_vma or per-slab) ownership tracking.  For the short term,
> even just adding an extra ownership information to those containing
> objects and inheriting those to page->mem_cgroup could work although
> it'd definitely be beneficial to eventually get rid of
> page->mem_cgroup.
>
> As with per-page, when the ownership terminates is debatable w/
> per-inode tracking.  Also, supporting some form of shared accounting
> across different cgroups may be useful (e.g. shared library's memory
> being equally split among anyone who accesses it); however, these
> aren't likely to be major and trying to do something smart may affect
> other use cases adversely, so it'd probably be best to just keep it
> dumb and clear the ownership when the inode loses all pages (a cgroup
> can disown such inode through FADV_DONTNEED if necessary).
>
> What do you guys think?  If making memcg track ownership at per-inode
> level, even for just the unified hierarchy, is the direction we can
> take, I'll go ahead and simplify the cgroup writeback patchset.
>
> Thanks.

I find simplification appealing.  But I'm not sure it will fly, if for no
other reason than the shared accounting.  I'm ignoring intentional
sharing, used by carefully crafted apps, and just thinking about
incidental sharing (e.g. libc).

Example:

$ mkdir small
$ echo 1M > small/memory.limit_in_bytes
$ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) &

$ mkdir big
$ echo 10G > big/memory.limit_in_bytes
$ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) &


Assuming big/mlockall_database mlocks all of libc, then it will oom kill
the small memcg because libc is owned by small due to it having touched it
first.  It'd be hard to figure out what small did wrong to deserve the
oom kill.

FWIW we've been using memcg writeback where inodes have a memcg
writeback owner.  Once multiple memcgs write to an inode then the inode
becomes writeback shared, which makes it more likely to be written.  Once
cleaned the inode is then again able to be privately owned:
https://lkml.org/lkml/2011/8/17/200

Re: [GIT PULL] ext4 bug fixes for 4.6

2016-04-09 Thread Greg Thelen

Theodore Ts'o wrote:

> The following changes since commit 243d50678583100855862bc084b8b307eea67f68:
>
>   Merge branch 'overlayfs-linus' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs (2016-03-22 
> 13:11:15 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
> tags/ext4_for_linus_stable
>
> for you to fetch changes up to c325a67c72903e1cc30e990a15ce745bda0dbfde:
>
>   ext4: ignore quota mount options if the quota feature is enabled 
> (2016-04-03 17:03:37 -0400)
>
> 
> These changes contains a fix for overlayfs interacting with some
> (badly behaved) dentry code in various file systems.  These have been
> reviewed by Al and the respective file system maintainers and are going
> through the ext4 tree for convenience.
>
> This also has a few ext4 encryption bug fixes that were discovered in
> Android testing (yes, we will need to get these sync'ed up with the
> fs/crypto code; I'll take care of that).  It also has some bug fixes
> and a change to ignore the legacy quota options to allow for xfstests
> regression testing of ext4's internal quota feature and to be more
> consistent with how xfs handles this case.
>
> 
> Dan Carpenter (1):
>   ext4 crypto: fix some error handling
>
> Filipe Manana (1):
>   btrfs: fix crash/invalid memory access on fsync when using overlayfs
>
> Jan Kara (1):
>   ext4: retry block allocation for failed DIO and DAX writes
>
> Miklos Szeredi (4):
>   fs: add file_dentry()
>   nfs: use file_dentry()
>   ext4: use dget_parent() in ext4_file_open()
>   ext4: use file_dentry()
>
> Theodore Ts'o (7):
>   ext4: check if in-inode xattr is corrupted in 
> ext4_expand_extra_isize_ea()
>   ext4 crypto: don't let data integrity writebacks fail with ENOMEM
>   ext4 crypto: use dget_parent() in ext4_d_revalidate()
>   ext4: allow readdir()'s of large empty directories to be interrupted

Ted,

I've been testing 5b5b7fd185e9 (linus/master) and seeing that
interrupted readdir() now returns duplicate dirents.

Reverting commit 1028b55bafb7 ("ext4: allow readdir()'s of large empty
directories to be interrupted") avoids duplicates.

On 5b5b7fd185e9 a SIGPROF to the test program below occasionally causes an
"already seen name" error.  On older kernels, it runs indefinitely
without error.

/*
  mkdir /tmp/foo
  cd /tmp/foo
  for i in $(seq 0 599); do touch $i; done
  /tmp/scanner &
  kill -PROF %1
 */

#include <dirent.h>
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void handler(int unused)
{
}

int main() {
enum { count = 600 };
char seen[count];
struct dirent *ent;
DIR *dir;
struct sigaction sa = {0};

sa.sa_handler = handler;
if (sigemptyset(&sa.sa_mask))
err(1, "sigemptyset");
sa.sa_flags = SA_RESTART;
if (sigaction(SIGPROF, &sa, NULL))
err(1, "sigaction");

for (;;) {
memset(seen, 0, sizeof(seen));
dir = opendir(".");
if (dir == NULL)
err(1, "opendir(.)");
while ((ent = readdir(dir)) != NULL) {
int idx;

if ((strcmp(ent->d_name, ".") == 0) ||
(strcmp(ent->d_name, "..") == 0)) {
continue;
}
idx = atoi(ent->d_name);
if (idx >= count) {
errx(1, "bogus name %s index %d", ent->d_name, 
idx);
}
if (seen[idx]) {
errx(1, "already seen name %s index %d", 
ent->d_name, idx);
}
seen[idx] = 1;
}
if (closedir(dir))
err(1, "closedir(.)");
}
}

>   ext4: add lockdep annotations for i_data_sem
>   ext4: avoid calling dquot_get_next_id() if quota is not enabled
>   ext4: ignore quota mount options if the quota feature is enabled
>
>  fs/btrfs/file.c|  2 +-
>  fs/dcache.c|  5 -
>  fs/ext4/crypto.c   | 49 +
>  fs/ext4/dir.c  |  5 +
>  fs/ext4/ext4.h | 29 +++--
>  fs/ext4/file.c | 12 
>  fs/ext4/inode.c| 58 
> --
>  fs/ext4/move_extent.c  | 11 +--
>  fs/ext4/namei.c|  5 +
>  fs/ext4/page-io.c  | 14 +-
>  fs/ext4/readpage.c |  2 +-
>  fs/ext4/super.c| 61 
> +++--
>  fs/ext4/xattr.c| 32 
>  fs/nfs/dir.c   |  6

Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path

2015-09-30 Thread Greg Thelen

Vladimir Davydov wrote:

> Currently, to charge a page to kmemcg one should use alloc_kmem_pages
> helper. When the page is not needed anymore it must be freed with
> free_kmem_pages helper, which will uncharge the page before freeing it.
> Such a design is acceptable for thread info pages and kmalloc large
> allocations, which are currently the only users of alloc_kmem_pages, but
> it gets extremely inconvenient if one wants to make use of batched free
> (e.g. to charge page tables - see release_pages) or page reference
> counter (pipe buffers - see anon_pipe_buf_release).
>
> To overcome this limitation, this patch moves kmemcg uncharge code to
> the generic free path and zaps free_kmem_pages helper. To distinguish
> kmem pages from other page types, it makes alloc_kmem_pages initialize
> page->_mapcount to a special value and introduces a new PageKmem helper,
> which returns true if it sees this value.
>
> Signed-off-by: Vladimir Davydov 
> ---
>  include/linux/gfp.h|  3 ---
>  include/linux/page-flags.h | 22 ++
>  kernel/fork.c  |  2 +-
>  mm/page_alloc.c| 26 --
>  mm/slub.c  |  2 +-
>  mm/swap.c  |  3 ++-
>  6 files changed, 34 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f92cbd2f4450..b46147c45966 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -384,9 +384,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
>  
> -extern void __free_kmem_pages(struct page *page, unsigned int order);
> -extern void free_kmem_pages(unsigned long addr, unsigned int order);
> -
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 416509e26d6d..a190719c2f46 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -594,6 +594,28 @@ static inline void __ClearPageBalloon(struct page *page)
>  }
>  
>  /*
> + * PageKmem() returns true if the page was allocated with alloc_kmem_pages().
> + */
> +#define PAGE_KMEM_MAPCOUNT_VALUE (-512)
> +
> +static inline int PageKmem(struct page *page)
> +{
> + return atomic_read(&page->_mapcount) == PAGE_KMEM_MAPCOUNT_VALUE;
> +}
> +
> +static inline void __SetPageKmem(struct page *page)
> +{
> + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
> + atomic_set(&page->_mapcount, PAGE_KMEM_MAPCOUNT_VALUE);
> +}

What do you think about several special mapcount values for various
types of kmem?

It helps users and administrators break down memory usage.

A nice equation is:
  memory.usage_in_bytes = memory.stat[file + anon + unevictable + kmem]

Next, it's helpful to be able to breakdown kmem into:
  kmem = stack + pgtable + slab + ...

On one hand (and in the kernel I use internally) we can use separate
per-memcg counters for each kmem type, then reconstitute memory.kmem as
needed by adding them together.  But keeping a single kernel kmem
counter is workable if there is a way to break down the memory charged
to a container (e.g. by walking a /proc/kpageflags-ish or per-memcg
memory.kpageflags-ish file).
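
For illustration, something along these lines (the extra markers and
names below are made up; only PAGE_KMEM_MAPCOUNT_VALUE exists in this
patch):

  /* One distinct _mapcount marker per kmem type so usage can later be
   * broken down by type.  Values here are purely illustrative. */
  enum kmem_page_type {
          KMEM_PAGE_NONE    = -1,     /* ordinary "unused" _mapcount value */
          KMEM_PAGE_GENERIC = -512,   /* the value this patch introduces */
          KMEM_PAGE_STACK   = -513,   /* hypothetical */
          KMEM_PAGE_PGTABLE = -514,   /* hypothetical */
  };

  /* Classify a raw _mapcount value; anything unrecognized is not kmem. */
  static inline int page_kmem_type(int mapcount_val)
  {
          switch (mapcount_val) {
          case KMEM_PAGE_GENERIC:
          case KMEM_PAGE_STACK:
          case KMEM_PAGE_PGTABLE:
                  return mapcount_val;
          default:
                  return KMEM_PAGE_NONE;
          }
  }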

> +static inline void __ClearPageKmem(struct page *page)
> +{
> + VM_BUG_ON_PAGE(!PageKmem(page), page);
> + atomic_set(&page->_mapcount, -1);
> +}
> +
> +/*
>   * If network-based swap is enabled, sl*b must keep track of whether pages
>   * were allocated from pfmemalloc reserves.
>   */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2845623fb582..c23f8a17e99e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -169,7 +169,7 @@ static struct thread_info *alloc_thread_info_node(struct 
> task_struct *tsk,
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> - free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> + free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48aaf7b9f253..88d85367c81e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -942,6 +942,10 @@ static bool free_pages_prepare(struct page *page, 
> unsigned int order)
>  
>   if (PageAnon(page))
>   page->mapping = NULL;
> + if (PageKmem(page)) {
> + memcg_kmem_uncharge_pages(page, order);
> + __ClearPageKmem(page);
> + }
>   bad += free_pages_check(page);
>   for (i = 1; i < (1 << order); i++) {
>   if (compound)
> @@ -3434,6 +3438,8 @@ struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned 
> int order)
>   return NULL;
>   page = alloc_pages(gfp_mask, order);
>   memcg_kmem_commit_charge(page, memcg, order);
> + if (page)
> + __SetPageKmem(page);
>   return page;
>  }
> 

[PATCH v2] writeback: use exact memcg dirty counts

2019-03-29 Thread Greg Thelen
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.
Thus readers such as balance_dirty_pages() may see a nontrivial error
margin: 32 pages per cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount of space available for the file lru that matters.
If a container has memory.max-200MiB of non-reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
cpu_set_t affinity;

CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
int i, wrote;

set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
  }

  int main(int argc, char **argv)
  {
int cpu, flush_cpu = 1, output_fd;
const char *output;

if (argc != 2)
errx(1, "usage: output_file");

output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");

output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);

for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}

set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.
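
The exact read is conceptually simple: add every online cpu's per-cpu
delta to the atomic total and clamp the result at zero.  A simplified
user-space model (not the kernel helper itself, which walks
for_each_online_cpu()):

  /* Model of the exact counter read used on the throttling path. */
  static long exact_page_state(long atomic_total, const long *percpu_delta,
                               int nr_cpus)
  {
          long x = atomic_total;
          int cpu;

          for (cpu = 0; cpu < nr_cpus; cpu++)
                  x += percpu_delta[cpu];
          return x < 0 ? 0 : x;
  }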

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.  But that is saved for later.

Cc: sta...@vger.kernel.org # v4.16+
Signed-off-by: Greg Thelen 
---
Changelog since v1:
- Move memcg_exact_page_state() into memcontrol.c.
- Unconditionally gather exact (per cpu) counters in mem_cgroup_wb_stats(), it's
  not called in performance sensitive paths.
- Unconditionally che

Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-29 Thread Greg Thelen
Andrew Morton  wrote:

> On Thu,  7 Mar 2019 08:56:32 -0800 Greg Thelen  wrote:
>
>> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
>> memory.stat reporting") memcg dirty and writeback counters are managed
>> as:
>> 1) per-memcg per-cpu values in range of [-32..32]
>> 2) per-memcg atomic counter
>> When a per-cpu counter cannot fit in [-32..32] it's flushed to the
>> atomic.  Stat readers only check the atomic.
>> Thus readers such as balance_dirty_pages() may see a nontrivial error
>> margin: 32 pages per cpu.
>> Assuming 100 cpus:
>>4k x86 page_size:  13 MiB error per memcg
>>   64k ppc page_size: 200 MiB error per memcg
>> Considering that dirty+writeback are used together for some decisions
>> the errors double.
>> 
>> This inaccuracy can lead to undeserved oom kills.  One nasty case is
>> when all per-cpu counters hold positive values offsetting an atomic
>> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
>> balance_dirty_pages() only consults the atomic and does not consider
>> throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
>> 13..200 MiB range then there's absolutely no dirty throttling, which
>> burdens vmscan with only dirty+writeback pages thus resorting to oom
>> kill.
>> 
>> It could be argued that tiny containers are not supported, but it's more
>> subtle.  It's the amount of space available for the file lru that matters.
>> If a container has memory.max-200MiB of non-reclaimable memory, then it
>> will also suffer such oom kills on a 100 cpu machine.
>> 
>> ...
>> 
>> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
>> collect exact per memcg counters when a memcg is close to the
>> throttling/writeback threshold.  This avoids the aforementioned oom
>> kills.
>> 
>> This does not affect the overhead of memory.stat, which still reads the
>> single atomic counter.
>> 
>> Why not use percpu_counter?  memcg already handles cpus going offline,
>> so no need for that overhead from percpu_counter.  And the
>> percpu_counter spinlocks are more heavyweight than is required.
>> 
>> It probably also makes sense to include exact dirty and writeback
>> counters in memcg oom reports.  But that is saved for later.
>
> Nice changelog, thanks.
>
>> Signed-off-by: Greg Thelen 
>
> Did you consider cc:stable for this?  We may as well - the stablebots
> backport everything which might look slightly like a fix anyway :(

Good idea.  Done in -v2 of the patch.

>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct 
>> mem_cgroup *memcg,
>>  return x;
>>  }
>>  
>> +/* idx can be of type enum memcg_stat_item or node_stat_item */
>> +static inline unsigned long
>> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
>> +{
>> +long x = atomic_long_read(&memcg->stat[idx]);
>> +#ifdef CONFIG_SMP
>> +int cpu;
>> +
>> +for_each_online_cpu(cpu)
>> +x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
>> +if (x < 0)
>> +x = 0;
>> +#endif
>> +return x;
>> +}
>
> This looks awfully heavyweight for an inline function.  Why not make it
> a regular function and avoid the bloat and i-cache consumption?

Done in -v2.

> Also, did you instead consider making this spill the percpu counters
> into memcg->stat[idx]?  That might be more useful for potential future
> callers.  It would become a little more expensive though.

I looked at that approach, but couldn't convince myself it was safe.  I
kept staring at "Remote [...] Write accesses can cause unique problems
due to the relaxed synchronization requirements for this_cpu
operations." from this_cpu_ops.txt.  So I'd like to delay this possible
optimization for a later patch.


Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-29 Thread Greg Thelen
Johannes Weiner  wrote:

> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3880,6 +3880,7 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
>> bdi_writeback *wb)
>>   * @pheadroom: out parameter for number of allocatable pages according to 
>> memcg
>>   * @pdirty: out parameter for number of dirty pages
>>   * @pwriteback: out parameter for number of pages under writeback
>> + * @exact: determines exact counters are required, indicates more work.
>>   *
>>   * Determine the numbers of file, headroom, dirty, and writeback pages in
>>   * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
>> @@ -3890,18 +3891,29 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
>> bdi_writeback *wb)
>>   * ancestors.  Note that this doesn't consider the actual amount of
>>   * available memory in the system.  The caller should further cap
>>   * *@pheadroom accordingly.
>> + *
>> + * Return value is the error precision associated with *@pdirty
>> + * and *@pwriteback.  When @exact is set this a minimal value.
>>   */
>> -void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long 
>> *pfilepages,
>> - unsigned long *pheadroom, unsigned long *pdirty,
>> - unsigned long *pwriteback)
>> +unsigned long
>> +mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>> +unsigned long *pheadroom, unsigned long *pdirty,
>> +unsigned long *pwriteback, bool exact)
>>  {
>>  struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>>  struct mem_cgroup *parent;
>> +unsigned long precision;
>>  
>> -*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> -
>> +if (exact) {
>> +precision = 0;
>> +*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
>> +*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
>> +} else {
>> +precision = MEMCG_CHARGE_BATCH * num_online_cpus();
>> +*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> +*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>> +}
>>  /* this should eventually include NR_UNSTABLE_NFS */
>> -*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>>  *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
>>   (1 << LRU_ACTIVE_FILE));
>>  *pheadroom = PAGE_COUNTER_MAX;
>> @@ -3913,6 +3925,8 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, 
>> unsigned long *pfilepages,
>>  *pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
>>  memcg = parent;
>>  }
>> +
>> +return precision;
>
> Have you considered unconditionally using the exact version here?
>
> It does for_each_online_cpu(), but until very, very recently we did
> this per default for all stats, for years. It only became a problem in
> conjunction with the for_each_memcg loops when frequently reading
> memory stats at the top of a very large hierarchy.
>
> balance_dirty_pages() is called against memcgs that actually own the
> inodes/memory and doesn't do the additional recursive tree collection.
>
> It's also not *that* hot of a function, and in the io path...
>
> It would simplify this patch immensely.

Good idea.  Done in -v2 of the patch:
https://lore.kernel.org/lkml/20190329174609.164344-1-gthe...@google.com/


[PATCH] kbuild: clean compressed initramfs image

2019-07-21 Thread Greg Thelen
Since commit 9e3596b0c653 ("kbuild: initramfs cleanup, set target from
Kconfig") "make clean" leaves behind compressed initramfs images.
Example:
  $ make defconfig
  $ sed -i 
's|CONFIG_INITRAMFS_SOURCE=""|CONFIG_INITRAMFS_SOURCE="/tmp/ir.cpio"|' .config
  $ make olddefconfig
  $ make -s
  $ make -s clean
  $ git clean -ndxf | grep initramfs
  Would remove usr/initramfs_data.cpio.gz

clean rules do not have CONFIG_* context so they do not know which
compression format was used.  Thus they don't know which files to
delete.

Tell clean to delete all possible compression formats.

Once patched, usr/initramfs_data.cpio.gz and friends are deleted by
"make clean".

Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Greg Thelen 
---
 usr/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/usr/Makefile b/usr/Makefile
index 6a89eb019275..e6f7cb2f81db 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -11,6 +11,9 @@ datafile_y = initramfs_data.cpio$(suffix_y)
 datafile_d_y = .$(datafile_y).d
 AFLAGS_initramfs_data.o += -DINITRAMFS_IMAGE="usr/$(datafile_y)"
 
+# clean rules do not have CONFIG_INITRAMFS_COMPRESSION.  So clean up after all
+# possible compression formats.
+clean-files += initramfs_data.cpio*
 
 # Generate builtin.o based on initramfs_data.o
 obj-$(CONFIG_BLK_DEV_INITRD) := initramfs_data.o
-- 
2.22.0.657.g960e92d24f-goog



[PATCH] writeback: fix inode cgroup switching comment

2019-03-04 Thread Greg Thelen
Commit 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates") refers to
inode_switch_wb_work_fn() which never got merged.  Switch the comments
to inode_switch_wbs_work_fn().

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and 
use it for stat updates")
Signed-off-by: Greg Thelen 
---
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c28a47cbe355..f9b029180241 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -365,7 +365,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, struct 
wb_lock_cookie *cookie)
rcu_read_lock();
 
/*
-* Paired with store_release in inode_switch_wb_work_fn() and
+* Paired with store_release in inode_switch_wbs_work_fn() and
 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 */
cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd423fec8d83..08f26046233e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2091,7 +2091,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
  * I_WB_SWITCH Cgroup bdi_writeback switching in progress.  Used to
  * synchronize competing switching instances and to tell
  * wb stat updates to grab the i_pages lock.  See
- * inode_switch_wb_work_fn() for details.
+ * inode_switch_wbs_work_fn() for details.
  *
  * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
  * and work dirs among overlayfs mounts.
-- 
2.21.0.352.gf09ad66450-goog



[PATCH] writeback: sum memcg dirty counters as needed

2019-03-07 Thread Greg Thelen
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.
Thus readers such as balance_dirty_pages() may see a nontrivial error
margin: 32 pages per cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount of space available for the file lru that matters.
If a container has memory.max-200MiB of non-reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
cpu_set_t affinity;

CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
int i, wrote;

set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
  }

  int main(int argc, char **argv)
  {
int cpu, flush_cpu = 1, output_fd;
const char *output;

if (argc != 2)
errx(1, "usage: output_file");

output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");

output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);

for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}

set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters when a memcg is close to the
throttling/writeback threshold.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to include exact dirty and writeback
counters in memcg oom reports.  But that is saved for later.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h | 33 +
 mm/memcontrol.c| 26 --
 mm/page-writeback.c| 27 +--
 3 fi

[PATCH] trace: fix SKIP_STACK_VALIDATION=1 build

2018-06-08 Thread Greg Thelen
Non gcc-5 builds with CONFIG_STACK_VALIDATION=y and
SKIP_STACK_VALIDATION=1 fail.
Example output:
  /bin/sh: init/.tmp_main.o: Permission denied

commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
added a mismatched endif.  This causes cmd_objtool to get mistakenly
set.

Relocate endif to balance the newly added -record-mcount check.

Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
Signed-off-by: Greg Thelen 
---
 scripts/Makefile.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 34d9e9ce97c2..e7889f486ca1 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -239,6 +239,7 @@ cmd_record_mcount = 
\
 "$(CC_FLAGS_FTRACE)" ]; then   \
$(sub_cmd_record_mcount)\
fi;
+endif # -record-mcount
 endif # CONFIG_FTRACE_MCOUNT_RECORD
 
 ifdef CONFIG_STACK_VALIDATION
@@ -263,7 +264,6 @@ ifneq ($(RETPOLINE_CFLAGS),)
   objtool_args += --retpoline
 endif
 endif
-endif
 
 
 ifdef CONFIG_MODVERSIONS
-- 
2.18.0.rc1.242.g61856ae69a-goog



[PATCH] RDMA/ucma: reject AF_IB ip multicast requests

2018-03-29 Thread Greg Thelen
syzbot discovered that ucma_join_ip_multicast() mishandles AF_IB request
addresses.  If an RDMA_USER_CM_CMD_JOIN_IP_MCAST request has
cmd.addr.sa_family=AF_IB then ucma_join_ip_multicast() reads beyond the
end of its cmd.addr.

Reject non IP RDMA_USER_CM_CMD_JOIN_IP_MCAST requests.
RDMA_USER_CM_CMD_JOIN_MCAST is the interface for AF_IB multicast.

And add a buffer length safety check.

Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when joining 
multicast")
Signed-off-by: Greg Thelen 
---
 drivers/infiniband/core/ucma.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index e5a1e7d81326..e410e03940ff 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1423,11 +1423,19 @@ static ssize_t ucma_join_ip_multicast(struct ucma_file 
*file,
if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
return -EFAULT;
 
+   switch (cmd.addr.sin6_family) {
+   case AF_INET:
+   case AF_INET6:
+   break;
+   default:
+   return -EINVAL;
+   }
+
join_cmd.response = cmd.response;
join_cmd.uid = cmd.uid;
join_cmd.id = cmd.id;
join_cmd.addr_size = rdma_addr_size((struct sockaddr *) &cmd.addr);
-   if (!join_cmd.addr_size)
+   if (!join_cmd.addr_size || join_cmd.addr_size > sizeof(cmd.addr))
return -EINVAL;
 
join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER;
-- 
2.17.0.rc1.321.gba9d0f2565-goog



Re: [PATCH] RDMA/ucma: reject AF_IB ip multicast requests

2018-03-30 Thread Greg Thelen
On Thu, Mar 29, 2018 at 9:24 PM, Greg Thelen  wrote:
> syzbot discovered that ucma_join_ip_multicast() mishandles AF_IB request
> addresses.  If an RDMA_USER_CM_CMD_JOIN_IP_MCAST request has
> cmd.addr.sa_family=AF_IB then ucma_join_ip_multicast() reads beyond the
> end of its cmd.addr.
>
> Reject non IP RDMA_USER_CM_CMD_JOIN_IP_MCAST requests.
> RDMA_USER_CM_CMD_JOIN_MCAST is interface for AF_IB multicast.
>
> And add a buffer length safety check.
>
> Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when 
> joining multicast")
> Signed-off-by: Greg Thelen 
> ---
>  drivers/infiniband/core/ucma.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)

My patch is no longer needed.  linus/master recently picked up
84652aefb347 ("RDMA/ucma: Introduce safer rdma_addr_size() variants")
which fixes the same issue.

> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> index e5a1e7d81326..e410e03940ff 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> @@ -1423,11 +1423,19 @@ static ssize_t ucma_join_ip_multicast(struct 
> ucma_file *file,
> if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
> return -EFAULT;
>
> +   switch (cmd.addr.sin6_family) {
> +   case AF_INET:
> +   case AF_INET6:
> +   break;
> +   default:
> +   return -EINVAL;
> +   }
> +
> join_cmd.response = cmd.response;
> join_cmd.uid = cmd.uid;
> join_cmd.id = cmd.id;
> join_cmd.addr_size = rdma_addr_size((struct sockaddr *) &cmd.addr);
> -   if (!join_cmd.addr_size)
> +   if (!join_cmd.addr_size || join_cmd.addr_size > sizeof(cmd.addr))
> return -EINVAL;
>
> join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER;
> --
> 2.17.0.rc1.321.gba9d0f2565-goog
>


[RFC PATCH 0/2] memory.low,min reclaim

2018-04-22 Thread Greg Thelen
   => 0k   
  # F: 4203964k => 4203908k
  # F/g  : 2102016k => 2101992k
  # F/H  : 2101948k => 2101916k
  # i: 4204408k => 4203988k
  # i/j  : 2101984k => 2101936k
  # i/K  : 2102424k => 2102052k
  # L: 4189960k => 3886296k
  # L/m  : 2101968k => 2838704k
  # L/N  : 2087992k => 1047592k

  set -o errexit
  set -o nounset
  set -o pipefail
  
  LIM="$1"; shift
  ANTAGONIST="$1"; shift
  CGPATH=/tmp/cgroup
  
  vmtouch2() {
rm -f "$2"
(echo $BASHPID > "${CGPATH}/$1/cgroup.procs" && exec /tmp/mmap --loop 1 --file "$2" "$3")
  }
  
  vmtouch() {
# twice to ensure slab caches are warmed up and all objs are charged to cgroup.
vmtouch2 "$1" "$2" "$3"
vmtouch2 "$1" "$2" "$3"
  }
  
  dump() {
for i in A A/b A/C A/D A/e F F/g F/H i i/j i/K L L/m L/N; do
  printf "%-5s: %sk\n" $i $(($(cat "${CGPATH}/${i}/memory.current")/1024))
done
  }
  
  rm -f /file_?
  if [[ -e "${CGPATH}/A" ]]; then
rmdir ${CGPATH}/?/? ${CGPATH}/?
  fi
  echo "+memory" > "${CGPATH}/cgroup.subtree_control"
  mkdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/F/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/i/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/L/cgroup.subtree_control"
  mkdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  mkdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  mkdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  mkdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  
  echo 2G > "${CGPATH}/A/memory.${LIM}"
  echo 3G > "${CGPATH}/A/b/memory.${LIM}"
  echo 1G > "${CGPATH}/A/C/memory.${LIM}"
  echo 0 > "${CGPATH}/A/D/memory.${LIM}"
  echo 10G > "${CGPATH}/A/e/memory.${LIM}"
  
  echo 2G > "${CGPATH}/F/memory.${LIM}"
  echo 3G > "${CGPATH}/F/g/memory.${LIM}"
  echo 1G > "${CGPATH}/F/H/memory.${LIM}"
  
  echo 5G > "${CGPATH}/i/memory.${LIM}"
  echo 3G > "${CGPATH}/i/j/memory.${LIM}"
  echo 1G > "${CGPATH}/i/K/memory.${LIM}"
  
  echo 2G > "${CGPATH}/L/memory.${LIM}"
  echo 4G > "${CGPATH}/L/memory.max"
  echo 3G > "${CGPATH}/L/m/memory.${LIM}"
  echo 1G > "${CGPATH}/L/N/memory.${LIM}"
  
  vmtouch A/b /file_b 2G
  vmtouch A/C /file_C 2G
  vmtouch A/D /file_D 2G
  
  vmtouch F/g /file_g 2G
  vmtouch F/H /file_H 2G
  
  vmtouch i/j /file_j 2G
  vmtouch i/K /file_K 2G
  
  vmtouch L/m /file_m 2G
  vmtouch L/N /file_N 2G
  
  vmtouch2 "$ANTAGONIST" /file_ant 150G
  echo
  echo "after $ANTAGONIST antagonist"
  dump
  
  rmdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  rmdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  rmdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  rmdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  rmdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  rm /file_ant /file_b /file_C /file_D /file_g /file_H /file_j /file_K

Greg Thelen (2):
  memcg: fix memory.low
  memcg: add memory.min

 include/linux/memcontrol.h |  8 +
 mm/memcontrol.c| 70 ++
 mm/vmscan.c|  3 ++
 3 files changed, 75 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog



[RFC PATCH 1/2] memcg: fix memory.low

2018-04-22 Thread Greg Thelen
When targeting reclaim to a memcg, protect that memcg from reclaim if
memory consumption of any level is below its respective memory.low.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..9668f620203a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5341,8 +5341,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg, or any of its ancestors
+ * up to (but not including) @root, is below the normal range.
  *
  * @root is exclusive; it is never low when looked at directly and isn't
  * checked when traversing the hierarchy.
@@ -5379,12 +5379,12 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct 
mem_cgroup *memcg)
if (memcg == root)
return false;
 
+   /* If any level is under, then protect @memcg from reclaim */
for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-   if (page_counter_read(&memcg->memory) >= memcg->low)
-   return false;
+   if (page_counter_read(&memcg->memory) <= memcg->low)
+   return true; /* protect from reclaim */
}
-
-   return true;
+   return false; /* not protected from reclaim */
 }
 
 /**
-- 
2.17.0.484.g0c8726318c-goog



[RFC PATCH 2/2] memcg: add memory.min

2018-04-22 Thread Greg Thelen
The new memory.min limit is similar to memory.low, but there is no
bypassing it when reclaim is desperate.  Prefer oom kills before
reclaiming memory below memory.min.  Sharing more code with
mem_cgroup_low() is possible, but the idea is posted here for
simplicity.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h |  8 ++
 mm/memcontrol.c| 58 ++
 mm/vmscan.c|  3 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..22bb4a88653a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,6 +178,7 @@ struct mem_cgroup {
struct page_counter tcpmem;
 
/* Normal memory consumption range */
+   unsigned long min;
unsigned long low;
unsigned long high;
 
@@ -281,6 +282,7 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg);
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
@@ -726,6 +728,12 @@ static inline void mem_cgroup_event(struct mem_cgroup 
*memcg,
 {
 }
 
+static inline bool mem_cgroup_min(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+   return false;
+}
+
 static inline bool mem_cgroup_low(struct mem_cgroup *root,
  struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9668f620203a..b2aaed4003b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5058,6 +5058,36 @@ static u64 memory_current_read(struct 
cgroup_subsys_state *css,
return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
+static int memory_min_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+   unsigned long min = READ_ONCE(memcg->min);
+
+   if (min == PAGE_COUNTER_MAX)
+   seq_puts(m, "max\n");
+   else
+   seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+   return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long min;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", &min);
+   if (err)
+   return err;
+
+   memcg->min = min;
+
+   return nbytes;
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5288,6 +5318,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_current_read,
},
+   {
+   .name = "min",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_min_show,
+   .write = memory_min_write,
+   },
{
.name = "low",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -5336,6 +5372,28 @@ struct cgroup_subsys memory_cgrp_subsys = {
.early_init = 0,
 };
 
+/**
+ * mem_cgroup_min returns true for a memcg below its min limit.  Such memcgs
+ * are exempt from reclaim.
+ */
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+   if (mem_cgroup_disabled())
+   return false;
+
+   if (!root)
+   root = root_mem_cgroup;
+
+   if (memcg == root)
+   return false;
+
+   for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
+   if (page_counter_read(&memcg->memory) <= memcg->min)
+   return true; /* protect */
+   }
+   return false; /* !protect */
+}
+
 /**
  * mem_cgroup_low - check if memory consumption is below the normal range
  * @root: the top ancestor of the sub-tree being checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..15ae19a38ad5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,6 +2539,9 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
 
+   if (mem_cgroup_min(root, memcg))
+   continue;
+
if (mem_cgroup_low(root, memcg)) {
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
-- 
2.17.0.484.g0c8726318c-goog



Re: [RFC PATCH 0/2] memory.low,min reclaim

2018-04-23 Thread Greg Thelen
On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin  wrote:

> Hi, Greg!

> On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > Roman's previously posted memory.low,min patches add per memcg effective
> > low limit to detect overcommitment of parental limits.  But if we flip
> > low,min reclaim to bail if usage<{low,min} at any level, then we don't need
> > an effective low limit, which makes the code simpler.  When parent limits
> > are overcommited memory.min will oom kill, which is more drastic but makes
> > the memory.low a simpler concept.  If memcg a/b wants oom kill before
> > reclaim, then give it to them.  It seems a bit strange for a/b/memory.low's
> > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
> > a/b.low+a/c.low exceed a.low).

> It's actually not strange: a/b and a/c are sharing a common resource:
> a/memory.low.

> Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> If there are sibling cgroups which are consuming memory, a cgroup can't
> exceed parent's memory.max, even if its memory.max is greater.

> >
> > I think there might be a simpler way (albeit it doesn't yet include
> > Documentation):
> > - memcg: fix memory.low
> > - memcg: add memory.min
> >  3 files changed, 75 insertions(+), 6 deletions(-)
> >
> > The idea of this alternate approach is for memory.low,min to avoid reclaim
> > if any portion of under-consideration memcg ancestry is under respective
> > limit.

> This approach has a significant downside: it breaks hierarchical constraints
> for memory.low/min. There are two important outcomes:

> 1) Any leaf's memory.low/min value is respected, even if parent's value
>   is lower or even 0. It's not possible anymore to limit the amount of
>   protected memory for a sub-tree.
>   This is especially bad in case of delegation.

As someone who has been using something like memory.min for a while, I have
cases where it needs to be a strong protection.  Such jobs prefer oom kill
to reclaim.  These jobs know they need X MB of memory.  But I guess it's on
me to avoid configuring machines which overcommit memory.min at such cgroup
levels all the way to the root.

> 2) If a cgroup has an ancestor with the usage under its memory.low/min,
>   it becomes protection, even if its memory.low/min is 0. So it becomes
>   impossible to have unprotected cgroups in protected sub-tree.

Fair point.

One use case is a non-trivial job which has several memory accounting
subcontainers.  Is there a way to only set memory.low at the top and have
that offer protection to the job?
The case I'm thinking of is:
% cd /cgroup
% echo +memory > cgroup.subtree_control
% mkdir top
% echo +memory > top/cgroup.subtree_control
% mkdir top/part1 top/part2
% echo 1GB > top/memory.min
% (echo $BASHPID > top/part1/cgroup.procs && part1)
% (echo $BASHPID > top/part2/cgroup.procs && part2)

Empirically it's been measured that the entire workload (/top) needs 1GB to
perform well.  But we don't care how the memory is distributed between
part1 and part2.  Is the strategy for that to set /top.min, /top/part1.min, and
/top/part2.min to 1GB?

What do you think about exposing emin and elow to user space?  I think that
would reduce admin/user confusion in situations where memory.min is
internally discounted.

(tangent) Delegation in v2 isn't something I've been able to fully
internalize yet.
The "no interior processes" rule challenges my notion of subdelegation.
My current model is where a system controller creates a container C with
C.min and then starts client manager process M in C.  Then M can choose
to further divide C's resources (e.g. C/S).  This doesn't seem possible
because v2 doesn't allow for interior processes.  So the system manager
would need to create C, set C.low, create C/sub_manager, create
C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start
M in C/sub_manager.  Then sub_manager can create and manage
C/sub_resources/S.

PS: Thanks for the memory.low and memory.min work.  Regardless of how we
proceed it's better than the upstream memory.soft_limit_in_bytes!


[PATCH] memcg: mark memcg1_events static const

2018-05-03 Thread Greg Thelen
Mark memcg1_events static: it's only used by memcontrol.c.
And mark it const: it's not modified.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..c9c7e5ea0e2f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3083,7 +3083,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void 
*v)
 #endif /* CONFIG_NUMA */
 
 /* Universal VM events cgroup1 shows, original sort order */
-unsigned int memcg1_events[] = {
+static const unsigned int memcg1_events[] = {
PGPGIN,
PGPGOUT,
PGFAULT,
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH] IB: remove redundant INFINIBAND kconfig dependencies

2018-05-03 Thread Greg Thelen
INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So there's no need for
options which depend on INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.

Signed-off-by: Greg Thelen 
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 drivers/nvme/host/Kconfig   | 2 +-
 drivers/nvme/target/Kconfig | 2 +-
 drivers/staging/lustre/lnet/Kconfig | 2 +-
 fs/cifs/Kconfig | 2 +-
 net/9p/Kconfig  | 2 +-
 net/rds/Kconfig | 2 +-
 net/sunrpc/Kconfig  | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig 
b/drivers/infiniband/ulp/srpt/Kconfig
index fb8b7182f05e..25bf6955b6d0 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
+   depends on INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..dbb7464c018c 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
+   depends on INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 3c7b61ddb0d1..7595664ee753 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
diff --git a/drivers/staging/lustre/lnet/Kconfig 
b/drivers/staging/lustre/lnet/Kconfig
index ad049e6f24e4..f3b1ad4bd3dc 100644
--- a/drivers/staging/lustre/lnet/Kconfig
+++ b/drivers/staging/lustre/lnet/Kconfig
@@ -34,7 +34,7 @@ config LNET_SELFTEST
 
 config LNET_XPRT_IB
tristate "LNET infiniband support"
-   depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on LNET && PCI && INFINIBAND_ADDR_TRANS
default LNET && INFINIBAND
help
  This option allows the LNET users to use infiniband as an
diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 5f132d59dfc2..d61e2de8d0eb 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
+   depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index e6014e0e51f7..46c39f7da444 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -32,7 +32,7 @@ config NET_9P_XEN
 
 
 config NET_9P_RDMA
-   depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on INET && INFINIBAND_ADDR_TRANS
tristate "9P RDMA Transport (Experimental)"
help
  This builds support for an RDMA transport.
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index bffde4b46c5d..1a31502ee7db 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -8,7 +8,7 @@ config RDS
 
 config RDS_RDMA
tristate "RDS over Infiniband"
-   depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on RDS && INFINIBAND_ADDR_TRANS
---help---
  Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index ac09ca803296..6358e5271070 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -50,7 +50,7 @@ config SUNRPC_DEBUG
 
 config SUNRPC_XPRT_RDMA
tristate "RPC-over-RDMA transport"
-   depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on SUNRPC && INFINIBAND_ADDR_TRANS
default SUNRPC && INFINIBAND
select SG_POOL
help
-- 
2.17.0.441.gb46fe60e1d-goog



Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-05-03 Thread Greg Thelen
On Tue, May 1, 2018 at 1:48 PM Jason Gunthorpe  wrote:

> On Tue, May 01, 2018 at 03:08:57AM +0000, Greg Thelen wrote:
> > On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe  wrote:
> >
> > > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> > > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
> > > > So declare the kconfig dependency.  This is necessary to allow for
> > > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> > > >
> > > > Signed-off-by: Greg Thelen 
> > > > Cc: Tarick Bedeir 
> > > >  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig
> > b/drivers/infiniband/ulp/srpt/Kconfig
> > > > index 31ee83d528d9..fb8b7182f05e 100644
> > > > +++ b/drivers/infiniband/ulp/srpt/Kconfig
> > > > @@ -1,6 +1,6 @@
> > > >  config INFINIBAND_SRPT
> > > >   tristate "InfiniBand SCSI RDMA Protocol target support"
> > > > - depends on INFINIBAND && TARGET_CORE
> > > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
> >
> > > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
> > > INFINIBAND_ADDR_TRANS without INFINIBAND.
> >
> > By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So yes, it seems
> > redundant.  I don't know if anyone has designs to break this dependency and
> > allow for ADDR_TRANS without INFINIBAND.  Assuming not, I'd be willing to
> > amend my series removing redundant INFINIBAND and a followup series to
> > remove it from similar depends.  Though I'm not familiar with rdma dev tree
> > lifecycle.  Is rdma/for-rc a throw away branch (akin to linux-next), or
> > will it be merged into linus/master?  If throwaway, then we can amend
> > its patches, otherwise followups will be needed.
> >
> > Let me know what you'd prefer.  Thanks.

> I think a single update patch to fix all of these, and the
> pre-existing ones would be great

I just posted the cleanup: https://lkml.org/lkml/2018/5/3/1045


[PATCH v2] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-13 Thread Greg Thelen
Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
Change-Id: I6fbbf8a432e467710fa65e4904b7d61880b914e5
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog



Re: [PATCH v2] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-14 Thread Greg Thelen
On Sat, Apr 14, 2018 at 8:13 AM Dennis Dalessandro <
dennis.dalessan...@intel.com> wrote:

> On 4/13/2018 1:27 PM, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS.
> >
> > Signed-off-by: Greg Thelen 
> > Cc: Tarick Bedeir 
> > Change-Id: I6fbbf8a432e467710fa65e4904b7d61880b914e5

> Forgot to remove the Gerrit thing.

> -Denny

Ack.  My bad.  Will repost.  Unfortunately checkpatch didn't notice.


[PATCH v3] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-14 Thread Greg Thelen
Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog



Re: [PATCH] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-15 Thread Greg Thelen
On Sun, Apr 15, 2018 at 5:06 AM Christoph Hellwig  wrote:

> On Fri, Apr 13, 2018 at 12:06:44AM -0700, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

> Why?  We are pushing everyone heavily to use RDMA/CM, so making it
> optional seems rather counter-intuitive.

Bugs.  We don't currently use CM.  But fuzz testing has been finding a
fair number of CM bugs, so we were thinking of disabling it until we need it.

> You'll also have to fix tons of ULPs to explicitly depend on
> INFINIBAND_ADDR_TRANS, or make code conditional on it.

I think I've identified the set of options which use
INFINIBAND_ADDR_TRANS without a kconfig depends:
* CIFS_SMB_DIRECT
* INFINIBAND_SRPT
* NVME_RDMA
* NVME_TARGET_RDMA

I have patches for the above, but need to finish the commit logs.  Let me
know if they'll be nacked and I'll just patch my kernel and forget
upstreaming.


[PATCH 1/2] kasan: drain quarantine of memcg slab objects

2016-12-20 Thread Greg Thelen
Per memcg slab accounting and kasan have a problem with kmem_cache
destruction.
- kmem_cache_create() allocates a kmem_cache, which is used for
  allocations from processes running in root (top) memcg.
- Processes running in non root memcg and allocating with either
  __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg kmem_cache.
- Kasan catches use-after-free by having kfree() and kmem_cache_free()
  defer freeing of objects.  Objects are placed in a quarantine.
- kmem_cache_destroy() destroys root and non root kmem_caches.  It takes
  care to drain the quarantine of objects from the root memcg's
  kmem_cache, but ignores objects associated with non root memcg.  This
  causes leaks because quarantined per memcg objects refer to per memcg
  kmem cache being destroyed.

To see the problem:
1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
2) from non root memcg, allocate and free a few objects from cache
3) dispose of the cache with kmem_cache_destroy()
kmem_cache_destroy() will trigger a "Slab cache still has objects"
warning indicating that the per memcg kmem_cache structure was leaked.

Fix the leak by draining kasan quarantined objects allocated from non
root memcg.

Racing memcg deletion is tricky, but handled.  kmem_cache_destroy() =>
shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
flushes per memcg quarantined objects, even if that memcg has been
rmdir'd and gone through memcg_deactivate_kmem_caches().

This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
enabled.  So I don't think it's worth patching stable kernels.

Signed-off-by: Greg Thelen 
---
 include/linux/kasan.h | 4 ++--
 mm/kasan/kasan.c  | 2 +-
 mm/kasan/quarantine.c | 1 +
 mm/slab_common.c  | 4 +++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 820c0ad54a01..c908b25bf5a5 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -52,7 +52,7 @@ void kasan_free_pages(struct page *page, unsigned int order);
 void kasan_cache_create(struct kmem_cache *cache, size_t *size,
unsigned long *flags);
 void kasan_cache_shrink(struct kmem_cache *cache);
-void kasan_cache_destroy(struct kmem_cache *cache);
+void kasan_cache_shutdown(struct kmem_cache *cache);
 
 void kasan_poison_slab(struct page *page);
 void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -98,7 +98,7 @@ static inline void kasan_cache_create(struct kmem_cache 
*cache,
  size_t *size,
  unsigned long *flags) {}
 static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
-static inline void kasan_cache_destroy(struct kmem_cache *cache) {}
+static inline void kasan_cache_shutdown(struct kmem_cache *cache) {}
 
 static inline void kasan_poison_slab(struct page *page) {}
 static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 0e9505f66ec1..8d020ad5b74a 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -428,7 +428,7 @@ void kasan_cache_shrink(struct kmem_cache *cache)
quarantine_remove_cache(cache);
 }
 
-void kasan_cache_destroy(struct kmem_cache *cache)
+void kasan_cache_shutdown(struct kmem_cache *cache)
 {
quarantine_remove_cache(cache);
 }
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index baabaad4a4aa..fb362cb19157 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -273,6 +273,7 @@ static void per_cpu_remove_cache(void *arg)
qlist_free_all(&to_free, cache);
 }
 
+/* Free all quarantined objects belonging to cache. */
 void quarantine_remove_cache(struct kmem_cache *cache)
 {
unsigned long flags;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 329b03843863..d3c8602dea5d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -455,6 +455,9 @@ EXPORT_SYMBOL(kmem_cache_create);
 static int shutdown_cache(struct kmem_cache *s,
struct list_head *release, bool *need_rcu_barrier)
 {
+   /* free asan quarantined objects */
+   kasan_cache_shutdown(s);
+
if (__kmem_cache_shutdown(s) != 0)
return -EBUSY;
 
@@ -715,7 +718,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
get_online_cpus();
get_online_mems();
 
-   kasan_cache_destroy(s);
mutex_lock(&slab_mutex);
 
s->refcount--;
-- 
2.8.0.rc3.226.g39d4020



[PATCH 2/2] kasan: add memcg kmem_cache test

2016-12-20 Thread Greg Thelen
Make a kasan test which uses a SLAB_ACCOUNT slab cache.  If the test is
run within a non default memcg, then it uncovers the bug fixed by
"kasan: drain quarantine of memcg slab objects"[1].

If run without fix [1] it shows "Slab cache still has objects", and the
kmem_cache structure is leaked.
Here's an unpatched kernel test:
$ dmesg -c > /dev/null
$ mkdir /sys/fs/cgroup/memory/test
$ echo $$ > /sys/fs/cgroup/memory/test/tasks
$ modprobe test_kasan 2> /dev/null
$ dmesg | grep -B1 still
[ 123.456789] kasan test: memcg_accounted_kmem_cache allocate memcg accounted 
object
[ 124.456789] kmem_cache_destroy test_cache: Slab cache still has objects

Kernels with fix [1] don't have the "Slab cache still has objects"
warning or the underlying leak.

The new test runs and passes in the default (root) memcg, though in the
root memcg it won't uncover the problem fixed by [1].

Signed-off-by: Greg Thelen 
---
 lib/test_kasan.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index fbdf87920093..0b1d3140fbb8 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -11,6 +11,7 @@
 
 #define pr_fmt(fmt) "kasan test: %s " fmt, __func__
 
+#include 
 #include 
 #include 
 #include 
@@ -331,6 +332,38 @@ static noinline void __init kmem_cache_oob(void)
kmem_cache_destroy(cache);
 }
 
+static noinline void __init memcg_accounted_kmem_cache(void)
+{
+   int i;
+   char *p;
+   size_t size = 200;
+   struct kmem_cache *cache;
+
+   cache = kmem_cache_create("test_cache", size, 0, SLAB_ACCOUNT, NULL);
+   if (!cache) {
+   pr_err("Cache allocation failed\n");
+   return;
+   }
+
+   pr_info("allocate memcg accounted object\n");
+   /*
+* Several allocations with a delay to allow for lazy per memcg kmem
+* cache creation.
+*/
+   for (i = 0; i < 5; i++) {
+   p = kmem_cache_alloc(cache, GFP_KERNEL);
+   if (!p) {
+   pr_err("Allocation failed\n");
+   goto free_cache;
+   }
+   kmem_cache_free(cache, p);
+   msleep(100);
+   }
+
+free_cache:
+   kmem_cache_destroy(cache);
+}
+
 static char global_array[10];
 
 static noinline void __init kasan_global_oob(void)
@@ -460,6 +493,7 @@ static int __init kmalloc_tests_init(void)
kmalloc_uaf_memset();
kmalloc_uaf2();
kmem_cache_oob();
+   memcg_accounted_kmem_cache();
kasan_stack_oob();
kasan_global_oob();
ksize_unpoisons_memory();
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-25 Thread Greg Thelen

Michal Hocko wrote:

> On Tue 22-09-15 15:16:32, Greg Thelen wrote:
>> mem_cgroup_read_stat() returns a page count by summing per cpu page
>> counters.  The summing is racy wrt. updates, so a transient negative sum
>> is possible.  Callers don't want negative values:
>> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
>
> OK, this can confuse dirty throttling AFAIU
>
>> - oom reports and memory.stat shouldn't show confusing negative usage.
>
> I guess this is not earth shattering.
>
>> - tree_usage() already avoids negatives.
>> 
>> Avoid returning negative page counts from mem_cgroup_read_stat() and
>> convert it to unsigned.
>> 
>> Signed-off-by: Greg Thelen 
>
> I guess we want that for stable 4.2 because of the dirty throttling
> part. Longterm we should use generic per-cpu counter.
>
> Acked-by: Michal Hocko 
>
> Thanks!

Correct, this is not an earth shattering patch.  The patch only filters
out negative memcg stat values from mem_cgroup_read_stat() callers.
Negative values should only be temporary due to stat update races.  So
I'm not sure it's worth sending it to stable.  I've heard no reports of
it troubling anyone.  The worst case without this patch is that memcg
temporarily burps up a negative dirty and/or writeback count which
causes balance_dirty_pages() to sleep for a (at most) 200ms nap
(MAX_PAUSE).  Ccing Tejun in case there are more serious consequences to
balance_dirty_pages() occasionally seeing a massive (underflowed) dirty
or writeback count.


Re: 4.3-rc1 dirty page count underflow (cgroup-related?)

2015-09-17 Thread Greg Thelen

Dave Hansen wrote:

> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
> 14.04.3 system.  The system will run fine for a few hours, but suddenly
> start becoming horribly I/O bound.  A compile of perf for instance takes
> 20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
> only seeing tens or hundreds of KB/s of writes.
>
> Looking at some writeback tracepoints shows it hitting
> balance_dirty_pages() pretty hard with a pretty large number of dirty
> pages. :)
>
>>   ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>> cgroup=/user/1000.user/c2.session
>
> So something is underflowing dirty.
>
> I added the attached patch and got a warning pretty quickly, so this
> looks pretty reproducible for me.
>
> I'm not 100% sure this is from the 4.3 merge window.  I was running the
> 4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
> automatically creating some cgroups, so they're definitely in play here.
>
> Any ideas what is going on?
>
>> [   12.415472] [ cut here ]
>> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 
>> account_page_cleaned+0x101/0x110()
>> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
>> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 
>> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack 
>> nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp 
>> bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables 
>> dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi 
>> iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic 
>> bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core 
>> intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp 
>> ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event 
>> snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid 
>> lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd 
>> kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 
>> 4.3.0-rc1-dirty #25
>> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 
>> 04/09/2015
>> [   12.415542]  81aa8172 8800c926ba30 8132e3e2 
>> 8800c926ba78
>> [   12.415544]  8800c926ba68 8105e386 ea000fca4700 
>> 880409b96420
>> [   12.415547]  8803ef979490 880403ff4800  
>> 8800c926bac8
>> [   12.415550] Call Trace:
>> [   12.41]  [] dump_stack+0x4b/0x69
>> [   12.415560]  [] warn_slowpath_common+0x86/0xc0
>> [   12.415563]  [] warn_slowpath_fmt+0x4c/0x50
>> [   12.415566]  [] account_page_cleaned+0x101/0x110
>> [   12.415568]  [] cancel_dirty_page+0xbd/0xf0
>> [   12.415571]  [] try_to_free_buffers+0x94/0xb0
>> [   12.415575]  [] 
>> jbd2_journal_try_to_free_buffers+0x100/0x130
>> [   12.415578]  [] ext4_releasepage+0x52/0xa0
>> [   12.415582]  [] try_to_release_page+0x35/0x50
>> [   12.415585]  [] block_invalidatepage+0x113/0x130
>> [   12.415587]  [] ext4_invalidatepage+0x5e/0xb0
>> [   12.415590]  [] ext4_da_invalidatepage+0x40/0x310
>> [   12.415593]  [] truncate_inode_page+0x83/0x90
>> [   12.415595]  [] truncate_inode_pages_range+0x199/0x730
>> [   12.415598]  [] ? __wake_up+0x44/0x50
>> [   12.415600]  [] ? jbd2_journal_stop+0x1ba/0x3b0
>> [   12.415603]  [] ? ext4_unlink+0x2f4/0x330
>> [   12.415607]  [] ? __inode_wait_for_writeback+0x6d/0xc0
>> [   12.415609]  [] truncate_inode_pages_final+0x4c/0x60
>> [   12.415612]  [] ext4_evict_inode+0x116/0x4c0
>> [   12.415615]  [] evict+0xbc/0x190
>> [   12.415617]  [] iput+0x17d/0x1e0
>> [   12.415620]  [] do_unlinkat+0x1ab/0x2b0
>> [   12.415622]  [] SyS_unlink+0x16/0x20
>> [   12.415626]  [] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [   12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>> [   12.418211] [ cut here ]

I'm not denying the issue, but the WARNING splat isn't necessarily
catching a problem.  The corresponding code comes from your debug patch:
+   WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");

This only checks a single cpu's counter, which can be negative.  The sum
of all counters is what matters.
Imagine:
cpu1) dirty page: inc
cpu2) clean page: dec
The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
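
To make the race concrete, here is a toy userspace illustration (not kernel
code): two per-cpu counters track one logical count, and a single cpu's value
can go negative even though the total never does.

/* toy model of split per-cpu accounting; builds with any C compiler */
#include <stdio.h>

int main(void)
{
        long cpu_count[2] = { 0, 0 };

        cpu_count[0]++;         /* cpu0: a page becomes dirty */
        cpu_count[1]--;         /* cpu1: the same page is later cleaned */

        printf("cpu0=%ld cpu1=%ld sum=%ld\n",
               cpu_count[0], cpu_count[1], cpu_count[0] + cpu_count[1]);
        /* prints: cpu0=1 cpu1=-1 sum=0 */
        return 0;
}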

I'll look at the code and also see if I can reproduce the failure using
mem_cgroup_read_stat() for all of the new WARNs.

Did you notice if the global /proc/meminfo:Dirty count also underflowed?

Re: 4.3-rc1 dirty page count underflow (cgroup-related?)

2015-09-18 Thread Greg Thelen

Greg Thelen wrote:

> Dave Hansen wrote:
>
>> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
>> 14.04.3 system.  The system will run fine for a few hours, but suddenly
>> start becoming horribly I/O bound.  A compile of perf for instance takes
>> 20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
>> only seeing tens or hundreds of KB/s of writes.
>>
>> Looking at some writeback tracepoints shows it hitting
>> balance_dirty_pages() pretty hard with a pretty large number of dirty
>> pages. :)
>>
>>>   ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>>> cgroup=/user/1000.user/c2.session
>>
>> So something is underflowing dirty.
>>
>> I added the attached patch and got a warning pretty quickly, so this
>> looks pretty reproducible for me.
>>
>> I'm not 100% sure this is from the 4.3 merge window.  I was running the
>> 4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
>> automatically creating some cgroups, so they're definitely in play here.
>>
>> Any ideas what is going on?
>>
>>> [   12.415472] [ cut here ]
>>> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 
>>> account_page_cleaned+0x101/0x110()
>>> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
>>> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 
>>> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 
>>> xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM 
>>> iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables 
>>> ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm 
>>> mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel 
>>> snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel 
>>> snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi 
>>> hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp 
>>> ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event 
>>> snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore 
>>> mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper 
>>> cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>>> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 
>>> 4.3.0-rc1-dirty #25
>>> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 
>>> 04/09/2015
>>> [   12.415542]  81aa8172 8800c926ba30 8132e3e2 
>>> 8800c926ba78
>>> [   12.415544]  8800c926ba68 8105e386 ea000fca4700 
>>> 880409b96420
>>> [   12.415547]  8803ef979490 880403ff4800  
>>> 8800c926bac8
>>> [   12.415550] Call Trace:
>>> [   12.41]  [] dump_stack+0x4b/0x69
>>> [   12.415560]  [] warn_slowpath_common+0x86/0xc0
>>> [   12.415563]  [] warn_slowpath_fmt+0x4c/0x50
>>> [   12.415566]  [] account_page_cleaned+0x101/0x110
>>> [   12.415568]  [] cancel_dirty_page+0xbd/0xf0
>>> [   12.415571]  [] try_to_free_buffers+0x94/0xb0
>>> [   12.415575]  [] 
>>> jbd2_journal_try_to_free_buffers+0x100/0x130
>>> [   12.415578]  [] ext4_releasepage+0x52/0xa0
>>> [   12.415582]  [] try_to_release_page+0x35/0x50
>>> [   12.415585]  [] block_invalidatepage+0x113/0x130
>>> [   12.415587]  [] ext4_invalidatepage+0x5e/0xb0
>>> [   12.415590]  [] ext4_da_invalidatepage+0x40/0x310
>>> [   12.415593]  [] truncate_inode_page+0x83/0x90
>>> [   12.415595]  [] truncate_inode_pages_range+0x199/0x730
>>> [   12.415598]  [] ? __wake_up+0x44/0x50
>>> [   12.415600]  [] ? jbd2_journal_stop+0x1ba/0x3b0
>>> [   12.415603]  [] ? ext4_unlink+0x2f4/0x330
>>> [   12.415607]  [] ? __inode_wait_for_writeback+0x6d/0xc0
>>> [   12.415609]  [] truncate_inode_pages_final+0x4c/0x60
>>> [   12.415612]  [] ext4_evict_inode+0x116/0x4c0
>>> [   12.415615]  [] evict+0xbc/0x190
>>> [   12.415617]  [] iput+0x17d/0x1e0
>>> [   12.415620]  [] do_unlinkat+0x1ab/0x2b0
>>> [   12.415622]  [] SyS_unlink+0x16/0x20
>>> [   12.415626]  [] entry_SYSCALL_64_fastpath+0x12/0x6a
>>> [   12.415628] ---[ end trace 6fba1ddd3d240e1

Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-22 Thread Greg Thelen
Andrew Morton wrote:

> On Tue, 22 Sep 2015 15:16:32 -0700 Greg Thelen  wrote:
>
>> mem_cgroup_read_stat() returns a page count by summing per cpu page
>> counters.  The summing is racy wrt. updates, so a transient negative sum
>> is possible.  Callers don't want negative values:
>> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
>> - oom reports and memory.stat shouldn't show confusing negative usage.
>> - tree_usage() already avoids negatives.
>>
>> Avoid returning negative page counts from mem_cgroup_read_stat() and
>> convert it to unsigned.
>
> Someone please remind me why this code doesn't use the existing
> percpu_counter library which solved this problem years ago.
>
>>   for_each_possible_cpu(cpu)
>
> and which doesn't iterate across offlined CPUs.

I found [1] and [2] discussing memory layout differences between:
a) existing memcg hand rolled per cpu arrays of counters
vs
b) array of generic percpu_counter
The current approach was claimed to have lower memory overhead and
better cache behavior.

I assume it's pretty straightforward to create generic
percpu_counter_array routines which memcg could use.  Possibly something
like this could be made general enough to satisfy vmstat, but that's
less clear.

[1] http://www.spinics.net/lists/cgroups/msg06216.html
[2] https://lkml.org/lkml/2014/9/11/1057
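
For reference, a rough sketch of the two layouts being compared; the names
(NR_STATS, stats_hand_rolled, stats_generic) are illustrative, not the actual
memcg structures:

#include <linux/percpu.h>
#include <linux/percpu_counter.h>

#define NR_STATS 12     /* stand-in for MEM_CGROUP_STAT_NSTATS */

/*
 * Hand-rolled style (what memcg does today): one per-cpu block holds all
 * counters, so a single this_cpu pointer covers every index and a memcg's
 * per-cpu data stays contiguous.
 */
struct stat_cpu {
        long count[NR_STATS];
};

struct stats_hand_rolled {
        struct stat_cpu __percpu *stat;
};

/*
 * Generic style: an array of percpu_counter means NR_STATS separate per-cpu
 * allocations plus a shared count and spinlock per index.
 */
struct stats_generic {
        struct percpu_counter stat[NR_STATS];
};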


[PATCH] memcg: remove pcp_counter_lock

2015-09-22 Thread Greg Thelen
Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate
possible cpus instead of online") removed the last use of the per memcg
pcp_counter_lock but forgot to remove the variable.

Kill the vestigial variable.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h | 1 -
 mm/memcontrol.c| 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e62cb7a..6452ff4c463f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -242,7 +242,6 @@ struct mem_cgroup {
 * percpu counter.
 */
struct mem_cgroup_stat_cpu __percpu *stat;
-   spinlock_t pcp_counter_lock;
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct cg_proto tcp_mem;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba34e09..da21143550c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4179,7 +4179,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg_wb_domain_init(memcg, GFP_KERNEL))
goto out_free_stat;
 
-   spin_lock_init(&memcg->pcp_counter_lock);
return memcg;
 
 out_free_stat:
-- 
2.6.0.rc0.131.gf624c3d



Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-23 Thread Greg Thelen

Andrew Morton wrote:

> On Tue, 22 Sep 2015 17:42:13 -0700 Greg Thelen  wrote:
>
>> Andrew Morton wrote:
>> 
>> > On Tue, 22 Sep 2015 15:16:32 -0700 Greg Thelen  wrote:
>> >
>> >> mem_cgroup_read_stat() returns a page count by summing per cpu page
>> >> counters.  The summing is racy wrt. updates, so a transient negative sum
>> >> is possible.  Callers don't want negative values:
>> >> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
>> >> - oom reports and memory.stat shouldn't show confusing negative usage.
>> >> - tree_usage() already avoids negatives.
>> >>
>> >> Avoid returning negative page counts from mem_cgroup_read_stat() and
>> >> convert it to unsigned.
>> >
>> > Someone please remind me why this code doesn't use the existing
>> > percpu_counter library which solved this problem years ago.
>> >
>> >>   for_each_possible_cpu(cpu)
>> >
>> > and which doesn't iterate across offlined CPUs.
>> 
>> I found [1] and [2] discussing memory layout differences between:
>> a) existing memcg hand rolled per cpu arrays of counters
>> vs
>> b) array of generic percpu_counter
>> The current approach was claimed to have lower memory overhead and
>> better cache behavior.
>> 
>> I assume it's pretty straightforward to create generic
>> percpu_counter_array routines which memcg could use.  Possibly something
>> like this could be made general enough to satisfy vmstat, but that's
>> less clear.
>> 
>> [1] http://www.spinics.net/lists/cgroups/msg06216.html
>> [2] https://lkml.org/lkml/2014/9/11/1057
>
> That all sounds rather bogus to me.  __percpu_counter_add() doesn't
> modify struct percpu_counter at all except for when the cpu-local
> counter overflows the configured batch size.  And for the memcg
> application I suspect we can set the batch size to INT_MAX...

Nod.  The memory usage will be a bit larger, but the code reuse is
attractive.  I dusted off Vladimir's
https://lkml.org/lkml/2014/9/11/710.  Next step is to benchmark it
before posting.
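
For the record, a minimal sketch of that direction (assumed names; this is
not Vladimir's patch and is unbenchmarked): an array of generic
percpu_counters using an INT_MAX batch so updates stay in the per-cpu deltas.

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/percpu_counter.h>

#define NR_STATS 12     /* stand-in for MEM_CGROUP_STAT_NSTATS */

struct stat_counters {
        struct percpu_counter stat[NR_STATS];
};

static int stat_counters_init(struct stat_counters *c)
{
        int i, err;

        for (i = 0; i < NR_STATS; i++) {
                err = percpu_counter_init(&c->stat[i], 0, GFP_KERNEL);
                if (err) {
                        while (--i >= 0)
                                percpu_counter_destroy(&c->stat[i]);
                        return err;
                }
        }
        return 0;
}

static void stat_counters_add(struct stat_counters *c, int idx, s64 delta)
{
        /* huge batch: the shared counter and its lock are rarely touched */
        __percpu_counter_add(&c->stat[idx], delta, INT_MAX);
}

static s64 stat_counters_read(struct stat_counters *c, int idx)
{
        /* sums all cpus and clamps transient negative totals to zero */
        return percpu_counter_sum_positive(&c->stat[idx]);
}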


[PATCH] fs, seqfile: always allow oom killer

2015-09-25 Thread Greg Thelen
Since commit 5cec38ac866b ("fs, seq_file: fallback to vmalloc instead of
oom kill processes") seq_buf_alloc() avoids calling the oom killer for
PAGE_SIZE or smaller allocations; but larger allocations can use the oom
killer via vmalloc().  Thus reads of small files can return ENOMEM, but
larger files use the oom killer to avoid ENOMEM.

Memory overcommit requires use of the oom killer to select a victim
regardless of file size.

Enable oom killer for small seq_buf_alloc() allocations.

Signed-off-by: David Rientjes 
Signed-off-by: Greg Thelen 
---
 fs/seq_file.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 225586e141ca..a8e288755f24 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -25,12 +25,17 @@ static void seq_set_overflow(struct seq_file *m)
 static void *seq_buf_alloc(unsigned long size)
 {
void *buf;
+   gfp_t gfp = GFP_KERNEL;
 
/*
-* __GFP_NORETRY to avoid oom-killings with high-order allocations -
-* it's better to fall back to vmalloc() than to kill things.
+* For high order allocations, use __GFP_NORETRY to avoid oom-killing -
+* it's better to fall back to vmalloc() than to kill things.  For small
+* allocations, just use GFP_KERNEL which will oom kill, thus no need
+* for vmalloc fallback.
 */
-   buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+   if (size > PAGE_SIZE)
+   gfp |= __GFP_NORETRY | __GFP_NOWARN;
+   buf = kmalloc(size, gfp);
if (!buf && size > PAGE_SIZE)
buf = vmalloc(size);
return buf;
-- 
2.6.0.rc2.230.g3dd15c0



Re: 4.3-rc1 dirty page count underflow (cgroup-related?)

2015-09-21 Thread Greg Thelen

Dave Hansen wrote:

> On 09/17/2015 11:09 PM, Greg Thelen wrote:
>> I'm not denying the issue, bug the WARNING splat isn't necessarily
>> catching a problem.  The corresponding code comes from your debug patch:
>> +
>> WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > 
>> (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>> 
>> This only checks a single cpu's counter, which can be negative.  The sum
>> of all counters is what matters.
>> Imagine:
>> cpu1) dirty page: inc
>> cpu2) clean page: dec
>> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>> 
>> I'll look at the code and also see if I can reproduce the failure using
>> mem_cgroup_read_stat() for all of the new WARNs.
>
> D'oh.  I'll replace those with the proper mem_cgroup_read_stat() and
> test with your patch to see if anything still triggers.

Thanks Dave.

Here's what I think we should use to fix the issue.  I tagged this for
v4.2 stable given the way that unpatched performance falls apart without
warning or workaround (besides deleting and recreating affected memcg).
Feedback welcome.

From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001
From: Greg Thelen 
Date: Sat, 19 Sep 2015 16:21:18 -0700
Subject: [PATCH] memcg: fix dirty page migration

The problem starts with a file backed dirty page which is charged to a
memcg.  Then page migration is used to move oldpage to newpage.
Migration:
- copies the oldpage's data to newpage
- clears oldpage.PG_dirty
- sets newpage.PG_dirty
- uncharges oldpage from memcg
- charges newpage to memcg

Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count.  However, because newpage is not yet charged, setting
newpage.PG_dirty does not increment the memcg's dirty page count.  After
migration completes newpage.PG_dirty is eventually cleared, often in
account_page_cleaned().  At this time newpage is charged to a memcg so
the memcg's dirty page count is decremented which causes underflow
because the count was not previously incremented by migration.  This
underflow causes balance_dirty_pages() to see a very large unsigned
number of dirty memcg pages which leads to aggressive throttling of
buffered writes by processes in non root memcg.

This issue:
- can harm performance of non root memcg buffered writes.
- can report too small (even negative) values in
  memory.stat[(total_)dirty] counters of all memcg, including the root.

To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
page_memcg() and set_page_memcg() helpers.

Test:
0) setup and enter limited memcg
mkdir /sys/fs/cgroup/test
echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/test/cgroup.procs

1) buffered writes baseline
dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
sync
grep ^dirty /sys/fs/cgroup/test/memory.stat

2) buffered writes with compaction antagonist to induce migration
yes 1 > /proc/sys/vm/compact_memory &
rm -rf /data/tmp/foo
dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
kill %
sync
grep ^dirty /sys/fs/cgroup/test/memory.stat

3) buffered writes without antagonist, should match baseline
rm -rf /data/tmp/foo
dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
sync
grep ^dirty /sys/fs/cgroup/test/memory.stat

                                     (speed, dirty residue)
   unpatched                           patched
1) 841 MB/s 0 dirty pages              886 MB/s 0 dirty pages
2) 611 MB/s -33427456 dirty pages      793 MB/s 0 dirty pages
3) 114 MB/s -33427456 dirty pages      891 MB/s 0 dirty pages

Notice that unpatched baseline performance (1) fell after
migration (3): 841 -> 114 MB/s.  In the patched kernel, post
migration performance matches baseline.

Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
Cc:  # 4.2+
Reported-by: Dave Hansen 
Signed-off-by: Greg Thelen 
---
 include/linux/mm.h | 21 +
 mm/migrate.c   | 12 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91c08f6f0dc9..80001de019ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum 
zone_type zone,
 #endif
 }
 
+#ifdef CONFIG_MEMCG
+static inline struct mem_cgroup *page_memcg(struct page *page)
+{
+   return page->mem_cgroup;
+}
+
+static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
+{
+   page->mem_cgroup = memcg;
+}
+#else
+static inline struct mem_cgroup *page_memcg(struct page *page)
+{
+   return NULL;
+}
+
+static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
+{
+}
+#endif
+
 /*
  * Some inline

[PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-22 Thread Greg Thelen
mem_cgroup_read_stat() returns a page count by summing per cpu page
counters.  The summing is racy wrt. updates, so a transient negative sum
is possible.  Callers don't want negative values:
- mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
- oom reports and memory.stat shouldn't show confusing negative usage.
- tree_usage() already avoids negatives.

Avoid returning negative page counts from mem_cgroup_read_stat() and
convert it to unsigned.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba34e09..2633e9be4a99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -644,12 +644,14 @@ mem_cgroup_largest_soft_limit_node(struct 
mem_cgroup_tree_per_zone *mctz)
 }
 
 /*
+ * Return page count for single (non recursive) @memcg.
+ *
  * Implementation Note: reading percpu statistics for memcg.
  *
  * Both of vmstat[] and percpu_counter has threshold and do periodic
  * synchronization to implement "quick" read. There are trade-off between
  * reading cost and precision of value. Then, we may have a chance to implement
- * a periodic synchronizion of counter in memcg's counter.
+ * a periodic synchronization of counter in memcg's counter.
  *
  * But this _read() function is used for user interface now. The user accounts
  * memory usage by memory cgroup and he _always_ requires exact value because
@@ -659,17 +661,24 @@ mem_cgroup_largest_soft_limit_node(struct 
mem_cgroup_tree_per_zone *mctz)
  *
  * If there are kernel internal actions which can make use of some not-exact
  * value, and reading all cpu value can be performance bottleneck in some
- * common workload, threashold and synchonization as vmstat[] should be
+ * common workload, threashold and synchronization as vmstat[] should be
  * implemented.
  */
-static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
-enum mem_cgroup_stat_index idx)
+static unsigned long
+mem_cgroup_read_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx)
 {
long val = 0;
int cpu;
 
+   /* Per-cpu values can be negative, use a signed accumulator */
for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->count[idx], cpu);
+   /*
+* Summing races with updates, so val may be negative.  Avoid exposing
+* transient negative values.
+*/
+   if (val < 0)
+   val = 0;
return val;
 }
 
@@ -1254,7 +1263,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, 
struct task_struct *p)
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
continue;
-   pr_cont(" %s:%ldKB", mem_cgroup_stat_names[i],
+   pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
K(mem_cgroup_read_stat(iter, i)));
}
 
@@ -2819,14 +2828,11 @@ static unsigned long tree_stat(struct mem_cgroup *memcg,
   enum mem_cgroup_stat_index idx)
 {
struct mem_cgroup *iter;
-   long val = 0;
+   unsigned long val = 0;
 
-   /* Per-cpu values can be negative, use a signed accumulator */
for_each_mem_cgroup_tree(iter, memcg)
val += mem_cgroup_read_stat(iter, idx);
 
-   if (val < 0) /* race ? */
-   val = 0;
return val;
 }
 
@@ -3169,7 +3175,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
continue;
-   seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i],
+   seq_printf(m, "%s %lu\n", mem_cgroup_stat_names[i],
   mem_cgroup_read_stat(memcg, i) * PAGE_SIZE);
}
 
@@ -3194,13 +3200,13 @@ static int memcg_stat_show(struct seq_file *m, void *v)
   (u64)memsw * PAGE_SIZE);
 
for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
-   long long val = 0;
+   unsigned long long val = 0;
 
if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
continue;
for_each_mem_cgroup_tree(mi, memcg)
val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE;
-   seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val);
+   seq_printf(m, "total_%s %llu\n", mem_cgroup_stat_names[i], val);
}
 
for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
-- 
2.6.0.rc0.131.gf624c3d


[PATCH] powerpc/powernv/pci: add ifdef to avoid dead code

2020-06-13 Thread Greg Thelen
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
'pnv_ioda_setup_bus_dma' defined but not used

Add CONFIG_IOMMU_API ifdef guard to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..f7762052b7c4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,6 +1885,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
+#ifdef CONFIG_IOMMU_API
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
struct pci_dev *dev;
@@ -1897,6 +1898,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe 
*pe, struct pci_bus *bus)
pnv_ioda_setup_bus_dma(pe, dev->subordinate);
}
 }
+#endif
 
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
-- 
2.27.0.290.gba653c62da-goog



[PATCH] e1000e: add ifdef to avoid dead code

2020-06-13 Thread Greg Thelen
Commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME
systems") added e1000e_check_me() but it's only called from
CONFIG_PM_SLEEP protected code.  Thus builds without CONFIG_PM_SLEEP
see:
  drivers/net/ethernet/intel/e1000e/netdev.c:137:13: warning: 'e1000e_check_me' 
defined but not used [-Wunused-function]

Add CONFIG_PM_SLEEP ifdef guard to avoid dead code.

Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems")
Signed-off-by: Greg Thelen 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a279f4fa9962..165f0aea22c9 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -107,6 +107,7 @@ static const struct e1000_reg_info e1000_reg_info_tbl[] = {
{0, NULL}
 };
 
+#ifdef CONFIG_PM_SLEEP
 struct e1000e_me_supported {
u16 device_id;  /* supported device ID */
 };
@@ -145,6 +146,7 @@ static bool e1000e_check_me(u16 device_id)
 
return false;
 }
+#endif /* CONFIG_PM_SLEEP */
 
 /**
  * __ew32_prepare - prepare to write to MAC CSR register on certain parts
-- 
2.27.0.290.gba653c62da-goog



[PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-06-14 Thread Greg Thelen
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
'pnv_ioda_setup_bus_dma' defined but not used

Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..743d840712da 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
-{
-   struct pci_dev *dev;
-
-   list_for_each_entry(dev, &bus->devices, bus_list) {
-   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
-   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
-
-   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
-   }
-}
-
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
 {
@@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
+   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
+
+   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+   }
+}
+
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
__u64 window_size, __u32 levels)
 {
-- 
2.27.0.290.gba653c62da-goog



Re: [PATCH] shmem, memcg: enable memcg aware shrinker

2020-06-04 Thread Greg Thelen
Yang Shi  wrote:

> On Sun, May 31, 2020 at 8:22 PM Greg Thelen  wrote:
>>
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>> echo . > $FS/$i
>>   done
>
> It looks all the inodes which have tail THP beyond i_size are on one
> single list, then the shrinker actually just splits the first
> nr_to_scan inodes. But since the list is not memcg aware, so it seems
> it may split the THPs which are not charged to the victim memcg and
> the victim memcg still may suffer from pre-mature oom, right?

Correct.  shmem_unused_huge_shrink() is not memcg aware.  In response to
memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs
inodes regardless of whether they're charged to the under-pressure memcg.
do_shrink_slab() looks like it'll repeatedly call
shmem_unused_huge_shrink(), so it will split the tails of many inodes and
likely avoid the OOM by over-shrinking.  This is not ideal, but it seems
better than an undeserved OOM kill.

I think a memcg-aware index (as Kirill Tkhai suggested) would solve both
problems:
1) avoid premature OOM by registering the shrinker to respond to memcg
   pressure
2) avoid shrinking/splitting inodes unrelated to the under-pressure
   memcg

I can certainly look into that (thanks Kirill for the pointers).  In the
short term I'm still interested in avoiding premature OOMs with the
approach from the original patch (i.e. restore the pre-4.19 shmem shrinker
behavior under memcg pressure).  I plan to test and repost v2.  A sketch
of the memcg-aware direction follows.
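
To make that direction concrete, here is a rough, untested sketch of a
memcg-aware shrinker pair.  The helpers my_count_tails(), my_pick_inode()
and my_split_tail() are placeholders for a per-memcg per-node index of
tmpfs inodes with huge tails beyond i_size; no such index exists today.

  #include <linux/shrinker.h>
  #include <linux/fs.h>

  /* Placeholder API for the hypothetical per-memcg per-node index. */
  unsigned long my_count_tails(struct mem_cgroup *memcg, int nid);
  struct inode *my_pick_inode(struct mem_cgroup *memcg, int nid);
  unsigned long my_split_tail(struct inode *inode);

  /* Count only objects charged to sc->memcg on node sc->nid. */
  static unsigned long shmem_memcg_count(struct shrinker *shrink,
                                         struct shrink_control *sc)
  {
          return my_count_tails(sc->memcg, sc->nid);
  }

  /* Split huge tails for the memcg under pressure, up to sc->nr_to_scan. */
  static unsigned long shmem_memcg_scan(struct shrinker *shrink,
                                        struct shrink_control *sc)
  {
          unsigned long split = 0;

          while (split < sc->nr_to_scan) {
                  struct inode *inode = my_pick_inode(sc->memcg, sc->nid);

                  if (!inode)
                          break;
                  split += my_split_tail(inode);
          }
          return split ? split : SHRINK_STOP;
  }

  static struct shrinker shmem_memcg_shrinker = {
          .count_objects  = shmem_memcg_count,
          .scan_objects   = shmem_memcg_scan,
          .seeks          = DEFAULT_SEEKS,
          .flags          = SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
  };

The producer side (whatever adds an inode to that index) would call
memcg_set_shrinker_bit() for the charging memcg and node, so that
do_shrink_slab() knows to invoke scan_objects() for this shrinker.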


Re: [PATCH v18 11/14] Documentation: Add documents for DAMON

2020-07-27 Thread Greg Thelen
SeongJae Park  wrote:

> From: SeongJae Park 
>
> This commit adds documents for DAMON under
> `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.
>
> Signed-off-by: SeongJae Park 
> ---
>  Documentation/admin-guide/mm/damon/guide.rst | 157 ++
>  Documentation/admin-guide/mm/damon/index.rst |  15 +
>  Documentation/admin-guide/mm/damon/plans.rst |  29 ++
>  Documentation/admin-guide/mm/damon/start.rst |  98 ++
>  Documentation/admin-guide/mm/damon/usage.rst | 298 +++
>  Documentation/admin-guide/mm/index.rst   |   1 +
>  Documentation/vm/damon/api.rst   |  20 ++
>  Documentation/vm/damon/eval.rst  | 222 ++
>  Documentation/vm/damon/faq.rst   |  59 
>  Documentation/vm/damon/index.rst |  32 ++
>  Documentation/vm/damon/mechanisms.rst| 165 ++
>  Documentation/vm/index.rst   |   1 +
>  12 files changed, 1097 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/index.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/start.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
>  create mode 100644 Documentation/vm/damon/api.rst
>  create mode 100644 Documentation/vm/damon/eval.rst
>  create mode 100644 Documentation/vm/damon/faq.rst
>  create mode 100644 Documentation/vm/damon/index.rst
>  create mode 100644 Documentation/vm/damon/mechanisms.rst
>
> diff --git a/Documentation/admin-guide/mm/damon/guide.rst 
> b/Documentation/admin-guide/mm/damon/guide.rst
> new file mode 100644
> index ..c51fb843efaa
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/guide.rst
> @@ -0,0 +1,157 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==
> +Optimization Guide
> +==
> +
> +This document helps you estimating the amount of benefit that you could get
> +from DAMON-based optimizations, and describes how you could achieve it.  You
> +are assumed to already read :doc:`start`.
> +
> +
> +Check The Signs
> +===
> +
> +No optimization can provide same extent of benefit to every case.  Therefore
> +you should first guess how much improvements you could get using DAMON.  If
> +some of below conditions match your situation, you could consider using 
> DAMON.
> +
> +- *Low IPC and High Cache Miss Ratios.*  Low IPC means most of the CPU time 
> is
> +  spent waiting for the completion of time-consuming operations such as 
> memory
> +  access, while high cache miss ratios mean the caches don't help it well.
> +  DAMON is not for cache level optimization, but DRAM level.  However,
> +  improving DRAM management will also help this case by reducing the memory
> +  operation latency.
> +- *Memory Over-commitment and Unknown Users.*  If you are doing memory
> +  overcommitment and you cannot control every user of your system, a memory
> +  bank run could happen at any time.  You can estimate when it will happen
> +  based on DAMON's monitoring results and act earlier to avoid or deal better
> +  with the crisis.
> +- *Frequent Memory Pressure.*  Frequent memory pressure means your system has
> +  wrong configurations or memory hogs.  DAMON will help you find the right
> +  configuration and/or the criminals.
> +- *Heterogeneous Memory System.*  If your system is utilizing memory devices
> +  that placed between DRAM and traditional hard disks, such as non-volatile
> +  memory or fast SSDs, DAMON could help you utilizing the devices more
> +  efficiently.
> +
> +
> +Profile
> +===
> +
> +If you found some positive signals, you could start by profiling your 
> workloads
> +using DAMON.  Find major workloads on your systems and analyze their data
> +access pattern to find something wrong or can be improved.  The DAMON user
> +space tool (``damo``) will be useful for this.
> +
> +We recommend you to start from working set size distribution check using 
> ``damo
> +report wss``.  If the distribution is ununiform or quite different from what
> +you estimated, you could consider `Memory Configuration`_ optimization.
> +
> +Then, review the overall access pattern in heatmap form using ``damo report
> +heats``.  If it shows a simple pattern consists of a small number of memory
> +regions having high contrast of access temperature, you could consider manual
> +`Program Modification`_.
> +
> +If you still want to absorb more benefits, you should develop `Personalized
> +DAMON Application`_ for your special case.
> +
> +You don't need to take only one approach among the above plans, but you could
> +use multiple of the above approaches to maximize the benefit.
> +
> +
> +Optimize
> +
> +
> +If the profiling result also says it's worth trying some optimization, you
> +could consider below approaches.  Note that some of the below approaches 
> assume
> +that y

Re: [PATCH v18 06/14] mm/damon: Implement callbacks for the virtual memory address spaces

2020-07-27 Thread Greg Thelen
SeongJae Park  wrote:

> From: SeongJae Park 
>
> This commit introduces a reference implementation of the address space
> specific low level primitives for the virtual address space, so that
> users of DAMON can easily monitor the data accesses on virtual address
> spaces of specific processes by simply configuring the implementation to
> be used by DAMON.
>
> The low level primitives for the fundamental access monitoring are
> defined in two parts:
> 1. Identification of the monitoring target address range for the address
> space.
> 2. Access check of specific address range in the target space.
>
> The reference implementation for the virtual address space provided by
> this commit is designed as below.
>
> PTE Accessed-bit Based Access Check
> ---
>
> The implementation uses PTE Accessed-bit for basic access checks.  That
> is, it clears the bit for next sampling target page and checks whether
> it set again after one sampling period.  To avoid disturbing other
> Accessed bit users such as the reclamation logic, the implementation
> adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, as same to the
> 'Idle Page Tracking'.
>
> VMA-based Target Address Range Construction
> ---
>
> Only small parts in the super-huge virtual address space of the
> processes are mapped to physical memory and accessed.  Thus, tracking
> the unmapped address regions is just wasteful.  However, because DAMON
> can deal with some level of noise using the adaptive regions adjustment
> mechanism, tracking every mapping is not strictly required but could
> even incur a high overhead in some cases.  That said, too huge unmapped
> areas inside the monitoring target should be removed to not take the
> time for the adaptive mechanism.
>
> For the reason, this implementation converts the complex mappings to
> three distinct regions that cover every mapped area of the address
> space.  Also, the two gaps between the three regions are the two biggest
> unmapped areas in the given address space.  The two biggest unmapped
> areas would be the gap between the heap and the uppermost mmap()-ed
> region, and the gap between the lowermost mmap()-ed region and the stack
> in most of the cases.  Because these gaps are exceptionally huge in
> usual address spacees, excluding these will be sufficient to make a
> reasonable trade-off.  Below shows this in detail::
>
> <heap>
> <BIG UNMAPPED REGION 1>
> <uppermost mmap()-ed region>
> (small mmap()-ed regions and munmap()-ed regions)
> <lowermost mmap()-ed region>
> <BIG UNMAPPED REGION 2>
> <stack>
>
> Signed-off-by: SeongJae Park 
> Reviewed-by: Leonard Foerster 
> ---
>  include/linux/damon.h |   6 +
>  mm/damon.c| 474 ++
>  2 files changed, 480 insertions(+)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 3c0b92a679e8..310d36d123b3 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -144,6 +144,12 @@ struct damon_ctx {
>   void (*aggregate_cb)(struct damon_ctx *context);
>  };
>  
> +/* Reference callback implementations for virtual memory */
> +void kdamond_init_vm_regions(struct damon_ctx *ctx);
> +void kdamond_update_vm_regions(struct damon_ctx *ctx);
> +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx);
> +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx);
> +
>  int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
>  int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
>   unsigned long aggr_int, unsigned long regions_update_int,
> diff --git a/mm/damon.c b/mm/damon.c
> index b844924b9fdb..386780739007 100644
> --- a/mm/damon.c
> +++ b/mm/damon.c
> @@ -9,6 +9,9 @@
>   * This file is constructed in below parts.
>   *
>   * - Functions and macros for DAMON data structures
> + * - Functions for the initial monitoring target regions construction
> + * - Functions for the dynamic monitoring target regions update
> + * - Functions for the access checking of the regions
>   * - Functions for DAMON core logics and features
>   * - Functions for the DAMON programming interface
>   * - Functions for the module loading/unloading
> @@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct 
> damon_ctx *ctx)
>   return sz;
>  }
>  
> +/*
> + * Get the mm_struct of the given task
> + *
> + * Caller _must_ put the mm_struct after use, unless it is NULL.
> + *
> + * Returns the mm_struct of the task on success, NULL on failure
> + */
> +static struct mm_struct *damon_get_mm(struct damon_task *t)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> +
> + task = damon_get_task_struct(t);
> + if (!task)
> + return NULL;
> +
> + mm = get_task_mm(task);
> + put_task_struct(task);
> + return mm;
> +}
> +
> +/*
> + * Functions for the initial monitoring target regions construction
> + */
> +
> +/*
> + * Size-evenly split a region into 'nr_pieces' small regions
> + *
> + * Returns 0 on success, o

Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-07-18 Thread Greg Thelen
Oliver O'Halloran  wrote:

> On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen  wrote:
>>
>> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
>> configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
>> only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
>> CONFIG_IOMMU_API see:
>>   arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
>> 'pnv_ioda_setup_bus_dma' defined but not used
>>
>> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.
>
> Doh! Thanks for the fix.
>
> Reviewed-by: Oliver O'Halloran 

Is there anything else needed from me on this patch?
Given that it fixes a 5.8 commit I figured it'd be 5.8 material.


[PATCH] selftests: more general make nesting support

2020-07-28 Thread Greg Thelen
selftests can be built from the toplevel kernel makefile (e.g. make
kselftest-all) or directly (make -C tools/testing/selftests all).

The toplevel kernel makefile explicitly disables implicit rules with
"MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
selftest makefiles require implicit make rules, which is why
commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
kselftest target") reenables implicit rules by clearing MAKEFLAGS if
MAKELEVEL=1.

So far so good.  However, if the toplevel makefile is called from an
outer makefile then MAKELEVEL will be elevated, which breaks the
MAKELEVEL equality test.
Example wrapped makefile error:
  $ cat ~/Makefile
  all:
$(MAKE) defconfig
$(MAKE) kselftest-all
  $ make -sf ~/Makefile
futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h   
/src/tools/testing/selftests/kselftest.h ../include/futextest.h 
../include/atomic.h ../include/logging.h -lpthread -lrt -o 
/src/tools/testing/selftests/futex/functional/futex_wait_timeout
  make[4]: futex_wait_timeout.c: Command not found

Rather than checking $(MAKELEVEL), check for an empty $(LINK.c), which is a
more direct side effect of "make -R".  This enables arbitrary makefile
nesting.

Signed-off-by: Greg Thelen 
---
 tools/testing/selftests/Makefile | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..289a2e4b3f6f 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -84,10 +84,10 @@ endif
 # of the targets gets built.
 FORCE_TARGETS ?=
 
-# Clear LDFLAGS and MAKEFLAGS if called from main
-# Makefile to avoid test build failures when test
-# Makefile doesn't have explicit build rules.
-ifeq (1,$(MAKELEVEL))
+# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
+# implicit rules to sub-test Makefiles, which avoids build failures in test
+# Makefiles that don't have explicit build rules.
+ifeq (,$(LINK.c))
 override LDFLAGS =
 override MAKEFLAGS =
 endif
-- 
2.28.0.rc0.142.g3c755180ce-goog



Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed

2020-07-01 Thread Greg Thelen
Dave Hansen  wrote:

> From: Keith Busch 
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the
dst page allocation, which might wander into
reclaim/cond_resched/oom_kill?  I don't have a deadlock in mind.  I'm
just wondering about the additional latency imposed on unrelated threads
that want to access the src page.

> #Signed-off-by: Keith Busch 

Is the commented-out Signed-off-by intentional?  The same applies to later patches.

> Signed-off-by: Dave Hansen 
> Cc: Keith Busch 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> ---
>
>  b/mm/migrate.c |  148 
> -
>  1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 
> mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 
> 2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
>   return rc;
>  }
>  
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> - int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> + free_page_t put_new_page,
> + unsigned long private, struct page *page,
> + enum migrate_mode mode,
> + enum migrate_reason reason)
>  {
>   int rc = -EAGAIN;
>   int page_was_mapped = 0;
>   struct anon_vma *anon_vma = NULL;
>   bool is_lru = !__PageMovable(page);
> -
> - if (!trylock_page(page)) {
> - if (!force || mode == MIGRATE_ASYNC)
> - goto out;
> -
> - /*
> -  * It's not safe for direct compaction to call lock_page.
> -  * For example, during page readahead pages are added locked
> -  * to the LRU. Later, when the IO completes the pages are
> -  * marked uptodate and unlocked. However, the queueing
> -  * could be merging multiple pages for one bio (e.g.
> -  * mpage_readpages). If an allocation happens for the
> -  * second or third page, the process can end up locking
> -  * the same page twice and deadlocking. Rather than
> -  * trying to be clever about what pages can be locked,
> -  * avoid the use of lock_page for direct compaction
> -  * altogether.
> -  */
> - if (current->flags & PF_MEMALLOC)
> - goto out;
> -
> - lock_page(page);
> - }
> -
> - if (PageWriteback(page)) {
> - /*
> -  * Only in the case of a full synchronous migration is it
> -  * necessary to wait for PageWriteback. In the async case,
> -  * the retry loop is too short and in the sync-light case,
> -  * the overhead of stalling is too much
> -  */
> - switch (mode) {
> - case MIGRATE_SYNC:
> - case MIGRATE_SYNC_NO_COPY:
> - break;
> - default:
> - rc = -EBUSY;
> - goto out_unlock;
> - }
> - if (!force)
> - goto out_unlock;
> - wait_on_page_writeback(page);
> - }
> + struct page *newpage;
>  
>   /*
>* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
>   if (PageAnon(page) && !PageKsm(page))
>   anon_vma = page_get_anon_vma(page);
>  
> + newpage = get_new_page(page, private);
> + if (!newpage) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
>   /*
>* Block others from accessing the new page when we get around to
>* establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
>* This is much like races on refcount of oldpage: just don't BUG().
>*/
>   if (unlikely(!trylock_page(newpage)))
> - goto out_unlock;
> + goto out_put;
>  
>   if (unlikely(!is_lru)) {
>   rc = move_to_new_page(newpage, page, mode);
> - goto out_unlock_both;
> + goto out_unlock;
>   }
>  
>   /*
> @@ -1114,7 +1081,7 @@ 
