from:"Greg Thelen"

[PATCH] selftests: make kselftest-clean remove libynl outputs

2024-10-05 Thread Greg Thelen

Starting with 6.12 commit 85585b4bc8d8 ("selftests: add ncdevmem, netcat
for devmem TCP") kselftest-all creates additional outputs that
kselftest-clean does not cleanup:
  $ make defconfig
  $ make kselftest-all
  $ make kselftest-clean
  $ git clean -ndxf | grep tools/net
  Would remove tools/net/ynl/lib/__pycache__/
  Would remove tools/net/ynl/lib/ynl.a
  Would remove tools/net/ynl/lib/ynl.d
  Would remove tools/net/ynl/lib/ynl.o

Make kselftest-clean remove the newly added net/ynl outputs.

Fixes: 85585b4bc8d8 ("selftests: add ncdevmem, netcat for devmem TCP")
Signed-off-by: Greg Thelen 
---
 tools/testing/selftests/net/ynl.mk | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/net/ynl.mk 
b/tools/testing/selftests/net/ynl.mk
index 59cb26cf3f73..1ef24119def0 100644
--- a/tools/testing/selftests/net/ynl.mk
+++ b/tools/testing/selftests/net/ynl.mk
@@ -19,3 +19,7 @@ $(YNL_OUTPUTS): CFLAGS += \
 $(OUTPUT)/libynl.a:
$(Q)$(MAKE) -C $(top_srcdir)/tools/net/ynl GENS="$(YNL_GENS)" libynl.a
$(Q)cp $(top_srcdir)/tools/net/ynl/libynl.a $(OUTPUT)/libynl.a
+
+EXTRA_CLEAN += \
+   $(top_srcdir)/tools/net/ynl/lib/__pycache__ \
+   $(top_srcdir)/tools/net/ynl/lib/*.[ado]
-- 
2.47.0.rc0.187.ge670bccf7e-goog

Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-03-19 Thread Greg Thelen

Dave Hansen  wrote:

> From: Dave Hansen 
>
> Anonymous pages are kept on their own LRU(s).  These lists could
> theoretically always be scanned and maintained.  But, without swap,
> there is currently nothing the kernel can *do* with the results of a
> scanned, sorted LRU for anonymous pages.
>
> A check for '!total_swap_pages' currently serves as a valid check as
> to whether anonymous LRUs should be maintained.  However, another
> method will be added shortly: page demotion.
>
> Abstract out the 'total_swap_pages' checks into a helper, give it a
> logically significant name, and check for the possibility of page
> demotion.

Reviewed-by: Greg Thelen 

> Signed-off-by: Dave Hansen 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> ---
>
>  b/mm/vmscan.c |   28 +---
>  1 file changed, 25 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c
> --- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged  2021-03-04 15:35:58.935806422 
> -0800
> +++ b/mm/vmscan.c 2021-03-04 15:35:58.942806422 -0800
> @@ -2517,6 +2517,26 @@ out:
>   }
>  }
>  
> +/*
> + * Anonymous LRU management is a waste if there is
> + * ultimately no way to reclaim the memory.
> + */
> +bool anon_should_be_aged(struct lruvec *lruvec)

Should this be static?

> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> + /* Aging the anon LRU is valuable if swap is present: */
> + if (total_swap_pages > 0)
> + return true;
> +
> + /* Also valuable if anon pages can be demoted: */
> + if (next_demotion_node(pgdat->node_id) >= 0)
> + return true;
> +
> + /* No way to reclaim anon pages.  Should not age anon LRUs: */
> + return false;
> +}
> +
>  static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
>   unsigned long nr[NR_LRU_LISTS];
> @@ -2626,7 +2646,8 @@ static void shrink_lruvec(struct lruvec
>* Even if we did not try to evict anon pages at all, we want to
>* rebalance the anon lru active/inactive ratio.
>*/
> - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON))
> + if (anon_should_be_aged(lruvec) &&
> + inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  sc, LRU_ACTIVE_ANON);
>  }
> @@ -3455,10 +3476,11 @@ static void age_active_anon(struct pglis
>   struct mem_cgroup *memcg;
>   struct lruvec *lruvec;
>  
> - if (!total_swap_pages)
> + lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +
> + if (!anon_should_be_aged(lruvec))
>   return;
>  
> - lruvec = mem_cgroup_lruvec(NULL, pgdat);
>   if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
>   return;
>  
> _

Re: + revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch added to -mm tree

2020-12-07 Thread Greg Thelen

a...@linux-foundation.org wrote:

> The patch titled
>  Subject: revert "mm/filemap: add static for function 
> __add_to_page_cache_locked"
> has been added to the -mm tree.  Its filename is
>  
> revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
>
> This patch should soon appear at
> 
> https://ozlabs.org/~akpm/mmots/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
> and later at
> 
> https://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
>
> Before you just go and hit "reply", please:
>a) Consider who else should be cc'ed
>b) Prefer to cc a suitable mailing list as well
>c) Ideally: find the original patch on the mailing list and do a
>   reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing 
> your code ***
>
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
>
> --
> From: Andrew Morton 
> Subject: revert "mm/filemap: add static for function 
> __add_to_page_cache_locked"
>
> Revert 3351b16af494 ("mm/filemap: add static for function
> __add_to_page_cache_locked") due to build issues with
> CONFIG_DEBUG_INFO_BTF=y.
>
> Link: 
> https://lkml.kernel.org/r/caadnvqj6tmzbxvtrobueh6qa0h+q7yaskxrvvvxhqr3kbzd...@mail.gmail.com
> Cc: Michal Kubecek 
> Cc: Justin Forbes 
> Cc: Alex Shi 
> Cc: Souptick Joarder 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Josef Bacik 
> Signed-off-by: Andrew Morton 

Thanks.

Tested-by: Greg Thelen 

> ---
>
>  mm/filemap.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 
> a/mm/filemap.c~revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked
> +++ a/mm/filemap.c
> @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page
>  }
>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
>  
> -static noinline int __add_to_page_cache_locked(struct page *page,
> +noinline int __add_to_page_cache_locked(struct page *page,
>   struct address_space *mapping,
>   pgoff_t offset, gfp_t gfp,
>   void **shadowp)
> _
>
> Patches currently in -mm which might be from a...@linux-foundation.org are
>
> revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch
> kthread_worker-document-cpu-hotplug-handling-fix.patch
> mm.patch
> mm-remove-the-unuseful-else-after-a-return-checkpatch-fixes.patch
> mm-prevent-gup_fast-from-racing-with-cow-during-fork-checkpatch-fixes.patch
> mm-swap_state-skip-meaningless-swap-cache-readahead-when-ra_infowin-==-0-fix.patch
> mm-vmallocc-__vmalloc_area_node-avoid-32-bit-overflow.patch
> mm-cma-improve-pr_debug-log-in-cma_release-fix.patch
> mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
> lib-cmdline_kunit-add-a-new-test-suite-for-cmdline-api-fix.patch
> ilog2-improve-ilog2-for-constant-arguments-checkpatch-fixes.patch
> lib-test_bitmapc-add-for_each_set_clump-test-cases-checkpatch-fixes.patch
> checkpatch-fix-typo_spelling-check-for-words-with-apostrophe-fix.patch
> resource-fix-kernel-doc-markups-checkpatch-fixes.patch
> linux-next-rejects.patch
> kmap-stupid-hacks-to-make-it-compile.patch
> init-kconfig-dont-assume-scripts-lld-versionsh-is-executable.patch
> set_memory-allow-set_direct_map__noflush-for-multiple-pages-fix.patch
> arch-mm-wire-up-memfd_secret-system-call-were-relevant-fix.patch
> kernel-forkc-export-kernel_thread-to-modules.patch

Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-07 Thread Greg Thelen

Alex Shi  wrote:

> 在 2020/11/11 上午3:50, Andrew Morton 写道:
>> On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder  
>> wrote:
>> 
>>> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi  wrote:

 Otherwise it cause gcc warning:
   ^~~
 ../mm/filemap.c:830:14: warning: no previous prototype for
 ‘__add_to_page_cache_locked’ [-Wmissing-prototypes]
  noinline int __add_to_page_cache_locked(struct page *page,
   ^~
>>>
>>> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ?
>> 
>> hm, yes.
>
> When the config enabled, compiling looks good untill pahole tool
> used to get BTF info, but I still failed on a right version pahole
>> 1.16. Sorry.

I'm seeing an issue with this patch.  My build system has pahole v1.17,
but I don't think the pahole version is key.

$ git checkout 3351b16af494  # recently added to linus/master
$ make defconfig
$ make menuconfig  # set CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF
$ make V=1
+ ./tools/bpf/resolve_btfids/resolve_btfids vmlinux
FAILED unresolved symbol __add_to_page_cache_locked

Reverting 3351b16af494 ("mm/filemap: add static for function
__add_to_page_cache_locked") fixes the issue.

I don't see the warning which motivated this patch, but maybe it
requires particular a .config or gcc version.  Perhaps adding a
__add_to_page_cache_locked() prototype would meet avoid it.  But I
haven't studied enough on BTF to know if there's a better answer.

 Signed-off-by: Alex Shi 
 Cc: Andrew Morton 
 Cc: linux...@kvack.org
 Cc: linux-kernel@vger.kernel.org
 ---
  mm/filemap.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/mm/filemap.c b/mm/filemap.c
 index d90614f501da..249cf489f5df 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, struct 
 page *new, gfp_t gfp_mask)
  }
  EXPORT_SYMBOL_GPL(replace_page_cache_page);

 -noinline int __add_to_page_cache_locked(struct page *page,
 +static noinline int __add_to_page_cache_locked(struct page *page,
 struct address_space *mapping,
 pgoff_t offset, gfp_t gfp,
 void **shadowp)
>> 
>> It's unclear to me whether BTF_ID() requires that the target symbol be
>> non-static.  It doesn't actually reference the symbol:
>> 
>> #define BTF_ID(prefix, name) \
>> __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
>> 
>
> The above usage make me thought BTF don't require the symbol, though
> the symbol still exist in vmlinux with 'static'.
>
> So any comments of this, Alexei? 
>
>> Alexei, can you please comment?
>>

Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

2020-12-01 Thread Greg Thelen

Axel Rasmussen  wrote:

> On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt  wrote:
>>
>> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen  
>> wrote:
>> >
>> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
>> > is that an ongoing trace event might race with the tracepoint being
>> > disabled (and therefore the _unreg() callback being called). Consider
>> > this ordering:
>> >
>> > T1: trace event fires, get_mm_memcg_path() is called
>> > T1: get_memcg_path_buf() returns a buffer pointer
>> > T2: trace_mmap_lock_unreg() is called, buffers are freed
>> > T1: cgroup_path() is called with the now-freed buffer
>>
>> Any reason to use the cgroup_path instead of the cgroup_ino? There are
>> other examples of trace points using cgroup_ino and no need to
>> allocate buffers. Also cgroup namespace might complicate the path
>> usage.
>
> Hmm, so in general I would love to use a numeric identifier instead of a 
> string.
>
> I did some reading, and it looks like the cgroup_ino() mainly has to
> do with writeback, instead of being just a general identifier?
> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>
> There is cgroup_id() which I think is almost what I'd want, but there
> are a couple problems with it:
>
> - I don't know of a way for userspace to translate IDs -> paths, to
> make them human readable?

The id => name map can be built from user space with a tree walk.
Example:

$ find /sys/fs/cgroup/memory -type d -printf '%i %P\n'  
# ~ [main] 
20387 init.scope 
31 system.slice

> - Also I think the ID implementation we use for this is "dense",
> meaning if a cgroup is removed, its ID is likely to be quickly reused.
>
>>
>> >
>> > The solution in this commit is to modify trace_mmap_lock_unreg() to
>> > first stop new buffers from being handed out, and then to wait (spin)
>> > until any existing buffer references are dropped (i.e., those trace
>> > events complete).
>> >
>> > I have a simple reproducer program which spins up two pools of threads,
>> > doing the following in a tight loop:
>> >
>> >   Pool 1:
>> >   mmap(NULL, 4096, PROT_READ | PROT_WRITE,
>> >MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
>> >   munmap()
>> >
>> >   Pool 2:
>> >   echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>> >   echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable
>> >
>> > This triggers the use-after-free very quickly. With this patch, I let it
>> > run for an hour without any BUGs.
>> >
>> > While fixing this, I also noticed and fixed a css ref leak. Previously
>> > we called get_mem_cgroup_from_mm(), but we never called css_put() to
>> > release that reference. get_mm_memcg_path() now does this properly.
>> >
>> > [1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a
>> >
>> > Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock 
>> > acquisition")
>>
>> The original patch is in mm tree, so the SHA1 is not stabilized.
>> Usually Andrew squash the fixes into the original patches.
>
> Ah, I added this because it also shows up in linux-next, under the
> next-20201130 tag. I'll remove it in v2, squashing is fine. :)
>
>>
>> > Signed-off-by: Axel Rasmussen 
>> > ---
>> >  mm/mmap_lock.c | 100 +
>> >  1 file changed, 85 insertions(+), 15 deletions(-)
>> >
>> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
>> > index 12af8f1b8a14..be38dc58278b 100644
>> > --- a/mm/mmap_lock.c
>> > +++ b/mm/mmap_lock.c
>> > @@ -3,6 +3,7 @@
>> >  #include 
>> >
>> >  #include 
>> > +#include 
>> >  #include 
>> >  #include 
>> >  #include 
>> > @@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
>> >  #ifdef CONFIG_MEMCG
>> >
>> >  /*
>> > - * Our various events all share the same buffer (because we don't want or 
>> > need
>> > - * to allocate a set of buffers *per event type*), so we need to protect 
>> > against
>> > - * concurrent _reg() and _unreg() calls, and count how many _reg() calls 
>> > have
>> > - * been made.
>> > + * This is unfortunately complicated... _reg() and _unreg() may be called
>> > + * in parallel, separately for each of our three event types. To save 
>> > memory,
>> > + * all of the event types share the same buffers. Furthermore, trace 
>> > events
>> > + * might happen in parallel with _unreg(); we need to ensure we don't 
>> > free the
>> > + * buffers before all inflights have finished. Because these events happen
>> > + * "frequently", we also want to prevent new inflights from starting once 
>> > the
>> > + * _unreg() process begins. And, for performance reasons, we want to 
>> > avoid any
>> > + * locking in the trace event path.
>> > + *
>> > + * So:
>> > + *
>> > + * - Use a spinlock to serialize _reg() and _unreg() calls.
>> > + * - Keep track of nested _reg() calls with a lock-protected counter.
>> > + * - Define a flag indicating whether or not unregistration

Re: [PATCH] selftests: more general make nesting support

2020-08-05 Thread Greg Thelen

Shuah Khan  wrote:

> On 8/5/20 1:36 PM, Greg Thelen wrote:
>> On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen  wrote:
>>>
>>> selftests can be built from the toplevel kernel makefile (e.g. make
>>> kselftest-all) or directly (make -C tools/testing/selftests all).
>>>
>>> The toplevel kernel makefile explicitly disables implicit rules with
>>> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
>>> selftest makefiles require implicit make rules, which is why
>>> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
>>> kselftest target") reenables implicit rules by clearing MAKEFLAGS if
>>> MAKELEVEL=1.
>>>
>>> So far so good.  However, if the toplevel makefile is called from an
>>> outer makefile then MAKELEVEL will be elevated, which breaks the
>>> MAKELEVEL equality test.
>>> Example wrapped makefile error:
>>>$ cat ~/Makefile
>>>all:
>>>  $(MAKE) defconfig
>>>  $(MAKE) kselftest-all
>>>$ make -sf ~/Makefile
>>>  futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h  
>>>  /src/tools/testing/selftests/kselftest.h ../include/futextest.h 
>>> ../include/atomic.h ../include/logging.h -lpthread -lrt -o 
>>> /src/tools/testing/selftests/futex/functional/futex_wait_timeout
>>>make[4]: futex_wait_timeout.c: Command not found
>>>
>>> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
>>> direct side effect of "make -R".  This enables arbitrary makefile
>>> nesting.
>>>
>>> Signed-off-by: Greg Thelen 
>>> ---
>>>   tools/testing/selftests/Makefile | 8 
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/tools/testing/selftests/Makefile 
>>> b/tools/testing/selftests/Makefile
>>> index 1195bd85af38..289a2e4b3f6f 100644
>>> --- a/tools/testing/selftests/Makefile
>>> +++ b/tools/testing/selftests/Makefile
>>> @@ -84,10 +84,10 @@ endif
>>>   # of the targets gets built.
>>>   FORCE_TARGETS ?=
>>>
>>> -# Clear LDFLAGS and MAKEFLAGS if called from main
>>> -# Makefile to avoid test build failures when test
>>> -# Makefile doesn't have explicit build rules.
>>> -ifeq (1,$(MAKELEVEL))
>>> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This 
>>> provides
>>> +# implicit rules to sub-test Makefiles which avoids build failures in test
>>> +# Makefile that don't have explicit build rules.
>>> +ifeq (,$(LINK.c))
>>>   override LDFLAGS =
>>>   override MAKEFLAGS =
>>>   endif
>>> --
>>> 2.28.0.rc0.142.g3c755180ce-goog
>> 
>> Is there any feedback on this patch?  It's not high priority but something 
>> that
>> will help me make more use of selftests.
>> 
>
> No issues with the patch. I will apply it once the merge window ends.
>
> thanks,
> -- Shuah

Great.  Thank you.

Re: [PATCH] selftests: more general make nesting support

2020-08-05 Thread Greg Thelen

On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen  wrote:
>
> selftests can be built from the toplevel kernel makefile (e.g. make
> kselftest-all) or directly (make -C tools/testing/selftests all).
>
> The toplevel kernel makefile explicitly disables implicit rules with
> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
> selftest makefiles require implicit make rules, which is why
> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
> kselftest target") reenables implicit rules by clearing MAKEFLAGS if
> MAKELEVEL=1.
>
> So far so good.  However, if the toplevel makefile is called from an
> outer makefile then MAKELEVEL will be elevated, which breaks the
> MAKELEVEL equality test.
> Example wrapped makefile error:
>   $ cat ~/Makefile
>   all:
> $(MAKE) defconfig
> $(MAKE) kselftest-all
>   $ make -sf ~/Makefile
> futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h   
> /src/tools/testing/selftests/kselftest.h ../include/futextest.h 
> ../include/atomic.h ../include/logging.h -lpthread -lrt -o 
> /src/tools/testing/selftests/futex/functional/futex_wait_timeout
>   make[4]: futex_wait_timeout.c: Command not found
>
> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
> direct side effect of "make -R".  This enables arbitrary makefile
> nesting.
>
> Signed-off-by: Greg Thelen 
> ---
>  tools/testing/selftests/Makefile | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/tools/testing/selftests/Makefile 
> b/tools/testing/selftests/Makefile
> index 1195bd85af38..289a2e4b3f6f 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -84,10 +84,10 @@ endif
>  # of the targets gets built.
>  FORCE_TARGETS ?=
>
> -# Clear LDFLAGS and MAKEFLAGS if called from main
> -# Makefile to avoid test build failures when test
> -# Makefile doesn't have explicit build rules.
> -ifeq (1,$(MAKELEVEL))
> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
> +# implicit rules to sub-test Makefiles which avoids build failures in test
> +# Makefile that don't have explicit build rules.
> +ifeq (,$(LINK.c))
>  override LDFLAGS =
>  override MAKEFLAGS =
>  endif
> --
> 2.28.0.rc0.142.g3c755180ce-goog

Is there any feedback on this patch?  It's not high priority but something that
will help me make more use of selftests.

[PATCH] selftests: more general make nesting support

2020-07-28 Thread Greg Thelen

selftests can be built from the toplevel kernel makefile (e.g. make
kselftest-all) or directly (make -C tools/testing/selftests all).

The toplevel kernel makefile explicitly disables implicit rules with
"MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
selftest makefiles require implicit make rules, which is why
commit 67d8712dcc70 ("selftests: Fix build failures when invoked from
kselftest target") reenables implicit rules by clearing MAKEFLAGS if
MAKELEVEL=1.

So far so good.  However, if the toplevel makefile is called from an
outer makefile then MAKELEVEL will be elevated, which breaks the
MAKELEVEL equality test.
Example wrapped makefile error:
  $ cat ~/Makefile
  all:
$(MAKE) defconfig
$(MAKE) kselftest-all
  $ make -sf ~/Makefile
futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h   
/src/tools/testing/selftests/kselftest.h ../include/futextest.h 
../include/atomic.h ../include/logging.h -lpthread -lrt -o 
/src/tools/testing/selftests/futex/functional/futex_wait_timeout
  make[4]: futex_wait_timeout.c: Command not found

Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
direct side effect of "make -R".  This enables arbitrary makefile
nesting.

Signed-off-by: Greg Thelen 
---
 tools/testing/selftests/Makefile | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..289a2e4b3f6f 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -84,10 +84,10 @@ endif
 # of the targets gets built.
 FORCE_TARGETS ?=
 
-# Clear LDFLAGS and MAKEFLAGS if called from main
-# Makefile to avoid test build failures when test
-# Makefile doesn't have explicit build rules.
-ifeq (1,$(MAKELEVEL))
+# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
+# implicit rules to sub-test Makefiles which avoids build failures in test
+# Makefile that don't have explicit build rules.
+ifeq (,$(LINK.c))
 override LDFLAGS =
 override MAKEFLAGS =
 endif
-- 
2.28.0.rc0.142.g3c755180ce-goog

Re: [PATCH v18 06/14] mm/damon: Implement callbacks for the virtual memory address spaces

2020-07-27 Thread Greg Thelen

SeongJae Park  wrote:

> From: SeongJae Park 
>
> This commit introduces a reference implementation of the address space
> specific low level primitives for the virtual address space, so that
> users of DAMON can easily monitor the data accesses on virtual address
> spaces of specific processes by simply configuring the implementation to
> be used by DAMON.
>
> The low level primitives for the fundamental access monitoring are
> defined in two parts:
> 1. Identification of the monitoring target address range for the address
> space.
> 2. Access check of specific address range in the target space.
>
> The reference implementation for the virtual address space provided by
> this commit is designed as below.
>
> PTE Accessed-bit Based Access Check
> ---
>
> The implementation uses PTE Accessed-bit for basic access checks.  That
> is, it clears the bit for next sampling target page and checks whether
> it set again after one sampling period.  To avoid disturbing other
> Accessed bit users such as the reclamation logic, the implementation
> adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, as same to the
> 'Idle Page Tracking'.
>
> VMA-based Target Address Range Construction
> ---
>
> Only small parts in the super-huge virtual address space of the
> processes are mapped to physical memory and accessed.  Thus, tracking
> the unmapped address regions is just wasteful.  However, because DAMON
> can deal with some level of noise using the adaptive regions adjustment
> mechanism, tracking every mapping is not strictly required but could
> even incur a high overhead in some cases.  That said, too huge unmapped
> areas inside the monitoring target should be removed to not take the
> time for the adaptive mechanism.
>
> For the reason, this implementation converts the complex mappings to
> three distinct regions that cover every mapped area of the address
> space.  Also, the two gaps between the three regions are the two biggest
> unmapped areas in the given address space.  The two biggest unmapped
> areas would be the gap between the heap and the uppermost mmap()-ed
> region, and the gap between the lowermost mmap()-ed region and the stack
> in most of the cases.  Because these gaps are exceptionally huge in
> usual address spacees, excluding these will be sufficient to make a
> reasonable trade-off.  Below shows this in detail::
>
> 
> 
> 
> (small mmap()-ed regions and munmap()-ed regions)
> 
> 
> 
>
> Signed-off-by: SeongJae Park 
> Reviewed-by: Leonard Foerster 
> ---
>  include/linux/damon.h |   6 +
>  mm/damon.c| 474 ++
>  2 files changed, 480 insertions(+)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 3c0b92a679e8..310d36d123b3 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -144,6 +144,12 @@ struct damon_ctx {
>   void (*aggregate_cb)(struct damon_ctx *context);
>  };
>  
> +/* Reference callback implementations for virtual memory */
> +void kdamond_init_vm_regions(struct damon_ctx *ctx);
> +void kdamond_update_vm_regions(struct damon_ctx *ctx);
> +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx);
> +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx);
> +
>  int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
>  int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
>   unsigned long aggr_int, unsigned long regions_update_int,
> diff --git a/mm/damon.c b/mm/damon.c
> index b844924b9fdb..386780739007 100644
> --- a/mm/damon.c
> +++ b/mm/damon.c
> @@ -9,6 +9,9 @@
>   * This file is constructed in below parts.
>   *
>   * - Functions and macros for DAMON data structures
> + * - Functions for the initial monitoring target regions construction
> + * - Functions for the dynamic monitoring target regions update
> + * - Functions for the access checking of the regions
>   * - Functions for DAMON core logics and features
>   * - Functions for the DAMON programming interface
>   * - Functions for the module loading/unloading
> @@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct 
> damon_ctx *ctx)
>   return sz;
>  }
>  
> +/*
> + * Get the mm_struct of the given task
> + *
> + * Caller _must_ put the mm_struct after use, unless it is NULL.
> + *
> + * Returns the mm_struct of the task on success, NULL on failure
> + */
> +static struct mm_struct *damon_get_mm(struct damon_task *t)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> +
> + task = damon_get_task_struct(t);
> + if (!task)
> + return NULL;
> +
> + mm = get_task_mm(task);
> + put_task_struct(task);
> + return mm;
> +}
> +
> +/*
> + * Functions for the initial monitoring target regions construction
> + */
> +
> +/*
> + * Size-evenly split a region into 'nr_pieces' small regions
> + *
> + * Returns 0 on success, o

Re: [PATCH v18 11/14] Documentation: Add documents for DAMON

2020-07-27 Thread Greg Thelen

SeongJae Park  wrote:

> From: SeongJae Park 
>
> This commit adds documents for DAMON under
> `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.
>
> Signed-off-by: SeongJae Park 
> ---
>  Documentation/admin-guide/mm/damon/guide.rst | 157 ++
>  Documentation/admin-guide/mm/damon/index.rst |  15 +
>  Documentation/admin-guide/mm/damon/plans.rst |  29 ++
>  Documentation/admin-guide/mm/damon/start.rst |  98 ++
>  Documentation/admin-guide/mm/damon/usage.rst | 298 +++
>  Documentation/admin-guide/mm/index.rst   |   1 +
>  Documentation/vm/damon/api.rst   |  20 ++
>  Documentation/vm/damon/eval.rst  | 222 ++
>  Documentation/vm/damon/faq.rst   |  59 
>  Documentation/vm/damon/index.rst |  32 ++
>  Documentation/vm/damon/mechanisms.rst| 165 ++
>  Documentation/vm/index.rst   |   1 +
>  12 files changed, 1097 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/index.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/start.rst
>  create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
>  create mode 100644 Documentation/vm/damon/api.rst
>  create mode 100644 Documentation/vm/damon/eval.rst
>  create mode 100644 Documentation/vm/damon/faq.rst
>  create mode 100644 Documentation/vm/damon/index.rst
>  create mode 100644 Documentation/vm/damon/mechanisms.rst
>
> diff --git a/Documentation/admin-guide/mm/damon/guide.rst 
> b/Documentation/admin-guide/mm/damon/guide.rst
> new file mode 100644
> index ..c51fb843efaa
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/guide.rst
> @@ -0,0 +1,157 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==
> +Optimization Guide
> +==
> +
> +This document helps you estimating the amount of benefit that you could get
> +from DAMON-based optimizations, and describes how you could achieve it.  You
> +are assumed to already read :doc:`start`.
> +
> +
> +Check The Signs
> +===
> +
> +No optimization can provide same extent of benefit to every case.  Therefore
> +you should first guess how much improvements you could get using DAMON.  If
> +some of below conditions match your situation, you could consider using 
> DAMON.
> +
> +- *Low IPC and High Cache Miss Ratios.*  Low IPC means most of the CPU time 
> is
> +  spent waiting for the completion of time-consuming operations such as 
> memory
> +  access, while high cache miss ratios mean the caches don't help it well.
> +  DAMON is not for cache level optimization, but DRAM level.  However,
> +  improving DRAM management will also help this case by reducing the memory
> +  operation latency.
> +- *Memory Over-commitment and Unknown Users.*  If you are doing memory
> +  overcommitment and you cannot control every user of your system, a memory
> +  bank run could happen at any time.  You can estimate when it will happen
> +  based on DAMON's monitoring results and act earlier to avoid or deal better
> +  with the crisis.
> +- *Frequent Memory Pressure.*  Frequent memory pressure means your system has
> +  wrong configurations or memory hogs.  DAMON will help you find the right
> +  configuration and/or the criminals.
> +- *Heterogeneous Memory System.*  If your system is utilizing memory devices
> +  that placed between DRAM and traditional hard disks, such as non-volatile
> +  memory or fast SSDs, DAMON could help you utilizing the devices more
> +  efficiently.
> +
> +
> +Profile
> +===
> +
> +If you found some positive signals, you could start by profiling your 
> workloads
> +using DAMON.  Find major workloads on your systems and analyze their data
> +access pattern to find something wrong or can be improved.  The DAMON user
> +space tool (``damo``) will be useful for this.
> +
> +We recommend you to start from working set size distribution check using 
> ``damo
> +report wss``.  If the distribution is ununiform or quite different from what
> +you estimated, you could consider `Memory Configuration`_ optimization.
> +
> +Then, review the overall access pattern in heatmap form using ``damo report
> +heats``.  If it shows a simple pattern consists of a small number of memory
> +regions having high contrast of access temperature, you could consider manual
> +`Program Modification`_.
> +
> +If you still want to absorb more benefits, you should develop `Personalized
> +DAMON Application`_ for your special case.
> +
> +You don't need to take only one approach among the above plans, but you could
> +use multiple of the above approaches to maximize the benefit.
> +
> +
> +Optimize
> +
> +
> +If the profiling result also says it's worth trying some optimization, you
> +could consider below approaches.  Note that some of the below approaches 
> assume
> +that y

Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-07-18 Thread Greg Thelen

Oliver O'Halloran  wrote:

> On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen  wrote:
>>
>> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
>> configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
>> only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
>> CONFIG_IOMMU_API see:
>>   arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
>> 'pnv_ioda_setup_bus_dma' defined but not used
>>
>> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.
>
> Doh! Thanks for the fix.
>
> Reviewed-by: Oliver O'Halloran 

Is there anything else needed from me on this patch?
Given that it fixes a 5.8 commit I figured it'd be 5.8 material.

Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed

2020-07-01 Thread Greg Thelen

Dave Hansen  wrote:

> From: Keith Busch 
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the
dst page allocation, which might wander into
reclaim/cond_resched/oom_kill?  I don't have a deadlock in mind.  I'm
just wondering about the additional latency imposed on unrelated threads
who want access src page.

> #Signed-off-by: Keith Busch 

Is commented Signed-off-by intentional?  Same applies to later patches.

> Signed-off-by: Dave Hansen 
> Cc: Keith Busch 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> ---
>
>  b/mm/migrate.c |  148 
> -
>  1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 
> mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed 
> 2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
>   return rc;
>  }
>  
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> - int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> + free_page_t put_new_page,
> + unsigned long private, struct page *page,
> + enum migrate_mode mode,
> + enum migrate_reason reason)
>  {
>   int rc = -EAGAIN;
>   int page_was_mapped = 0;
>   struct anon_vma *anon_vma = NULL;
>   bool is_lru = !__PageMovable(page);
> -
> - if (!trylock_page(page)) {
> - if (!force || mode == MIGRATE_ASYNC)
> - goto out;
> -
> - /*
> -  * It's not safe for direct compaction to call lock_page.
> -  * For example, during page readahead pages are added locked
> -  * to the LRU. Later, when the IO completes the pages are
> -  * marked uptodate and unlocked. However, the queueing
> -  * could be merging multiple pages for one bio (e.g.
> -  * mpage_readpages). If an allocation happens for the
> -  * second or third page, the process can end up locking
> -  * the same page twice and deadlocking. Rather than
> -  * trying to be clever about what pages can be locked,
> -  * avoid the use of lock_page for direct compaction
> -  * altogether.
> -  */
> - if (current->flags & PF_MEMALLOC)
> - goto out;
> -
> - lock_page(page);
> - }
> -
> - if (PageWriteback(page)) {
> - /*
> -  * Only in the case of a full synchronous migration is it
> -  * necessary to wait for PageWriteback. In the async case,
> -  * the retry loop is too short and in the sync-light case,
> -  * the overhead of stalling is too much
> -  */
> - switch (mode) {
> - case MIGRATE_SYNC:
> - case MIGRATE_SYNC_NO_COPY:
> - break;
> - default:
> - rc = -EBUSY;
> - goto out_unlock;
> - }
> - if (!force)
> - goto out_unlock;
> - wait_on_page_writeback(page);
> - }
> + struct page *newpage;
>  
>   /*
>* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
>   if (PageAnon(page) && !PageKsm(page))
>   anon_vma = page_get_anon_vma(page);
>  
> + newpage = get_new_page(page, private);
> + if (!newpage) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
>   /*
>* Block others from accessing the new page when we get around to
>* establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
>* This is much like races on refcount of oldpage: just don't BUG().
>*/
>   if (unlikely(!trylock_page(newpage)))
> - goto out_unlock;
> + goto out_put;
>  
>   if (unlikely(!is_lru)) {
>   rc = move_to_new_page(newpage, page, mode);
> - goto out_unlock_both;
> + goto out_unlock;
>   }
>  
>   /*
> @@ -1114,7 +1081,7 @@

[PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-06-14 Thread Greg Thelen

Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
'pnv_ioda_setup_bus_dma' defined but not used

Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..743d840712da 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
-{
-   struct pci_dev *dev;
-
-   list_for_each_entry(dev, &bus->devices, bus_list) {
-   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
-   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
-
-   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
-   }
-}
-
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
 {
@@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
 #endif
 
 #ifdef CONFIG_IOMMU_API
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
+   dev->dev.archdata.dma_offset = pe->tce_bypass_base;
+
+   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
+   }
+}
+
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
__u64 window_size, __u32 levels)
 {
-- 
2.27.0.290.gba653c62da-goog

[PATCH] e1000e: add ifdef to avoid dead code

2020-06-13 Thread Greg Thelen

Commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME
systems") added e1000e_check_me() but it's only called from
CONFIG_PM_SLEEP protected code.  Thus builds without CONFIG_PM_SLEEP
see:
  drivers/net/ethernet/intel/e1000e/netdev.c:137:13: warning: 'e1000e_check_me' 
defined but not used [-Wunused-function]

Add CONFIG_PM_SLEEP ifdef guard to avoid dead code.

Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems")
Signed-off-by: Greg Thelen 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index a279f4fa9962..165f0aea22c9 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -107,6 +107,7 @@ static const struct e1000_reg_info e1000_reg_info_tbl[] = {
{0, NULL}
 };
 
+#ifdef CONFIG_PM_SLEEP
 struct e1000e_me_supported {
u16 device_id;  /* supported device ID */
 };
@@ -145,6 +146,7 @@ static bool e1000e_check_me(u16 device_id)
 
return false;
 }
+#endif /* CONFIG_PM_SLEEP */
 
 /**
  * __ew32_prepare - prepare to write to MAC CSR register on certain parts
-- 
2.27.0.290.gba653c62da-goog

[PATCH] powerpc/powernv/pci: add ifdef to avoid dead code

2020-06-13 Thread Greg Thelen

Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
CONFIG_IOMMU_API see:
  arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
'pnv_ioda_setup_bus_dma' defined but not used

Add CONFIG_IOMMU_API ifdef guard to avoid dead code.

Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration")
Signed-off-by: Greg Thelen 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 73a63efcf855..f7762052b7c4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1885,6 +1885,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct 
pci_dev *pdev,
return false;
 }
 
+#ifdef CONFIG_IOMMU_API
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
struct pci_dev *dev;
@@ -1897,6 +1898,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe 
*pe, struct pci_bus *bus)
pnv_ioda_setup_bus_dma(pe, dev->subordinate);
}
 }
+#endif
 
 static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
 bool real_mode)
-- 
2.27.0.290.gba653c62da-goog

Re: [PATCH] shmem, memcg: enable memcg aware shrinker

2020-06-04 Thread Greg Thelen

Yang Shi  wrote:

> On Sun, May 31, 2020 at 8:22 PM Greg Thelen  wrote:
>>
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>> echo . > $FS/$i
>>   done
>
> It looks all the inodes which have tail THP beyond i_size are on one
> single list, then the shrinker actually just splits the first
> nr_to_scan inodes. But since the list is not memcg aware, so it seems
> it may split the THPs which are not charged to the victim memcg and
> the victim memcg still may suffer from pre-mature oom, right?

Correct.  shmem_unused_huge_shrink() is not memcg aware.  In response to
memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs
inodes regardless of if they're charged to the under-pressure memcg.
do_shrink_slab() looks like it'll repeatedly call
shmem_unused_huge_shrink().  So it will split tails of many inodes.  So
I think it'll avoid the oom by over shrinking.  This is not ideal.  But
it seems better than undeserved oom kill.

I think the solution (as Kirill Tkhai suggested) a memcg-aware index
would solve both:
1) avoid premature oom by registering shrinker to responding to memcg
   pressure
2) avoid shrinking/splitting inodes unrelated to the under-pressure
   memcg

I can certainly look into that (thanks Kirill for the pointers).  In the
short term I'm still interested in avoiding premature OOMs with the
original thread (i.e. restore pre-4.19 behavior to shmem shrinker for
memcg pressure).  I plan to test and repost v2.

Re: [PATCH] shmem, memcg: enable memcg aware shrinker

2020-06-01 Thread Greg Thelen

Kirill Tkhai  wrote:

> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>> 
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness.  The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>> 
>> Shmem does not currently set bits in shrinker_map.  So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure.  This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB.  But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>> echo . > $FS/$i
>>   done
>> 
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker.  But there's no such logic in shmem.  Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>> 
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers 
>> during memcg shrink_slab()")
>> Cc:  # 4.19+
>> Signed-off-by: Greg Thelen 
>> ---
>>  mm/shmem.c | 61 +++---
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, 
>> struct kstat *stat,
>>  return 0;
>>  }
>>  
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly 
>> splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +   struct inode *inode, struct page *page)
>> +{
>> +struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +spin_lock(&sbinfo->shrinklist_lock);
>> +/*
>> + * _careful to defend against unlocked access to ->shrink_list in
>> + * shmem_unused_huge_shrink()
>> + */
>> +if (list_empty_careful(&info->shrinklist)) {
>> +list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +sbinfo->shrinklist_len++;
>> +}
>> +spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +if (page && PageTransHuge(page))
>> +memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +   inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, 
>> struct iattr *attr)
>>   * to shrink under memory pressure.
>>   */
>>  if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -spin_lock(&sbinfo->shrinklist_lock);
>> -/*
>> - * _careful to defend against unlocked access to
>> - * ->shrink_list in shmem_unused_huge_shrink()
>> - */
>> -if (list_empty_careful(&info->shrinklist)) {
>> -list_add_tail(&info->shrinkl

[PATCH] shmem, memcg: enable memcg aware shrinker

2020-05-31 Thread Greg Thelen

Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
called when the per-memcg per-node shrinker_map indicates that the
shrinker may have objects to release to the memcg and node.

shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
shrinker which advertises per memcg and numa awareness.  The shmem
shrinker releases memory by splitting hugepages that extend beyond
i_size.

Shmem does not currently set bits in shrinker_map.  So, starting with
b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
pressure.  This leads to undeserved memcg OOM kills.
Example that reliably sees memcg OOM kill in unpatched kernel:
  FS=/tmp/fs
  CONTAINER=/cgroup/memory/tmpfs_shrinker
  mkdir -p $FS
  mount -t tmpfs -o huge=always nodev $FS
  # Create 1000 MB container, which shouldn't suffer OOM.
  mkdir $CONTAINER
  echo 1000M > $CONTAINER/memory.limit_in_bytes
  echo $BASHPID >> $CONTAINER/cgroup.procs
  # Create 4000 files.  Ideally each file uses 4k data page + a little
  # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
  # fit within container's 1000 MB.  But if data pages use 2MB
  # hugepages (due to aggressive huge=always) then files consume 8GB,
  # which hits memcg 1000 MB limit.
  for i in {1..4000}; do
echo . > $FS/$i
  done

v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
aware") maintains the per-node per-memcg shrinker bitmap for THP
shrinker.  But there's no such logic in shmem.  Make shmem set the
per-memcg per-node shrinker bits when it modifies inodes to have
shrinkable pages.

Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during 
memcg shrink_slab()")
Cc:  # 4.19+
Signed-off-by: Greg Thelen 
---
 mm/shmem.c | 61 +++---
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index bd8840082c94..e11090f78cb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct 
kstat *stat,
return 0;
 }
 
+/*
+ * Expose inode and optional page to shrinker as having a possibly splittable
+ * hugepage that reaches beyond i_size.
+ */
+static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
+  struct inode *inode, struct page *page)
+{
+   struct shmem_inode_info *info = SHMEM_I(inode);
+
+   spin_lock(&sbinfo->shrinklist_lock);
+   /*
+* _careful to defend against unlocked access to ->shrink_list in
+* shmem_unused_huge_shrink()
+*/
+   if (list_empty_careful(&info->shrinklist)) {
+   list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
+   sbinfo->shrinklist_len++;
+   }
+   spin_unlock(&sbinfo->shrinklist_lock);
+
+#ifdef CONFIG_MEMCG
+   if (page && PageTransHuge(page))
+   memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
+  inode->i_sb->s_shrink.id);
+#endif
+}
+
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
struct inode *inode = d_inode(dentry);
@@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct 
iattr *attr)
 * to shrink under memory pressure.
 */
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-   spin_lock(&sbinfo->shrinklist_lock);
-   /*
-* _careful to defend against unlocked access to
-* ->shrink_list in shmem_unused_huge_shrink()
-*/
-   if (list_empty_careful(&info->shrinklist)) {
-   list_add_tail(&info->shrinklist,
-   &sbinfo->shrinklist);
-   sbinfo->shrinklist_len++;
-   }
-   spin_unlock(&sbinfo->shrinklist_lock);
+   struct page *page;
+
+   page = find_get_page(inode->i_mapping,
+   (newsize & HPAGE_PMD_MASK) >> 
PAGE_SHIFT);
+   shmem_shrinker_add(sbinfo, inode, page);
+   if (page)
+   put_page(page);
}
}
}
@@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, 
pgoff_t index,
if (PageTransHuge(page) &&
DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <

[PATCH] kbuild: clean compressed initramfs image

2019-07-21 Thread Greg Thelen

Since commit 9e3596b0c653 ("kbuild: initramfs cleanup, set target from
Kconfig") "make clean" leaves behind compressed initramfs images.
Example:
  $ make defconfig
  $ sed -i 
's|CONFIG_INITRAMFS_SOURCE=""|CONFIG_INITRAMFS_SOURCE="/tmp/ir.cpio"|' .config
  $ make olddefconfig
  $ make -s
  $ make -s clean
  $ git clean -ndxf | grep initramfs
  Would remove usr/initramfs_data.cpio.gz

clean rules do not have CONFIG_* context so they do not know which
compression format was used.  Thus they don't know which files to
delete.

Tell clean to delete all possible compression formats.

Once patched usr/initramfs_data.cpio.gz and friends are deleted by
"make clean".

Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig")
Signed-off-by: Greg Thelen 
---
 usr/Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/usr/Makefile b/usr/Makefile
index 6a89eb019275..e6f7cb2f81db 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -11,6 +11,9 @@ datafile_y = initramfs_data.cpio$(suffix_y)
 datafile_d_y = .$(datafile_y).d
 AFLAGS_initramfs_data.o += -DINITRAMFS_IMAGE="usr/$(datafile_y)"
 
+# clean rules do not have CONFIG_INITRAMFS_COMPRESSION.  So clean up after all
+# possible compression formats.
+clean-files += initramfs_data.cpio*
 
 # Generate builtin.o based on initramfs_data.o
 obj-$(CONFIG_BLK_DEV_INITRD) := initramfs_data.o
-- 
2.22.0.657.g960e92d24f-goog

Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal

2019-06-05 Thread Greg Thelen

Roman Gushchin  wrote:

> # Why do we need this?
>
> We've noticed that the number of dying cgroups is steadily growing on most
> of our hosts in production. The following investigation revealed an issue
> in userspace memory reclaim code [1], accounting of kernel stacks [2],
> and also the mainreason: slab objects.
>
> The underlying problem is quite simple: any page charged
> to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> all charged pages are gone. If a slab object is actively used by other 
> cgroups,
> it won't be reclaimed, and will prevent the origin cgroup from being 
> reclaimed.
>
> Slab objects, and first of all vfs cache, is shared between cgroups, which are
> using the same underlying fs, and what's even more important, it's shared
> between multiple generations of the same workload. So if something is running
> periodically every time in a new cgroup (like how systemd works), we do
> accumulate multiple dying cgroups.
>
> Strictly speaking pagecache isn't different here, but there is a key 
> difference:
> we disable protection and apply some extra pressure on LRUs of dying cgroups,
> and these LRUs contain all charged pages.
> My experiments show that with the disabled kernel memory accounting the number
> of dying cgroups stabilizes at a relatively small number (~100, depends on
> memory pressure and cgroup creation rate), and with kernel memory accounting
> it grows pretty steadily up to several thousands.
>
> Memory cgroups are quite complex and big objects (mostly due to percpu stats),
> so it leads to noticeable memory losses. Memory occupied by dying cgroups
> is measured in hundreds of megabytes. I've even seen a host with more than 
> 100Gb
> of memory wasted for dying cgroups. It leads to a degradation of performance
> with the uptime, and generally limits the usage of cgroups.
>
> My previous attempt [3] to fix the problem by applying extra pressure on slab
> shrinker lists caused a regressions with xfs and ext4, and has been reverted 
> [4].
> The following attempts to find the right balance [5, 6] were not successful.
>
> So instead of trying to find a maybe non-existing balance, let's do reparent
> the accounted slabs to the parent cgroup on cgroup removal.
>
>
> # Implementation approach
>
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are in shrinker lists,
> but not all. Introducing of a new list is really not an option.
>
> But fortunately there is a way forward: every slab page has a stable pointer
> to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> instead of slab pages.
>
> It's actually simpler and cheaper, but requires some underlying changes:
> 1) Make kmem_caches to hold a single reference to the memory cgroup,
>instead of a separate reference per every slab page.
> 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
>page->kmem_cache->memcg indirection instead. It's used only on
>slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
>be able to destroy kmem_caches when they become empty and release
>the associated memory cgroup.
>
> There is a bonus: currently we do release empty kmem_caches on cgroup
> removal, however all other are waiting for the releasing of the memory cgroup.
> These refactorings allow kmem_caches to be released as soon as they
> become inactive and free.
>
> Some additional implementation details are provided in corresponding
> commit messages.
>
> # Results
>
> Below is the average number of dying cgroups on two groups of our production
> hosts. They do run some sort of web frontend workload, the memory pressure
> is moderate. As we can see, with the kernel memory reparenting the number
> stabilizes in 60s range; however with the original version it grows almost
> linearly and doesn't show any signs of plateauing. The difference in slab
> and percpu usage between patched and unpatched versions also grows linearly.
> In 7 days it exceeded 200Mb.
>
> day   01234567
> original 56  362  628  752 1070 1250 1490 1560
> patched  23   46   51   55   60   57   67   69
> mem diff(Mb) 22   74  123  152  164  182  214  241

No objection to the idea, but a question...

In patched kernel, does slabinfo (or similar) show the list reparented
slab caches?  A pile of zombie kmem_caches is certainly better than a
pile of zombie mem_cgroup.  But it still seems like it'll might cause
degradation - does cache_reap() walk an ever growing set of zombie
caches?

We've found it useful to add a slabinfo_full file which includes zombie
kmem_cache with their memcg_name.  This can help hunt down zombies.

> # History
>
> v4:
>   1) removed excessive memcg != parent check in memcg_deactivate_kmem_caches()
>   2) fixed rcu_read_lock() usage in memcg_charge_slab()
>   3) fixed synchronization aroun

Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-29 Thread Greg Thelen

Johannes Weiner  wrote:

> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3880,6 +3880,7 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
>> bdi_writeback *wb)
>>   * @pheadroom: out parameter for number of allocatable pages according to 
>> memcg
>>   * @pdirty: out parameter for number of dirty pages
>>   * @pwriteback: out parameter for number of pages under writeback
>> + * @exact: determines exact counters are required, indicates more work.
>>   *
>>   * Determine the numbers of file, headroom, dirty, and writeback pages in
>>   * @wb's memcg.  File, dirty and writeback are self-explanatory.  Headroom
>> @@ -3890,18 +3891,29 @@ struct wb_domain *mem_cgroup_wb_domain(struct 
>> bdi_writeback *wb)
>>   * ancestors.  Note that this doesn't consider the actual amount of
>>   * available memory in the system.  The caller should further cap
>>   * *@pheadroom accordingly.
>> + *
>> + * Return value is the error precision associated with *@pdirty
>> + * and *@pwriteback.  When @exact is set this a minimal value.
>>   */
>> -void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long 
>> *pfilepages,
>> - unsigned long *pheadroom, unsigned long *pdirty,
>> - unsigned long *pwriteback)
>> +unsigned long
>> +mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>> +unsigned long *pheadroom, unsigned long *pdirty,
>> +unsigned long *pwriteback, bool exact)
>>  {
>>  struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>>  struct mem_cgroup *parent;
>> +unsigned long precision;
>>  
>> -*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> -
>> +if (exact) {
>> +precision = 0;
>> +*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
>> +*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
>> +} else {
>> +precision = MEMCG_CHARGE_BATCH * num_online_cpus();
>> +*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>> +*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>> +}
>>  /* this should eventually include NR_UNSTABLE_NFS */
>> -*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
>>  *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
>>   (1 << LRU_ACTIVE_FILE));
>>  *pheadroom = PAGE_COUNTER_MAX;
>> @@ -3913,6 +3925,8 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, 
>> unsigned long *pfilepages,
>>  *pheadroom = min(*pheadroom, ceiling - min(ceiling, used));
>>  memcg = parent;
>>  }
>> +
>> +return precision;
>
> Have you considered unconditionally using the exact version here?
>
> It does for_each_online_cpu(), but until very, very recently we did
> this per default for all stats, for years. It only became a problem in
> conjunction with the for_each_memcg loops when frequently reading
> memory stats at the top of a very large hierarchy.
>
> balance_dirty_pages() is called against memcgs that actually own the
> inodes/memory and doesn't do the additional recursive tree collection.
>
> It's also not *that* hot of a function, and in the io path...
>
> It would simplify this patch immensely.

Good idea.  Done in -v2 of the patch:
https://lore.kernel.org/lkml/20190329174609.164344-1-gthe...@google.com/

Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-29 Thread Greg Thelen

Andrew Morton  wrote:

> On Thu,  7 Mar 2019 08:56:32 -0800 Greg Thelen  wrote:
>
>> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
>> memory.stat reporting") memcg dirty and writeback counters are managed
>> as:
>> 1) per-memcg per-cpu values in range of [-32..32]
>> 2) per-memcg atomic counter
>> When a per-cpu counter cannot fit in [-32..32] it's flushed to the
>> atomic.  Stat readers only check the atomic.
>> Thus readers such as balance_dirty_pages() may see a nontrivial error
>> margin: 32 pages per cpu.
>> Assuming 100 cpus:
>>4k x86 page_size:  13 MiB error per memcg
>>   64k ppc page_size: 200 MiB error per memcg
>> Considering that dirty+writeback are used together for some decisions
>> the errors double.
>> 
>> This inaccuracy can lead to undeserved oom kills.  One nasty case is
>> when all per-cpu counters hold positive values offsetting an atomic
>> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
>> balance_dirty_pages() only consults the atomic and does not consider
>> throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
>> 13..200 MiB range then there's absolutely no dirty throttling, which
>> burdens vmscan with only dirty+writeback pages thus resorting to oom
>> kill.
>> 
>> It could be argued that tiny containers are not supported, but it's more
>> subtle.  It's the amount the space available for file lru that matters.
>> If a container has memory.max-200MiB of non reclaimable memory, then it
>> will also suffer such oom kills on a 100 cpu machine.
>> 
>> ...
>> 
>> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
>> collect exact per memcg counters when a memcg is close to the
>> throttling/writeback threshold.  This avoids the aforementioned oom
>> kills.
>> 
>> This does not affect the overhead of memory.stat, which still reads the
>> single atomic counter.
>> 
>> Why not use percpu_counter?  memcg already handles cpus going offline,
>> so no need for that overhead from percpu_counter.  And the
>> percpu_counter spinlocks are more heavyweight than is required.
>> 
>> It probably also makes sense to include exact dirty and writeback
>> counters in memcg oom reports.  But that is saved for later.
>
> Nice changelog, thanks.
>
>> Signed-off-by: Greg Thelen 
>
> Did you consider cc:stable for this?  We may as well - the stablebots
> backport everything which might look slightly like a fix anyway :(

Good idea.  Done in -v2 of the patch.

>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct 
>> mem_cgroup *memcg,
>>  return x;
>>  }
>>  
>> +/* idx can be of type enum memcg_stat_item or node_stat_item */
>> +static inline unsigned long
>> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
>> +{
>> +long x = atomic_long_read(&memcg->stat[idx]);
>> +#ifdef CONFIG_SMP
>> +int cpu;
>> +
>> +for_each_online_cpu(cpu)
>> +x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
>> +if (x < 0)
>> +x = 0;
>> +#endif
>> +return x;
>> +}
>
> This looks awfully heavyweight for an inline function.  Why not make it
> a regular function and avoid the bloat and i-cache consumption?

Done in -v2.

> Also, did you instead consider making this spill the percpu counters
> into memcg->stat[idx]?  That might be more useful for potential future
> callers.  It would become a little more expensive though.

I looked at that approach, but couldn't convince myself it was safe.  I
kept staring at "Remote [...] Write accesses can cause unique problems
due to the relaxed synchronization requirements for this_cpu
operations." from this_cpu_ops.txt.  So I'd like to delay this possible
optimization for a later patch.

[PATCH v2] writeback: use exact memcg dirty counts

2019-03-29 Thread Greg Thelen

Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.
Thus readers such as balance_dirty_pages() may see a nontrivial error
margin: 32 pages per cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
cpu_set_t affinity;

CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
int i, wrote;

set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
  }

  int main(int argc, char **argv)
  {
int cpu, flush_cpu = 1, output_fd;
const char *output;

if (argc != 2)
errx(1, "usage: output_file");

output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");

output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);

for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}

set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.  But that is saved for later.

Cc: sta...@vger.kernel.org # v4.16+
Signed-off-by: Greg Thelen 
---
Changelog since v1:
- Move memcg_exact_page_state() into memcontrol.c.
- Unconditionally gather exact (per cpu) counters in mem_cgroup_wb_stats(), it's
  not called in performance sensitive paths.
- Unconditionally che

Re: [PATCH] writeback: sum memcg dirty counters as needed

2019-03-27 Thread Greg Thelen

On Fri, Mar 22, 2019 at 11:15 AM Roman Gushchin  wrote:
>
> On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote:
> > Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> > memory.stat reporting") memcg dirty and writeback counters are managed
> > as:
> > 1) per-memcg per-cpu values in range of [-32..32]
> > 2) per-memcg atomic counter
> > When a per-cpu counter cannot fit in [-32..32] it's flushed to the
> > atomic.  Stat readers only check the atomic.
> > Thus readers such as balance_dirty_pages() may see a nontrivial error
> > margin: 32 pages per cpu.
> > Assuming 100 cpus:
> >4k x86 page_size:  13 MiB error per memcg
> >   64k ppc page_size: 200 MiB error per memcg
> > Considering that dirty+writeback are used together for some decisions
> > the errors double.
> >
> > This inaccuracy can lead to undeserved oom kills.  One nasty case is
> > when all per-cpu counters hold positive values offsetting an atomic
> > negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
> > balance_dirty_pages() only consults the atomic and does not consider
> > throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
> > 13..200 MiB range then there's absolutely no dirty throttling, which
> > burdens vmscan with only dirty+writeback pages thus resorting to oom
> > kill.
> >
> > It could be argued that tiny containers are not supported, but it's more
> > subtle.  It's the amount the space available for file lru that matters.
> > If a container has memory.max-200MiB of non reclaimable memory, then it
> > will also suffer such oom kills on a 100 cpu machine.
> >
> > The following test reliably ooms without this patch.  This patch avoids
> > oom kills.
> >
> > ...
> >
> > Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
> > collect exact per memcg counters when a memcg is close to the
> > throttling/writeback threshold.  This avoids the aforementioned oom
> > kills.
> >
> > This does not affect the overhead of memory.stat, which still reads the
> > single atomic counter.
> >
> > Why not use percpu_counter?  memcg already handles cpus going offline,
> > so no need for that overhead from percpu_counter.  And the
> > percpu_counter spinlocks are more heavyweight than is required.
> >
> > It probably also makes sense to include exact dirty and writeback
> > counters in memcg oom reports.  But that is saved for later.
> >
> > Signed-off-by: Greg Thelen 
> > ---
> >  include/linux/memcontrol.h | 33 +
> >  mm/memcontrol.c| 26 --
> >  mm/page-writeback.c| 27 +--
> >  3 files changed, 66 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 83ae11cbd12c..6a133c90138c 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct 
> > mem_cgroup *memcg,
> >   return x;
> >  }
>
> Hi Greg!
>
> Thank you for the patch, definitely a good problem to be fixed!
>
> >
> > +/* idx can be of type enum memcg_stat_item or node_stat_item */
> > +static inline unsigned long
> > +memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
> > +{
> > + long x = atomic_long_read(&memcg->stat[idx]);
> > +#ifdef CONFIG_SMP
>
> I doubt that this #ifdef is correct without corresponding changes
> in __mod_memcg_state(). As now, we do use per-cpu buffer which spills
> to an atomic value event if !CONFIG_SMP. It's probably something
> that we want to change, but as now, #ifdef CONFIG_SMP should protect
> only "if (x < 0)" part.

Ack.  I'll fix it.

> > + int cpu;
> > +
> > + for_each_online_cpu(cpu)
> > + x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
> > + if (x < 0)
> > + x = 0;
> > +#endif
> > + return x;
> > +}
>
> Also, isn't it worth it to generalize memcg_page_state() instead?
> By adding an bool exact argument? I believe dirty balance is not
> the only place, where we need a better accuracy.

Nod.  I'll provide a more general version of memcg_page_state().  I'm
testing updated (forthcoming v2) patch set now with feedback from
Andrew and Roman.

[PATCH] writeback: sum memcg dirty counters as needed

2019-03-07 Thread Greg Thelen

Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.
Thus readers such as balance_dirty_pages() may see a nontrivial error
margin: 32 pages per cpu.
Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions
the errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 
  #include 

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
cpu_set_t affinity;

CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
int i, wrote;

set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
  }

  int main(int argc, char **argv)
  {
int cpu, flush_cpu = 1, output_fd;
const char *output;

if (argc != 2)
errx(1, "usage: output_file");

output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");

output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);

for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}

set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters when a memcg is close to the
throttling/writeback threshold.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter?  memcg already handles cpus going offline,
so no need for that overhead from percpu_counter.  And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to include exact dirty and writeback
counters in memcg oom reports.  But that is saved for later.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h | 33 +
 mm/memcontrol.c| 26 --
 mm/page-writeback.c| 27 +--
 3 fi

[PATCH] writeback: fix inode cgroup switching comment

2019-03-04 Thread Greg Thelen

Commit 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates") refers to
inode_switch_wb_work_fn() which never got merged.  Switch the comments
to inode_switch_wbs_work_fn().

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and 
use it for stat updates")
Signed-off-by: Greg Thelen 
---
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c28a47cbe355..f9b029180241 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -365,7 +365,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, struct 
wb_lock_cookie *cookie)
rcu_read_lock();
 
/*
-* Paired with store_release in inode_switch_wb_work_fn() and
+* Paired with store_release in inode_switch_wbs_work_fn() and
 * ensures that we see the new wb if we see cleared I_WB_SWITCH.
 */
cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd423fec8d83..08f26046233e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2091,7 +2091,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
struct file *filp)
  * I_WB_SWITCH Cgroup bdi_writeback switching in progress.  Used to
  * synchronize competing switching instances and to tell
  * wb stat updates to grab the i_pages lock.  See
- * inode_switch_wb_work_fn() for details.
+ * inode_switch_wbs_work_fn() for details.
  *
  * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
  * and work dirs among overlayfs mounts.
-- 
2.21.0.352.gf09ad66450-goog

Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

2019-01-04 Thread Greg Thelen

Yang Shi  wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
 On Thu 03-01-19 10:40:54, Yang Shi wrote:
> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>> Is there any reason for your scripts to be strictly sequential here? In
>> other words why cannot you offload those expensive operations to a
>> detached context in _userspace_?
> I would say it has not to be strictly sequential. The above script is just
> an example to illustrate the pattern. But, sometimes it may hit such 
> pattern
> due to the complicated cluster scheduling and container scheduling in the
> production environment, for example the creation process might be 
> scheduled
> to the same CPU which is doing force_empty. I have to say I don't know too
> much about the internals of the container scheduling.
 In that case I do not see a strong reason to implement the offloding
 into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be 
> offloaded from userspace *trivially*. IOWs the container scheduler may 
> unexpectedly overcommit something due to the stall of synchronous force 
> empty, which can't be figured out by userspace before it actually 
> happens. The scheduler doesn't know how long force_empty would take. If 
> the force_empty could be offloaded by kernel, it would make scheduler's 
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seem like the time to offline could grow.  I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir then
user space scheduler would not be unable to immediately assume the
rmdir'd container memory is available without subjecting a new container
to direct reclaim.  So it seems like user space would use a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc waiting for free memory to appear.

 I think it is more important to discuss whether we want to introduce
 force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang

Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-07-03 Thread Greg Thelen

Michal Hocko  wrote:

> On Tue 03-07-18 00:08:05, Greg Thelen wrote:
>> Michal Hocko  wrote:
>> 
>> > On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> >> Michal Hocko  wrote:
>> >> 
>> >> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> >> Michal Hocko  wrote:
>> >> > [...]
>> >> >> > +if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> >> > +return OOM_SUCCESS;
>> >> >> > +
>> >> >> > +WARN(1,"Memory cgroup charge failed because of no reclaimable 
>> >> >> > memory! "
>> >> >> > +"This looks like a misconfiguration or a kernel bug.");
>> >> >> 
>> >> >> I'm not sure here if the warning should here or so strongly worded.  It
>> >> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> >> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> >> >> alarming in that case.
>> >> >
>> >> > If the task is reaped then its charges should be released as well and
>> >> > that means that we should get below the limit. Sure there is some room
>> >> > for races but this should be still unlikely. Maybe I am just
>> >> > underestimating though.
>> >> >
>> >> > What would you suggest instead?
>> >> 
>> >> I suggest checking MMF_OOM_SKIP or deleting the warning.
>> >
>> > So what do you do when you have MMF_OOM_SKIP task? Do not warn? Checking
>> > for all the tasks would be quite expensive and remembering that from the
>> > task selection not nice either. Why do you think it would help much?
>> 
>> I assume we could just check current's MMF_OOM_SKIP - no need to check
>> all tasks.
>
> I still do not follow. If you are after a single task memcg then we
> should be ok. try_charge has a runaway for oom victims
>   if (unlikely(tsk_is_oom_victim(current) ||
>fatal_signal_pending(current) ||
>current->flags & PF_EXITING))
>   goto force;
>
> regardless of MMF_OOM_SKIP. So if there is a single process in the
> memcg, we kill it and the oom reaper kicks in and sets MMF_OOM_SKIP then
> we should bail out there. Or do I miss your intention?

For a single task memcg it seems that racing process cgroup migration
could trigger the new warning (I have attempted to reproduce this):

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]   Process B[M2]

  M1 is oom
  try_charge(M1)
  Move A M1=>M2
  mem_cgroup_oom()
  mem_cgroup_out_of_memory()
out_of_memory()
  select_bad_process()
sees nothing in M1
  return 0
return 0
  WARN()


Another variant might be possible, this time with global oom:

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]   Process B[M2]

  try_charge()
  trigger global oom
  reaper sets A.MMF_OOM_SKIP
  mem_cgroup_oom()
  mem_cgroup_out_of_memory()
out_of_memory()
  select_bad_process()
sees nothing in M1
  return 0
return 0
  WARN()


These seem unlikely, so I'm fine with taking a wait-and-see approach.

Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-07-03 Thread Greg Thelen

Michal Hocko  wrote:

> On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> Michal Hocko  wrote:
>> 
>> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> Michal Hocko  wrote:
>> > [...]
>> >> > +   if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> > +   return OOM_SUCCESS;
>> >> > +
>> >> > +   WARN(1,"Memory cgroup charge failed because of no reclaimable 
>> >> > memory! "
>> >> > +   "This looks like a misconfiguration or a kernel bug.");
>> >> 
>> >> I'm not sure here if the warning should here or so strongly worded.  It
>> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> >> alarming in that case.
>> >
>> > If the task is reaped then its charges should be released as well and
>> > that means that we should get below the limit. Sure there is some room
>> > for races but this should be still unlikely. Maybe I am just
>> > underestimating though.
>> >
>> > What would you suggest instead?
>> 
>> I suggest checking MMF_OOM_SKIP or deleting the warning.
>
> So what do you do when you have MMF_OOM_SKIP task? Do not warn? Checking
> for all the tasks would be quite expensive and remembering that from the
> task selection not nice either. Why do you think it would help much?

I assume we could just check current's MMF_OOM_SKIP - no need to check
all tasks.  My only (minor) objection is that the warning text suggests
misconfiguration or kernel bug, when there may be neither.

> I feel strongly that we have to warn when bypassing the charge limit
> during the corner case because it can lead to unexpected behavior and
> users should be aware of this fact. I am open to the wording or some
> optimizations. I would prefer the latter on top with a clear description
> how it helped in a particular case though. I would rather not over
> optimize now without any story to back it.

I'm fine with the warning.  I know enough to look at dmesg logs to take
an educates that the race occurred.  We can refine it later if/when the
reports start rolling in.  No change needed.

Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-06-29 Thread Greg Thelen

Michal Hocko  wrote:

> On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> Michal Hocko  wrote:
> [...]
>> > +  if (mem_cgroup_out_of_memory(memcg, mask, order))
>> > +  return OOM_SUCCESS;
>> > +
>> > +  WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
>> > +  "This looks like a misconfiguration or a kernel bug.");
>> 
>> I'm not sure here if the warning should here or so strongly worded.  It
>> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> alarming in that case.
>
> If the task is reaped then its charges should be released as well and
> that means that we should get below the limit. Sure there is some room
> for races but this should be still unlikely. Maybe I am just
> underestimating though.
>
> What would you suggest instead?

I suggest checking MMF_OOM_SKIP or deleting the warning.  But I don't
feel strongly.

Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

2018-06-28 Thread Greg Thelen

Michal Hocko  wrote:

> From: Michal Hocko 
>
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is is OOM. This is behavior is
> inconsistent with !memcg case where the oom killer is invoked from the
> allocation context and the allocator keeps retrying until it succeeds.
>
> The difference in the behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
>
> The primary motivation of the different memcg oom semantic was the
> deadlock avoidance. Things have changed since then, though. We have
> an async oom teardown by the oom reaper now and so we do not have to
> rely on the victim to tear down its memory anymore. Therefore we can
> return to the original semantic as long as the memcg oom killer is not
> handed over to the users space.
>
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent from same class of deadlocks. We have basically two options
> here. Either we fail the charge with ENOMEM or force the charge and
> allow overcharge. The first option has been considered more harmful than
> useful because rare inconsistencies in the ENOMEM behavior is hard to
> test for and error prone. Basically the same reason why the page
> allocator doesn't fail allocations under such conditions. The later
> might allow runaways but those should be really unlikely unless somebody
> misconfigures the system. E.g. allowing to migrate tasks away from the
> memcg to a different unlimited memcg with move_charge_at_immigrate
> disabled.
>
> Changes since rfc v1
> - s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much more
>   clear what is the purpose of the flag now
> - s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g
>   s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes
> - make oom_kill_disable an exceptional case because it should be rare
>   and the normal oom handling a core of the function - per Johannes
>
> Signed-off-by: Michal Hocko 

Acked-by: Greg Thelen 

Thanks!  One comment below.

> ---
>
> Hi,
> I've posted this as an RFC previously [1]. There was no fundamental
> disagreement so I've integrated all the suggested changes and tested it.
> mmap(MAP_POPULATE) hits the oom killer again rather than silently fails
> to populate the mapping on the hard limit excess. On the other hand
> g-u-p and other charge path keep the ENOMEM semantic when the memcg oom
> killer is disabled. All the forward progress guarantee relies on the oom
> reaper.
>
> Unless there are objections I think this is ready to go to mmotm and
> ready for the next merge window
>
> [1] http://lkml.kernel.org/r/20180620103736.13880-1-mho...@kernel.org
>  include/linux/memcontrol.h | 16 
>  include/linux/sched.h  |  2 +-
>  mm/memcontrol.c| 75 ++
>  mm/memory.c|  4 +-
>  4 files changed, 71 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6c6fb116e925..5a69bb4026f6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup 
> *memcg);
>  void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>   struct task_struct *p);
>  
> -static inline void mem_cgroup_oom_enable(void)
> +static inline void mem_cgroup_enter_user_fault(void)
>  {
> - WARN_ON(current->memcg_may_oom);
> - current->memcg_may_oom = 1;
> + WARN_ON(current->in_user_fault);
> + current->in_user_fault = 1;
>  }
>  
> -static inline void mem_cgroup_oom_disable(void)
> +static inline void mem_cgroup_exit_user_fault(void)
>  {
> - WARN_ON(!current->memcg_may_oom);
> - current->memcg_may_oom = 0;
> + WARN_ON(!current->in_user_fault);
> + current->in_user_fault = 0;
>  }
>  
>  static inline bool task_in_memcg_oom(struct task_struct *p)
> @@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
>  
> -static inline void mem_cgr

[PATCH] writeback: update stale account_page_redirty() comment

2018-06-25 Thread Greg Thelen

commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into
bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
account_page_redirty().  Update comment to track that change.
  BDI_DIRTIED => WB_DIRTIED
  BDI_WRITTEN => WB_WRITTEN

Signed-off-by: Greg Thelen 
---
 mm/page-writeback.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 337c6afb3345..6551d3b0dc30 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2490,8 +2490,8 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
  * Call this whenever redirtying a page, to de-account the dirty counters
- * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written
- * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to
+ * (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written
+ * counters (NR_WRITTEN, WB_WRITTEN) in long term. The mismatches will lead to
  * systematic errors in balanced_dirty_ratelimit and the dirty pages position
  * control.
  */
-- 
2.18.0.rc2.346.g013aa6912e-goog

Re: [PATCH v2] mm: condense scan_control

2018-06-20 Thread Greg Thelen

On Tue, May 29, 2018 at 11:12 PM Greg Thelen  wrote:
>
> Use smaller scan_control fields for order, priority, and reclaim_idx.
> Convert fields from int => s8.  All easily fit within a byte:
> * allocation order range: 0..MAX_ORDER(64?)
> * priority range: 0..12(DEF_PRIORITY)
> * reclaim_idx range:  0..6(__MAX_NR_ZONES)
>
> Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
> stack overflows are not an issue.  But it's inefficient to use ints.
>
> Use s8 (signed byte) rather than u8 to allow for loops like:
> do {
> ...
> } while (--sc.priority >= 0);
>
> Add BUILD_BUG_ON to verify that s8 is capable of storing max values.
>
> This reduces sizeof(struct scan_control):
> * 96 => 80 bytes (x86_64)
> * 68 => 56 bytes (i386)
>
> scan_control structure field order is changed to utilize padding.
> After this patch there is 1 bit of scan_control padding.
>
> Signed-off-by: Greg Thelen 
> Suggested-by: Matthew Wilcox 

Is there any interest in this?  Less stack usage could mean less
dcache and dtlb pressure.  But I understand if the complexity is
distasteful.

> ---
>  mm/vmscan.c | 32 
>  1 file changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9b697323a88c..42731faea306 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -65,12 +65,6 @@ struct scan_control {
> /* How many pages shrink_list() should reclaim */
> unsigned long nr_to_reclaim;
>
> -   /* This context's GFP mask */
> -   gfp_t gfp_mask;
> -
> -   /* Allocation order */
> -   int order;
> -
> /*
>  * Nodemask of nodes allowed by the caller. If NULL, all nodes
>  * are scanned.
> @@ -83,12 +77,6 @@ struct scan_control {
>  */
> struct mem_cgroup *target_mem_cgroup;
>
> -   /* Scan (total_size >> priority) pages at once */
> -   int priority;
> -
> -   /* The highest zone to isolate pages for reclaim from */
> -   enum zone_type reclaim_idx;
> -
> /* Writepage batching in laptop mode; RECLAIM_WRITE */
> unsigned int may_writepage:1;
>
> @@ -111,6 +99,18 @@ struct scan_control {
> /* One of the zones is ready for compaction */
> unsigned int compaction_ready:1;
>
> +   /* Allocation order */
> +   s8 order;
> +
> +   /* Scan (total_size >> priority) pages at once */
> +   s8 priority;
> +
> +   /* The highest zone to isolate pages for reclaim from */
> +   s8 reclaim_idx;
> +
> +   /* This context's GFP mask */
> +   gfp_t gfp_mask;
> +
> /* Incremented by the number of inactive pages that were scanned */
> unsigned long nr_scanned;
>
> @@ -3047,6 +3047,14 @@ unsigned long try_to_free_pages(struct zonelist 
> *zonelist, int order,
> .may_swap = 1,
> };
>
> +   /*
> +* scan_control uses s8 fields for order, priority, and reclaim_idx.
> +* Confirm they are large enough for max values.
> +*/
> +   BUILD_BUG_ON(MAX_ORDER > S8_MAX);
> +   BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
> +   BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
> +
> /*
>  * Do not enter reclaim if fatal signal was delivered while throttled.
>  * 1 is returned so that the page allocator does not OOM kill at this
> --
> 2.17.0.921.gf22659ad46-goog
>

[PATCH] trace: fix SKIP_STACK_VALIDATION=1 build

2018-06-08 Thread Greg Thelen

Non gcc-5 builds with CONFIG_STACK_VALIDATION=y and
SKIP_STACK_VALIDATION=1 fail.
Example output:
  /bin/sh: init/.tmp_main.o: Permission denied

commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace"),
added a mismatched endif.  This causes cmd_objtool to get mistakenly
set.

Relocate endif to balance the newly added -record-mcount check.

Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace")
Signed-off-by: Greg Thelen 
---
 scripts/Makefile.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 34d9e9ce97c2..e7889f486ca1 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -239,6 +239,7 @@ cmd_record_mcount = 
\
 "$(CC_FLAGS_FTRACE)" ]; then   \
$(sub_cmd_record_mcount)\
fi;
+endif # -record-mcount
 endif # CONFIG_FTRACE_MCOUNT_RECORD
 
 ifdef CONFIG_STACK_VALIDATION
@@ -263,7 +264,6 @@ ifneq ($(RETPOLINE_CFLAGS),)
   objtool_args += --retpoline
 endif
 endif
-endif
 
 
 ifdef CONFIG_MODVERSIONS
-- 
2.18.0.rc1.242.g61856ae69a-goog

Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency

2018-06-07 Thread Greg Thelen

On Mon, Jun 4, 2018 at 4:07 PM Jason Gunthorpe  wrote:
>
> On Thu, May 31, 2018 at 02:40:59PM -0400, Doug Ledford wrote:
> > On Wed, 2018-05-30 at 21:03 -0700, Greg Thelen wrote:
> > > On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe  wrote:
> > >
> > > > On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote:
> > > > > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote:
> > > > > > > On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote:
> > > > > > > >
> > > > > > > > On 5/30/2018 5:04 PM, Jason Gunthorpe wrote:
> > > > > > > > > On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote:
> > > > > > > > > > The newly added fill_res_ep_entry function fails to link if
> > > > > > > > > > CONFIG_INFINIBAND_ADDR_TRANS is not set:
> > > > > > > > > >
> > > > > > > > > > drivers/infiniband/hw/cxgb4/restrack.o: In function
> > >
> > > `fill_res_ep_entry':
> > > > > > > > > > restrack.c:(.text+0x3cc): undefined reference to 
> > > > > > > > > > `rdma_res_to_id'
> > > > > > > > > > restrack.c:(.text+0x3d0): undefined reference to 
> > > > > > > > > > `rdma_iw_cm_id'
> > > > > > > > > >
> > > > > > > > > > This adds a Kconfig dependency for the driver.
> > > > > > > > > >
> > > > > > > > > > Fixes: 116aeb887371 ("iw_cxgb4: provide detailed
> > >
> > > provider-specific CM_ID information")
> > > > > > > > > > Signed-off-by: Arnd Bergmann 
> > > > > > > > > >  drivers/infiniband/hw/cxgb4/Kconfig | 1 +
> > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > >
> > > > > > > > > Oh, I think we need to solve this with maybe a header fill 
> > > > > > > > > null
> > >
> > > stub
> > > > > > > > > instead..
> > > > > > > > >
> > > > > > > > > We don't want to disable drivers just because a user 
> > > > > > > > > interface is
> > > > > > > > > disabled.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Why does CONFIG_INFINIBAND_ADDR_TRANS disable building 
> > > > > > > > rdma_cm.ko?
> > >
> > > That
> > > > > > > > is not correct.
> > > > > > >
> > > > > > > That seems like a reasonable thing to do..
> > > > > >
> > > > > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, 
> > > > > > and
> > > > > > is required for iwarp drivers.  It seems rdma_cm.ko is not being
> > > > > > compiled if ADDR_TRANS is not set.
> > > > I think the intention was to completely disable rdma-cm, including all
> > > > support for rx'ing remote packets? Greg?
> > >
> > > Yes.  That's my goal when INFINIBAND_ADDR_TRANS is unset.
> > >
> > > > If this is required for iwarp then Arnd's patch is probably the right
> > > > way to go..
> > > > Jason
> > >
> > > Agreed.
> > > Acked-by: Greg Thelen 
> >
> > If that's the case, then there should be a NOTE: in the Kconfig that
> > disabling the connection manager completely disables iWARP hardware.
> >
> > I don't really think I like this approach though.  At a minimum if you
> > are going to make iWARP totally dependent on rdmacm, then there is zero
> > reason that iw_cm.o should be part of the obj-$(CONFIG_INFINIBAND)
> > Makefile recipe when ADDR_TRANS is disabled.
>
> This makes sense, Greg, can you send a followup patch with these
> items? I know this series has been tough..
>
> Thanks,
> Jason

Ack.  Sorry for delay.  I'll take a look at this in a day or two.  Let
me know if it's more urgent.

Re: [PATCH] iw_cxgb4: add INFINIBAND_ADDR_TRANS dependency

2018-05-30 Thread Greg Thelen

On Wed, May 30, 2018 at 4:01 PM Jason Gunthorpe  wrote:

> On Thu, May 31, 2018 at 12:40:54AM +0200, Arnd Bergmann wrote:
> > > On 5/30/2018 5:25 PM, Jason Gunthorpe wrote:
> > >> On Wed, May 30, 2018 at 05:10:35PM -0500, Steve Wise wrote:
> > >>>
> > >>> On 5/30/2018 5:04 PM, Jason Gunthorpe wrote:
> > >>>> On Wed, May 30, 2018 at 11:58:18PM +0200, Arnd Bergmann wrote:
> > >>>>> The newly added fill_res_ep_entry function fails to link if
> > >>>>> CONFIG_INFINIBAND_ADDR_TRANS is not set:
> > >>>>>
> > >>>>> drivers/infiniband/hw/cxgb4/restrack.o: In function
`fill_res_ep_entry':
> > >>>>> restrack.c:(.text+0x3cc): undefined reference to `rdma_res_to_id'
> > >>>>> restrack.c:(.text+0x3d0): undefined reference to `rdma_iw_cm_id'
> > >>>>>
> > >>>>> This adds a Kconfig dependency for the driver.
> > >>>>>
> > >>>>> Fixes: 116aeb887371 ("iw_cxgb4: provide detailed
provider-specific CM_ID information")
> > >>>>> Signed-off-by: Arnd Bergmann 
> > >>>>>  drivers/infiniband/hw/cxgb4/Kconfig | 1 +
> > >>>>>  1 file changed, 1 insertion(+)
> > >>>> Oh, I think we need to solve this with maybe a header fill null
stub
> > >>>> instead..
> > >>>>
> > >>>> We don't want to disable drivers just because a user interface is
> > >>>> disabled.
> > >>>>
> > >>> Why does CONFIG_INFINIBAND_ADDR_TRANS disable building rdma_cm.ko?
That
> > >>> is not correct.
> > >> That seems like a reasonable thing to do..
> > > rdma_ucm.ko is for usermode users, rdma_cm.ko is for kernel users, and
> > > is required for iwarp drivers.  It seems rdma_cm.ko is not being
> > > compiled if ADDR_TRANS is not set.

> I think the intention was to completely disable rdma-cm, including all
> support for rx'ing remote packets? Greg?

Yes.  That's my goal when INFINIBAND_ADDR_TRANS is unset.

> If this is required for iwarp then Arnd's patch is probably the right
> way to go..

> Jason

Agreed.
Acked-by: Greg Thelen

[PATCH v2] mm: condense scan_control

2018-05-29 Thread Greg Thelen

Use smaller scan_control fields for order, priority, and reclaim_idx.
Convert fields from int => s8.  All easily fit within a byte:
* allocation order range: 0..MAX_ORDER(64?)
* priority range: 0..12(DEF_PRIORITY)
* reclaim_idx range:  0..6(__MAX_NR_ZONES)

Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
stack overflows are not an issue.  But it's inefficient to use ints.

Use s8 (signed byte) rather than u8 to allow for loops like:
do {
...
} while (--sc.priority >= 0);

Add BUILD_BUG_ON to verify that s8 is capable of storing max values.

This reduces sizeof(struct scan_control):
* 96 => 80 bytes (x86_64)
* 68 => 56 bytes (i386)

scan_control structure field order is changed to utilize padding.
After this patch there is 1 bit of scan_control padding.

Signed-off-by: Greg Thelen 
Suggested-by: Matthew Wilcox 
---
 mm/vmscan.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..42731faea306 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -65,12 +65,6 @@ struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
 
-   /* This context's GFP mask */
-   gfp_t gfp_mask;
-
-   /* Allocation order */
-   int order;
-
/*
 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 * are scanned.
@@ -83,12 +77,6 @@ struct scan_control {
 */
struct mem_cgroup *target_mem_cgroup;
 
-   /* Scan (total_size >> priority) pages at once */
-   int priority;
-
-   /* The highest zone to isolate pages for reclaim from */
-   enum zone_type reclaim_idx;
-
/* Writepage batching in laptop mode; RECLAIM_WRITE */
unsigned int may_writepage:1;
 
@@ -111,6 +99,18 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;
 
+   /* Allocation order */
+   s8 order;
+
+   /* Scan (total_size >> priority) pages at once */
+   s8 priority;
+
+   /* The highest zone to isolate pages for reclaim from */
+   s8 reclaim_idx;
+
+   /* This context's GFP mask */
+   gfp_t gfp_mask;
+
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
 
@@ -3047,6 +3047,14 @@ unsigned long try_to_free_pages(struct zonelist 
*zonelist, int order,
.may_swap = 1,
};
 
+   /*
+* scan_control uses s8 fields for order, priority, and reclaim_idx.
+* Confirm they are large enough for max values.
+*/
+   BUILD_BUG_ON(MAX_ORDER > S8_MAX);
+   BUILD_BUG_ON(DEF_PRIORITY > S8_MAX);
+   BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
+
/*
 * Do not enter reclaim if fatal signal was delivered while throttled.
 * 1 is returned so that the page allocator does not OOM kill at this
-- 
2.17.0.921.gf22659ad46-goog

Re: [PATCH] mm: convert scan_control.priority int => byte

2018-05-29 Thread Greg Thelen

Matthew Wilcox  wrote:

> On Mon, May 28, 2018 at 07:40:25PM -0700, Greg Thelen wrote:
>> Reclaim priorities range from 0..12(DEF_PRIORITY).
>> scan_control.priority is a 4 byte int, which is overkill.
>> 
>> Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
>> stack overflows are not an issue.  But it's inefficient to use 4 bytes
>> for priority.
>
> If you're looking to shave a few more bytes, allocation order can fit
> in a u8 too (can't be more than 6 bits, and realistically won't be more
> than 4 bits).  reclaim_idx likewise will fit in a u8, and actually won't
> be more than 3 bits.

Nod.  Good tip.  Included in ("[PATCH v2] mm: condense scan_control").

> I am sceptical that nr_to_reclaim should really be an unsigned long; I
> don't think we should be trying to free 4 billion pages in a single call.
> nr_scanned might be over 4 billion (!) but nr_reclaimed can probably
> shrink to unsigned int along with nr_to_reclaim.

Agreed.  For patch simplicity, I'll pass on this for now.

[PATCH] mm: convert scan_control.priority int => byte

2018-05-28 Thread Greg Thelen

Reclaim priorities range from 0..12(DEF_PRIORITY).
scan_control.priority is a 4 byte int, which is overkill.

Since commit 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64
stack overflows are not an issue.  But it's inefficient to use 4 bytes
for priority.

Use s8 (signed byte) rather than u8 to allow for loops like:
do {
...
} while (--sc.priority >= 0);

This reduces sizeof(struct scan_control) from 96 => 88 bytes (x86_64),
which saves some stack.

scan_control.priority field order is changed to occupy otherwise unused
padding.

Signed-off-by: Greg Thelen 
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b697323a88c..541c334bd176 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -83,9 +83,6 @@ struct scan_control {
 */
struct mem_cgroup *target_mem_cgroup;
 
-   /* Scan (total_size >> priority) pages at once */
-   int priority;
-
/* The highest zone to isolate pages for reclaim from */
enum zone_type reclaim_idx;
 
@@ -111,6 +108,9 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;
 
+   /* Scan (total_size >> priority) pages at once */
+   s8 priority;
+
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
 
-- 
2.17.0.921.gf22659ad46-goog

Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"

2018-05-28 Thread Greg Thelen

Jason Gunthorpe  wrote:

On Fri, May 25, 2018 at 05:32:52PM -0700, Greg Thelen wrote:

On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann  wrote:

> Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn  
depends

> on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
> link error when another driver using it is built-in. The
> INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
> a 'bool' symbol that does not force anything to be a module in turn.

> fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
> smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_request':
> trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_destroy_trans':
> trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
> trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'

> Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig
dependencies")
> Signed-off-by: Arnd Bergmann 

Acked-by: Greg Thelen 

Sorry for the 9533b292a7ac problem.
At this point the in release cycle, I think Arnd's revert is best.

If there is interest, I've put a little thought into an alternative fix:
making INFINIBAND_ADDR_TRANS tristate.  But it's nontrivial.
So I prefer this simple revert for now.

Is that a normal thing to do?

For me: no, it's not normal.  In my use case I merely want to disable
INFINIBAND_ADDR_TRANS while continuing to use INFINIBAND.  This is
supported with f7cb7b85be55 ("IB: make INFINIBAND_ADDR_TRANS
configurable").

During f7cb7b85be55 development https://lkml.org/lkml/2018/4/30/1073
suggested that we drop dependency on both INFINIBAND and
INFINIBAND_ADDR_TRANS.  Thus 9533b292a7ac ("IB: remove redundant
INFINIBAND kconfig dependencies").

But 9533b292a7ac led to the randconfig build errors reported and thus
("IB: Revert "remove redundant INFINIBAND kconfig dependencies"").

So I see no need to do anything more than apply ("IB: Revert "remove
redundant INFINIBAND kconfig dependencies"").

Doug: do you need anything from me on this?

I can take the revert..

Jason

Thanks.

Re: [PATCH] IB: Revert "remove redundant INFINIBAND kconfig dependencies"

2018-05-25 Thread Greg Thelen

On Fri, May 25, 2018 at 2:32 PM Arnd Bergmann  wrote:

> Several subsystems depend on INFINIBAND_ADDR_TRANS, which in turn depends
> on INFINIBAND. However, when with CONFIG_INIFIBAND=m, this leads to a
> link error when another driver using it is built-in. The
> INFINIBAND_ADDR_TRANS dependency is insufficient here as this is
> a 'bool' symbol that does not force anything to be a module in turn.

> fs/cifs/smbdirect.o: In function `smbd_disconnect_rdma_work':
> smbdirect.c:(.text+0x1e4): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_request':
> trans_rdma.c:(.text+0x7bc): undefined reference to `rdma_disconnect'
> net/9p/trans_rdma.o: In function `rdma_destroy_trans':
> trans_rdma.c:(.text+0x830): undefined reference to `ib_destroy_qp'
> trans_rdma.c:(.text+0x858): undefined reference to `ib_dealloc_pd'

> Fixes: 9533b292a7ac ("IB: remove redundant INFINIBAND kconfig
dependencies")
> Signed-off-by: Arnd Bergmann 

Acked-by: Greg Thelen 

Sorry for the 9533b292a7ac problem.
At this point the in release cycle, I think Arnd's revert is best.

If there is interest, I've put a little thought into an alternative fix:
making INFINIBAND_ADDR_TRANS tristate.  But it's nontrivial.
So I prefer this simple revert for now.

Doug: do you need anything from me on this?

> ---
> The patch that introduced the problem has been queued in the
> rdma-fixes/for-rc tree. Please revert the patch before sending
> the branch to Linus.
> ---
> drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> drivers/nvme/host/Kconfig   | 2 +-
> drivers/nvme/target/Kconfig | 2 +-
> drivers/staging/lustre/lnet/Kconfig | 2 +-
> fs/cifs/Kconfig | 2 +-
> net/9p/Kconfig  | 2 +-
> net/rds/Kconfig | 2 +-
> net/sunrpc/Kconfig  | 2 +-
> 8 files changed, 8 insertions(+), 8 deletions(-)

> diff --git a/drivers/infiniband/ulp/srpt/Kconfig
b/drivers/infiniband/ulp/srpt/Kconfig
> index 25bf6955b6d0..fb8b7182f05e 100644
> --- a/drivers/infiniband/ulp/srpt/Kconfig
> +++ b/drivers/infiniband/ulp/srpt/Kconfig
> @@ -1,6 +1,6 @@
> config INFINIBAND_SRPT
>tristate "InfiniBand SCSI RDMA Protocol target support"
> -   depends on INFINIBAND_ADDR_TRANS && TARGET_CORE
> +   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
>---help---

>  Support for the SCSI RDMA Protocol (SRP) Target driver. The
> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
> index dbb7464c018c..88a8b5916624 100644
> --- a/drivers/nvme/host/Kconfig
> +++ b/drivers/nvme/host/Kconfig
> @@ -27,7 +27,7 @@ config NVME_FABRICS

> config NVME_RDMA
>tristate "NVM Express over Fabrics RDMA host driver"
> -   depends on INFINIBAND_ADDR_TRANS && BLOCK
> +   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
>select NVME_CORE
>select NVME_FABRICS
>select SG_POOL
> diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
> index 7595664ee753..3c7b61ddb0d1 100644
> --- a/drivers/nvme/target/Kconfig
> +++ b/drivers/nvme/target/Kconfig
> @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP

> config NVME_TARGET_RDMA
>tristate "NVMe over Fabrics RDMA target support"
> -   depends on INFINIBAND_ADDR_TRANS
> +   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
>depends on NVME_TARGET
>select SGL_ALLOC
>help
> diff --git a/drivers/staging/lustre/lnet/Kconfig
b/drivers/staging/lustre/lnet/Kconfig
> index f3b1ad4bd3dc..ad049e6f24e4 100644
> --- a/drivers/staging/lustre/lnet/Kconfig
> +++ b/drivers/staging/lustre/lnet/Kconfig
> @@ -34,7 +34,7 @@ config LNET_SELFTEST

> config LNET_XPRT_IB
>tristate "LNET infiniband support"
> -   depends on LNET && PCI && INFINIBAND_ADDR_TRANS
> +   depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS
>default LNET && INFINIBAND
>help
>  This option allows the LNET users to use infiniband as an
> diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
> index d61e2de8d0eb..5f132d59dfc2 100644
> --- a/fs/cifs/Kconfig
> +++ b/fs/cifs/Kconfig
> @@ -197,7 +197,7 @@ config CIFS_SMB311

> config CIFS_SMB_DIRECT
>bool "SMB Direct support (Experimental)"
> -   depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y &&
INFINIBAND_ADDR_TRANS=y
> +   depends on CIFS=m &&a

Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-05-03 Thread Greg Thelen

On Tue, May 1, 2018 at 1:48 PM Jason Gunthorpe  wrote:

> On Tue, May 01, 2018 at 03:08:57AM +0000, Greg Thelen wrote:
> > On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe  wrote:
> >
> > > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> > > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided
symbols.
> > > > So declare the kconfig dependency.  This is necessary to allow for
> > > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> > > >
> > > > Signed-off-by: Greg Thelen 
> > > > Cc: Tarick Bedeir 
> > > >  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig
> > b/drivers/infiniband/ulp/srpt/Kconfig
> > > > index 31ee83d528d9..fb8b7182f05e 100644
> > > > +++ b/drivers/infiniband/ulp/srpt/Kconfig
> > > > @@ -1,6 +1,6 @@
> > > >  config INFINIBAND_SRPT
> > > >   tristate "InfiniBand SCSI RDMA Protocol target support"
> > > > - depends on INFINIBAND && TARGET_CORE
> > > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
> >
> > > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
> > > INFINIBAND_ADDR_TRANS without INFINIBAND.
> >
> > By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So yes, it
seems
> > redundant.  I don't know if anyone has designs to break this dependency
and
> > allow for ADDR_TRANS without INFINIBAND.  Assuming not, I'd be willing
to
> > amend my series removing redundant INFINIBAND and a followup series to
> > remove it from similar depends.  Though I'm not familiar with rdma dev
tree
> > lifecycle.  Is rdma/for-rc a throw away branch (akin to linux-next), or
> > will it be merged into linus/master?  If throwaway, then we can amend
> > its patches, otherwise followups will be needed.
> >
> > Let me know what you'd prefer.  Thanks.

> I think a single update patch to fix all of these, and the
> pre-existing ones would be great

I just posted the cleanup: https://lkml.org/lkml/2018/5/3/1045

[PATCH] IB: remove redundant INFINIBAND kconfig dependencies

2018-05-03 Thread Greg Thelen

INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So there's no need for
options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND.
Remove the unnecessary INFINIBAND depends.

Signed-off-by: Greg Thelen 
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 drivers/nvme/host/Kconfig   | 2 +-
 drivers/nvme/target/Kconfig | 2 +-
 drivers/staging/lustre/lnet/Kconfig | 2 +-
 fs/cifs/Kconfig | 2 +-
 net/9p/Kconfig  | 2 +-
 net/rds/Kconfig | 2 +-
 net/sunrpc/Kconfig  | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig 
b/drivers/infiniband/ulp/srpt/Kconfig
index fb8b7182f05e..25bf6955b6d0 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
+   depends on INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 88a8b5916624..dbb7464c018c 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
+   depends on INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 3c7b61ddb0d1..7595664ee753 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
diff --git a/drivers/staging/lustre/lnet/Kconfig 
b/drivers/staging/lustre/lnet/Kconfig
index ad049e6f24e4..f3b1ad4bd3dc 100644
--- a/drivers/staging/lustre/lnet/Kconfig
+++ b/drivers/staging/lustre/lnet/Kconfig
@@ -34,7 +34,7 @@ config LNET_SELFTEST
 
 config LNET_XPRT_IB
tristate "LNET infiniband support"
-   depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on LNET && PCI && INFINIBAND_ADDR_TRANS
default LNET && INFINIBAND
help
  This option allows the LNET users to use infiniband as an
diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 5f132d59dfc2..d61e2de8d0eb 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
+   depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
diff --git a/net/9p/Kconfig b/net/9p/Kconfig
index e6014e0e51f7..46c39f7da444 100644
--- a/net/9p/Kconfig
+++ b/net/9p/Kconfig
@@ -32,7 +32,7 @@ config NET_9P_XEN
 
 
 config NET_9P_RDMA
-   depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on INET && INFINIBAND_ADDR_TRANS
tristate "9P RDMA Transport (Experimental)"
help
  This builds support for an RDMA transport.
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index bffde4b46c5d..1a31502ee7db 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -8,7 +8,7 @@ config RDS
 
 config RDS_RDMA
tristate "RDS over Infiniband"
-   depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on RDS && INFINIBAND_ADDR_TRANS
---help---
  Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index ac09ca803296..6358e5271070 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -50,7 +50,7 @@ config SUNRPC_DEBUG
 
 config SUNRPC_XPRT_RDMA
tristate "RPC-over-RDMA transport"
-   depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS
+   depends on SUNRPC && INFINIBAND_ADDR_TRANS
default SUNRPC && INFINIBAND
select SG_POOL
help
-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH] memcg: mark memcg1_events static const

2018-05-03 Thread Greg Thelen

Mark memcg1_events static: it's only used by memcontrol.c.
And mark it const: it's not modified.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..c9c7e5ea0e2f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3083,7 +3083,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void 
*v)
 #endif /* CONFIG_NUMA */
 
 /* Universal VM events cgroup1 shows, original sort order */
-unsigned int memcg1_events[] = {
+static const unsigned int memcg1_events[] = {
PGPGIN,
PGPGOUT,
PGFAULT,
-- 
2.17.0.441.gb46fe60e1d-goog

Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-30 Thread Greg Thelen

On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe  wrote:

> On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote:
> > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
> > So declare the kconfig dependency.  This is necessary to allow for
> > enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
> >
> > Signed-off-by: Greg Thelen 
> > Cc: Tarick Bedeir 
> >  drivers/infiniband/ulp/srpt/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/ulp/srpt/Kconfig
b/drivers/infiniband/ulp/srpt/Kconfig
> > index 31ee83d528d9..fb8b7182f05e 100644
> > +++ b/drivers/infiniband/ulp/srpt/Kconfig
> > @@ -1,6 +1,6 @@
> >  config INFINIBAND_SRPT
> >   tristate "InfiniBand SCSI RDMA Protocol target support"
> > - depends on INFINIBAND && TARGET_CORE
> > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE

> Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have
> INFINIBAND_ADDR_TRANS without INFINIBAND.

By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND.  So yes, it seems
redundant.  I don't know if anyone has designs to break this dependency and
allow for ADDR_TRANS without INFINIBAND.  Assuming not, I'd be willing to
amend my series removing redundant INFINIBAND and a followup series to
remove it from similar depends.  Though I'm not familiar with rdma dev tree
lifecycle.  Is rdma/for-rc a throw away branch (akin to linux-next), or
will it be merged into linus/master?  If throwaway, then we can amend
its patches, otherwise followups will be needed.

Let me know what you'd prefer.  Thanks.

FYI from v4.17-rc3:
drivers/staging/lustre/lnet/Kconfig:  depends on LNET && PCI && INFINIBAND
&& INFINIBAND_ADDR_TRANS
net/9p/Kconfig:   depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS
net/rds/Kconfig:  depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
net/sunrpc/Kconfig:   depends on SUNRPC && INFINIBAND &&
INFINIBAND_ADDR_TRANS

[PATCH 6/6] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-26 Thread Greg Thelen

Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been
finding fair number of CM bugs.  So provide option to disable it.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog

[PATCH 5/6] ib_srp: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen

INFINIBAND_SRP code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/ulp/srp/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/Kconfig 
b/drivers/infiniband/ulp/srp/Kconfig
index c74ee9633041..99db8fe5173a 100644
--- a/drivers/infiniband/ulp/srp/Kconfig
+++ b/drivers/infiniband/ulp/srp/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRP
tristate "InfiniBand SCSI RDMA Protocol"
-   depends on SCSI
+   depends on SCSI && INFINIBAND_ADDR_TRANS
select SCSI_SRP_ATTRS
---help---
  Support for the SCSI RDMA Protocol over InfiniBand.  This
-- 
2.17.0.484.g0c8726318c-goog

[PATCH 4/6] cifs: smbd: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen

CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
Reviewed-by: Long Li 
---
 fs/cifs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 741749a98614..5f132d59dfc2 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y
+   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
-- 
2.17.0.484.g0c8726318c-goog

[PATCH 1/6] nvme: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen

NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.  So
declare the kconfig dependency.  This is necessary to allow for enabling
INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/nvme/host/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index b979cf3bce65..88a8b5916624 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && BLOCK
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
-- 
2.17.0.484.g0c8726318c-goog

[PATCH 3/6] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen

INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig 
b/drivers/infiniband/ulp/srpt/Kconfig
index 31ee83d528d9..fb8b7182f05e 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && TARGET_CORE
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
-- 
2.17.0.484.g0c8726318c-goog

[PATCH 2/6] nvmet-rdma: depend on INFINIBAND_ADDR_TRANS

2018-04-26 Thread Greg Thelen

NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/nvme/target/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 5f4f8b16685f..3c7b61ddb0d1 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
-- 
2.17.0.484.g0c8726318c-goog

[PATCH 0/6] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-26 Thread Greg Thelen

d on INFINIBAND_ADDR_TRANS"
drivers/nvme/host/rdma.c
  - depends on NVME_RDMA => INFINIBAND_ADDR_TRANS
per this series' "nvme: depend on INFINIBAND_ADDR_TRANS"
drivers/nvme/target/rdma.c
  - depends on NVME_TARGET_RDMA => INFINIBAND_ADDR_TRANS
per this series' "nvmet-rdma: depend on INFINIBAND_ADDR_TRANS"
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
  - depends on LNET_XPRT_IB => INFINIBAND_ADDR_TRANS
fs/cifs/smbdirect.c
  - depends on CIFS_SMB_DIRECT => INFINIBAND_ADDR_TRANS
per this series' "cifs: smbd: depend on INFINIBAND_ADDR_TRANS"
net/9p/trans_rdma.c
  - depends on NET_9P_RDMA => INFINIBAND_ADDR_TRANS
net/rds/ib.c
net/rds/ib_cm.c
net/rds/rdma_transport.c
  - depends on RDS_RDMA => INFINIBAND_ADDR_TRANS
net/sunrpc/xprtrdma/svc_rdma_transport.c
net/sunrpc/xprtrdma/transport.c
net/sunrpc/xprtrdma/verbs.c
  - depends on SUNRPC_XPRT_RDMA => INFINIBAND_ADDR_TRANS

Greg Thelen (6):
  nvme: depend on INFINIBAND_ADDR_TRANS
  nvmet-rdma: depend on INFINIBAND_ADDR_TRANS
  ib_srpt: depend on INFINIBAND_ADDR_TRANS
  cifs: smbd: depend on INFINIBAND_ADDR_TRANS
  ib_srp: depend on INFINIBAND_ADDR_TRANS
  IB: make INFINIBAND_ADDR_TRANS configurable

 drivers/infiniband/Kconfig  | 5 -
 drivers/infiniband/ulp/srp/Kconfig  | 2 +-
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 drivers/nvme/host/Kconfig   | 2 +-
 drivers/nvme/target/Kconfig | 2 +-
 fs/cifs/Kconfig | 2 +-
 6 files changed, 9 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog

Re: [PATCH 5/5] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-25 Thread Greg Thelen

On Wed, Apr 25, 2018 at 7:13 PM Bart Van Assche 
wrote:

> On Wed, 2018-04-25 at 15:34 -0700, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been
> > finding fair number of CM bugs.  So provide option to disable it.
> >
> > Signed-off-by: Greg Thelen 
> > Cc: Tarick Bedeir 
> > ---
> >  drivers/infiniband/Kconfig | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
> > index ee270e065ba9..2a972ed6851b 100644
> > --- a/drivers/infiniband/Kconfig
> > +++ b/drivers/infiniband/Kconfig
> > @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
> > pages on demand instead.
> >
> >  config INFINIBAND_ADDR_TRANS
> > - bool
> > + bool "RDMA/CM"
> >   depends on INFINIBAND
> >   default y
> > + ---help---
> > +   Support for RDMA communication manager (CM).
> > +   This allows for a generic connection abstraction over RDMA.

> Hello Greg,

> Please provide a cover letter when posting a patch series. Such a cover
> letter is not only informative but also makes it easy for people who want
> to comment on a patch series as a whole. I have a question that applies
> to the entire patch series. The RDMA/CM code defines functions like
> rdma_create_id() and rdma_create_qp(). If I search through the kernel tree
> for callers of these functions then I find several more kernel modules
than
> the ones that are modified by this patch series:

> $ git grep -lE '[[:blank:]](rdma_create_id|rdma_create_qp)\('
> drivers/infiniband/core/cma.c
> drivers/infiniband/ulp/iser/iser_verbs.c
> drivers/infiniband/ulp/isert/ib_isert.c
> drivers/infiniband/ulp/srp/ib_srp.c
> drivers/infiniband/ulp/srpt/ib_srpt.c
> drivers/nvme/host/rdma.c
> drivers/nvme/target/rdma.c
> drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
> drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
> fs/cifs/smbdirect.c
> include/rdma/rdma_cm.h
> net/9p/trans_rdma.c
> net/rds/ib.c
> net/rds/ib_cm.c
> net/rds/rdma_transport.c
> net/sunrpc/xprtrdma/svc_rdma_transport.c
> net/sunrpc/xprtrdma/verbs.c

> Are you sure that this patch series is complete?

I'll check your cited files.  I'll resend with cover letter.

FYI: I already rand this series through the 0-day builder, which presumably
would've caught most config dep issues.  But I'll research your list before
reposting.

[PATCH 5/5] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-25 Thread Greg Thelen

Allow INFINIBAND without INFINIBAND_ADDR_TRANS because fuzzing has been
finding fair number of CM bugs.  So provide option to disable it.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH 4/5] cifs: smbd: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen

CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 fs/cifs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 741749a98614..5f132d59dfc2 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -197,7 +197,7 @@ config CIFS_SMB311
 
 config CIFS_SMB_DIRECT
bool "SMB Direct support (Experimental)"
-   depends on CIFS=m && INFINIBAND || CIFS=y && INFINIBAND=y
+   depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && 
INFINIBAND=y && INFINIBAND_ADDR_TRANS=y
help
  Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
  SMB Direct allows transferring SMB packets over RDMA. If unsure,
-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen

INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/ulp/srpt/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srpt/Kconfig 
b/drivers/infiniband/ulp/srpt/Kconfig
index 31ee83d528d9..fb8b7182f05e 100644
--- a/drivers/infiniband/ulp/srpt/Kconfig
+++ b/drivers/infiniband/ulp/srpt/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_SRPT
tristate "InfiniBand SCSI RDMA Protocol target support"
-   depends on INFINIBAND && TARGET_CORE
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE
---help---
 
  Support for the SCSI RDMA Protocol (SRP) Target driver. The
-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH 2/5] nvmet-rdma: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen

NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency.  This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/nvme/target/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 5f4f8b16685f..3c7b61ddb0d1 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -27,7 +27,7 @@ config NVME_TARGET_LOOP
 
 config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
-   depends on INFINIBAND
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS
depends on NVME_TARGET
select SGL_ALLOC
help
-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH 1/5] nvme: depend on INFINIBAND_ADDR_TRANS

2018-04-25 Thread Greg Thelen

NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.  So
declare the kconfig dependency.  This is necessary to allow for enabling
INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/nvme/host/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index b979cf3bce65..88a8b5916624 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -27,7 +27,7 @@ config NVME_FABRICS
 
 config NVME_RDMA
tristate "NVM Express over Fabrics RDMA host driver"
-   depends on INFINIBAND && BLOCK
+   depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK
select NVME_CORE
select NVME_FABRICS
select SG_POOL
-- 
2.17.0.441.gb46fe60e1d-goog

Re: [RFC PATCH 0/2] memory.low,min reclaim

2018-04-23 Thread Greg Thelen

On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin  wrote:

> Hi, Greg!

> On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > Roman's previously posted memory.low,min patches add per memcg effective
> > low limit to detect overcommitment of parental limits.  But if we flip
> > low,min reclaim to bail if usage<{low,min} at any level, then we don't
need
> > an effective low limit, which makes the code simpler.  When parent
limits
> > are overcommited memory.min will oom kill, which is more drastic but
makes
> > the memory.low a simpler concept.  If memcg a/b wants oom kill before
> > reclaim, then give it to them.  It seems a bit strange for
a/b/memory.low's
> > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless
> > a/b.low+a/c.low exceed a.low).

> It's actually not strange: a/b and a/c are sharing a common resource:
> a/memory.low.

> Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> If there are sibling cgroups which are consuming memory, a cgroup can't
> exceed parent's memory.max, even if its memory.max is grater.

> >
> > I think there might be a simpler way (ableit it doesn't yet include
> > Documentation):
> > - memcg: fix memory.low
> > - memcg: add memory.min
> >  3 files changed, 75 insertions(+), 6 deletions(-)
> >
> > The idea of this alternate approach is for memory.low,min to avoid
reclaim
> > if any portion of under-consideration memcg ancestry is under respective
> > limit.

> This approach has a significant downside: it breaks hierarchical
constraints
> for memory.low/min. There are two important outcomes:

> 1) Any leaf's memory.low/min value is respected, even if parent's value
>   is lower or even 0. It's not possible anymore to limit the amount
of
>   protected memory for a sub-tree.
>   This is especially bad in case of delegation.

As someone who has been using something like memory.min for a while, I have
cases where it needs to be a strong protection.  Such jobs prefer oom kill
to reclaim.  These jobs know they need X MB of memory.  But I guess it's on
me to avoid configuring machines which overcommit memory.min at such cgroup
levels all the way to the root.

> 2) If a cgroup has an ancestor with the usage under its memory.low/min,
>   it becomes protection, even if its memory.low/min is 0. So it
becomes
>   impossible to have unprotected cgroups in protected sub-tree.

Fair point.

One use case is where a non trivial job which has several memory accounting
subcontainers.  Is there a way to only set memory.low at the top and have
the offer protection to the job?
The case I'm thinking of is:
% cd /cgroup
% echo +memory > cgroup.subtree_control
% mkdir top
% echo +memory > top/cgroup.subtree_control
% mkdir top/part1 top/part2
% echo 1GB > top/memory.min
% (echo $BASHPID > top/part1/cgroup.procs && part1)
% (echo $BASHPID > top/part2/cgroup.procs && part2)

Empirically it's been measured that the entire workload (/top) needs 1GB to
perform well.  But we don't care how the memory is distributed between
part1,part2.  Is the strategy for that to set /top, /top/part1.min, and
/top/part2.min to 1GB?

What do you think about exposing emin and elow to user space?  I think that
would reduce admin/user confusion in situations where memory.min is
internally discounted.

(tangent) Delegation in v2 isn't something I've been able to fully
internalize yet.
The "no interior processes" rule challenges my notion of subdelegation.
My current model is where a system controller creates a container C with
C.min and then starts client manager process M in C.  Then M can choose
to further divide C's resources (e.g. C/S).  This doesn't seem possible
because v2 doesn't allow for interior processes.  So the system manager
would need to create C, set C.low, create C/sub_manager, create
C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start
M in C/sub_manager.  Then sub_manager can create and manage
C/sub_resources/S.

PS: Thanks for the memory.low and memory.min work.  Regardless of how we
proceed it's better than the upstream memory.soft_limit_in_bytes!

[RFC PATCH 2/2] memcg: add memory.min

2018-04-22 Thread Greg Thelen

The new memory.min limit is similar to memory.low, just no bypassing it
when reclaim is desparate.  Prefer oom kills before reclaim memory below
memory.min.  Sharing more code with memory_cgroup_low() is possible, but
the idea is posted here for simplicity.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h |  8 ++
 mm/memcontrol.c| 58 ++
 mm/vmscan.c|  3 ++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..22bb4a88653a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -178,6 +178,7 @@ struct mem_cgroup {
struct page_counter tcpmem;
 
/* Normal memory consumption range */
+   unsigned long min;
unsigned long low;
unsigned long high;
 
@@ -281,6 +282,7 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
 }
 
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg);
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
@@ -726,6 +728,12 @@ static inline void mem_cgroup_event(struct mem_cgroup 
*memcg,
 {
 }
 
+static inline bool mem_cgroup_min(struct mem_cgroup *root,
+ struct mem_cgroup *memcg)
+{
+   return false;
+}
+
 static inline bool mem_cgroup_low(struct mem_cgroup *root,
  struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9668f620203a..b2aaed4003b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5058,6 +5058,36 @@ static u64 memory_current_read(struct 
cgroup_subsys_state *css,
return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
+static int memory_min_show(struct seq_file *m, void *v)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+   unsigned long min = READ_ONCE(memcg->min);
+
+   if (min == PAGE_COUNTER_MAX)
+   seq_puts(m, "max\n");
+   else
+   seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE);
+
+   return 0;
+}
+
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+   char *buf, size_t nbytes, loff_t off)
+{
+   struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+   unsigned long min;
+   int err;
+
+   buf = strstrip(buf);
+   err = page_counter_memparse(buf, "max", &min);
+   if (err)
+   return err;
+
+   memcg->min = min;
+
+   return nbytes;
+}
+
 static int memory_low_show(struct seq_file *m, void *v)
 {
struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
@@ -5288,6 +5318,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NOT_ON_ROOT,
.read_u64 = memory_current_read,
},
+   {
+   .name = "min",
+   .flags = CFTYPE_NOT_ON_ROOT,
+   .seq_show = memory_min_show,
+   .write = memory_min_write,
+   },
{
.name = "low",
.flags = CFTYPE_NOT_ON_ROOT,
@@ -5336,6 +5372,28 @@ struct cgroup_subsys memory_cgrp_subsys = {
.early_init = 0,
 };
 
+/**
+ * mem_cgroup_min returns true for a memcg below its min limit.  Such memcg are
+ * excempt from reclaim.
+ */
+bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+   if (mem_cgroup_disabled())
+   return false;
+
+   if (!root)
+   root = root_mem_cgroup;
+
+   if (memcg == root)
+   return false;
+
+   for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
+   if (page_counter_read(&memcg->memory) <= memcg->min)
+   return true; /* protect */
+   }
+   return false; /* !protect */
+}
+
 /**
  * mem_cgroup_low - check if memory consumption is below the normal range
  * @root: the top ancestor of the sub-tree being checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..15ae19a38ad5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,6 +2539,9 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
unsigned long reclaimed;
unsigned long scanned;
 
+   if (mem_cgroup_min(root, memcg))
+   continue;
+
if (mem_cgroup_low(root, memcg)) {
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
-- 
2.17.0.484.g0c8726318c-goog

[RFC PATCH 0/2] memory.low,min reclaim

2018-04-22 Thread Greg Thelen

   => 0k   
  # F: 4203964k => 4203908k
  # F/g  : 2102016k => 2101992k
  # F/H  : 2101948k => 2101916k
  # i: 4204408k => 4203988k
  # i/j  : 2101984k => 2101936k
  # i/K  : 2102424k => 2102052k
  # L: 4189960k => 3886296k
  # L/m  : 2101968k => 2838704k
  # L/N  : 2087992k => 1047592k

  set -o errexit
  set -o nounset
  set -o pipefail
  
  LIM="$1"; shift
  ANTAGONIST="$1"; shift
  CGPATH=/tmp/cgroup
  
  vmtouch2() {
rm -f "$2"
(echo $BASHPID > "${CGPATH}/$1/cgroup.procs" && exec /tmp/mmap --loop 1 
--file "$2" "$3")
  }
  
  vmtouch() {
# twice to ensure slab caches are warmed up and all objs are charged to 
cgroup.
vmtouch2 "$1" "$2" "$3"
vmtouch2 "$1" "$2" "$3"
  }
  
  dump() {
for i in A A/b A/C A/D A/e F F/g F/H i i/j i/K L L/m L/N; do
  printf "%-5s: %sk\n" $i $(($(cat "${CGPATH}/${i}/memory.current")/1024))
done
  }
  
  rm -f /file_?
  if [[ -e "${CGPATH}/A" ]]; then
rmdir ${CGPATH}/?/? ${CGPATH}/?
  fi
  echo "+memory" > "${CGPATH}/cgroup.subtree_control"
  mkdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/F/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/i/cgroup.subtree_control"
  echo "+memory" > "${CGPATH}/L/cgroup.subtree_control"
  mkdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  mkdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  mkdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  mkdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  
  echo 2G > "${CGPATH}/A/memory.${LIM}"
  echo 3G > "${CGPATH}/A/b/memory.${LIM}"
  echo 1G > "${CGPATH}/A/C/memory.${LIM}"
  echo 0 > "${CGPATH}/A/D/memory.${LIM}"
  echo 10G > "${CGPATH}/A/e/memory.${LIM}"
  
  echo 2G > "${CGPATH}/F/memory.${LIM}"
  echo 3G > "${CGPATH}/F/g/memory.${LIM}"
  echo 1G > "${CGPATH}/F/H/memory.${LIM}"
  
  echo 5G > "${CGPATH}/i/memory.${LIM}"
  echo 3G > "${CGPATH}/i/j/memory.${LIM}"
  echo 1G > "${CGPATH}/i/K/memory.${LIM}"
  
  echo 2G > "${CGPATH}/L/memory.${LIM}"
  echo 4G > "${CGPATH}/L/memory.max"
  echo 3G > "${CGPATH}/L/m/memory.${LIM}"
  echo 1G > "${CGPATH}/L/N/memory.${LIM}"
  
  vmtouch A/b /file_b 2G
  vmtouch A/C /file_C 2G
  vmtouch A/D /file_D 2G
  
  vmtouch F/g /file_g 2G
  vmtouch F/H /file_H 2G
  
  vmtouch i/j /file_j 2G
  vmtouch i/K /file_K 2G
  
  vmtouch L/m /file_m 2G
  vmtouch L/N /file_N 2G
  
  vmtouch2 "$ANTAGONIST" /file_ant 150G
  echo
  echo "after $ANTAGONIST antagonist"
  dump
  
  rmdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e"
  rmdir "${CGPATH}/F/g" "${CGPATH}/F/H"
  rmdir "${CGPATH}/i/j" "${CGPATH}/i/K"
  rmdir "${CGPATH}/L/m" "${CGPATH}/L/N"
  rmdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L"
  rm /file_ant /file_b /file_C /file_D /file_g /file_H /file_j /file_K

Greg Thelen (2):
  memcg: fix memory.low
  memcg: add memory.min

 include/linux/memcontrol.h |  8 +
 mm/memcontrol.c| 70 ++
 mm/vmscan.c|  3 ++
 3 files changed, 75 insertions(+), 6 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog

[RFC PATCH 1/2] memcg: fix memory.low

2018-04-22 Thread Greg Thelen

When targeting reclaim to a memcg, protect that memcg from reclaim is
memory consumption of any level is below respective memory.low.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..9668f620203a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5341,8 +5341,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
- * Returns %true if memory consumption of @memcg, and that of all
- * ancestors up to (but not including) @root, is below the normal range.
+ * Returns %true if memory consumption of @memcg, or any of its ancestors
+ * up to (but not including) @root, is below the normal range.
  *
  * @root is exclusive; it is never low when looked at directly and isn't
  * checked when traversing the hierarchy.
@@ -5379,12 +5379,12 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct 
mem_cgroup *memcg)
if (memcg == root)
return false;
 
+   /* If any level is under, then protect @memcg from reclaim */
for (; memcg != root; memcg = parent_mem_cgroup(memcg)) {
-   if (page_counter_read(&memcg->memory) >= memcg->low)
-   return false;
+   if (page_counter_read(&memcg->memory) <= memcg->low)
+   return true; /* protect from reclaim */
}
-
-   return true;
+   return false; /* not protected from reclaim */
 }
 
 /**
-- 
2.17.0.484.g0c8726318c-goog

Re: [PATCH] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-15 Thread Greg Thelen

On Sun, Apr 15, 2018 at 5:06 AM Christoph Hellwig  wrote:

> On Fri, Apr 13, 2018 at 12:06:44AM -0700, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

> Why?  We are pushing everyone heavily to use RDMA/CM, so making it
> optional seems rather counter-intuitive.

Bugs.  We don't currently use CM.  But fuzzy testing has been finding a
fair number CM bugs, so we were thinking of disabling it until we need it.

> You'll also have to fix tons of ULPs to explicitly depend on
> INFINIBAND_ADDR_TRANS, or make code conditional on it.

I think I've identified the set of options which use
INFINIBAND_ADDR_TRANS without a kconfig depends:
* CIFS_SMB_DIRECT
* INFINIBAND_SRPT
* NVME_RDMA
* NVME_TARGET_RDMA

I have patches for the above, but need to finish the commit logs.  Let me
know if they'll be nacked and I'll just patch my kernel and forget
upstreaming.

[PATCH v3] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-14 Thread Greg Thelen

Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog

Re: [PATCH v2] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-14 Thread Greg Thelen

On Sat, Apr 14, 2018 at 8:13 AM Dennis Dalessandro <
dennis.dalessan...@intel.com> wrote:

> On 4/13/2018 1:27 PM, Greg Thelen wrote:
> > Allow INFINIBAND without INFINIBAND_ADDR_TRANS.
> >
> > Signed-off-by: Greg Thelen 
> > Cc: Tarick Bedeir 
> > Change-Id: I6fbbf8a432e467710fa65e4904b7d61880b914e5

> Forgot to remove the Gerrit thing.

> -Denny

Ack.  My bad.  Will repost.  Unfortunately checkpatch didn't notice.

[PATCH v2] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-13 Thread Greg Thelen

Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
Change-Id: I6fbbf8a432e467710fa65e4904b7d61880b914e5
---
 drivers/infiniband/Kconfig | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..2a972ed6851b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "RDMA/CM"
depends on INFINIBAND
default y
+   ---help---
+ Support for RDMA communication manager (CM).
+ This allows for a generic connection abstraction over RDMA.
 
 config INFINIBAND_ADDR_TRANS_CONFIGFS
bool
-- 
2.17.0.484.g0c8726318c-goog

[PATCH] IB: make INFINIBAND_ADDR_TRANS configurable

2018-04-13 Thread Greg Thelen

Allow INFINIBAND without INFINIBAND_ADDR_TRANS.

Signed-off-by: Greg Thelen 
Cc: Tarick Bedeir 
---
 drivers/infiniband/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index ee270e065ba9..f20a3977087c 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -61,7 +61,7 @@ config INFINIBAND_ON_DEMAND_PAGING
  pages on demand instead.
 
 config INFINIBAND_ADDR_TRANS
-   bool
+   bool "InfiniBand address translation"
depends on INFINIBAND
default y
 
-- 
2.17.0.484.g0c8726318c-goog

Re: [PATCH for-4.4] writeback: safer lock nesting

2018-04-11 Thread Greg Thelen

On Wed, Apr 11, 2018 at 1:45 AM Greg Thelen  wrote:

> lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> the page's memcg is undergoing move accounting, which occurs when a
> process leaves its memcg for a new one that has
> memory.move_charge_at_immigrate set.

> unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the
> given inode is switching writeback domains.  Switches occur when enough
> writes are issued from a new domain.

> This existing pattern is thus suspicious:
>  lock_page_memcg(page);
>  unlocked_inode_to_wb_begin(inode, &locked);
>  ...
>  unlocked_inode_to_wb_end(inode, locked);
>  unlock_page_memcg(page);

> If both inode switch and process memcg migration are both in-flight then
> unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> still holding the lock_page_memcg() irq spinlock.  This suggests the
> possibility of deadlock if an interrupt occurs before
> unlock_page_memcg().

>  truncate
>  __cancel_dirty_page
>  lock_page_memcg
>  unlocked_inode_to_wb_begin
>  unlocked_inode_to_wb_end
>  
>  
>  end_page_writeback
>  test_clear_page_writeback
>  lock_page_memcg
>  
>  unlock_page_memcg

> Due to configuration limitations this deadlock is not currently possible
> because we don't mix cgroup writeback (a cgroupv2 feature) and
> memory.move_charge_at_immigrate (a cgroupv1 feature).

> If the kernel is hacked to always claim inode switching and memcg
> moving_account, then this script triggers lockup in less than a minute:
>cd /mnt/cgroup/memory
>mkdir a b
>echo 1 > a/memory.move_charge_at_immigrate
>echo 1 > b/memory.move_charge_at_immigrate
>(
>  echo $BASHPID > a/cgroup.procs
>  while true; do
>dd if=/dev/zero of=/mnt/big bs=1M count=256
>  done
>) &
>while true; do
>  sync
>done &
>sleep 1h &
>SLEEP=$!
>while true; do
>  echo $SLEEP > a/cgroup.procs
>  echo $SLEEP > b/cgroup.procs
>done

> The deadlock does not seem possible, so it's debatable if there's
> any reason to modify the kernel.  I suggest we should to prevent future
> surprises.  And Wang Long said "this deadlock occurs three times in our
> environment", so there's more reason to apply this, even to stable.

Wang Long: I wasn't able to reproduce the 4.4 problem.  But tracing
suggests this 4.4 patch is effective.  If you can reproduce the problem in
your 4.4 environment, then it'd be nice to confirm this fixes it.  Thanks!

> [ This patch is only for 4.4 stable.  Newer stable kernels should use be
able to
>    cherry pick the upstream "writeback: safer lock nesting" patch. ]

> Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb
transaction and use it for stat updates")
> Cc: sta...@vger.kernel.org # v4.2+
> Reported-by: Wang Long 
> Signed-off-by: Greg Thelen 
> Acked-by: Michal Hocko 
> Acked-by: Wang Long 
> ---
>   fs/fs-writeback.c|  7 ---
>   include/linux/backing-dev-defs.h |  5 +
>   include/linux/backing-dev.h  | 31 +--
>   mm/page-writeback.c  | 18 +-
>   4 files changed, 35 insertions(+), 26 deletions(-)

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 22b30249fbcb..0fe667875852 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -747,11 +747,12 @@ int inode_congested(struct inode *inode, int
cong_bits)
>   */
>  if (inode && inode_to_wb_is_valid(inode)) {
>  struct bdi_writeback *wb;
> -   bool locked, congested;
> +   struct wb_lock_cookie lock_cookie = {};
> +   bool congested;

> -   wb = unlocked_inode_to_wb_begin(inode, &locked);
> +   wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
>  congested = wb_congested(wb, cong_bits);
> -   unlocked_inode_to_wb_end(inode, locked);
> +   unlocked_inode_to_wb_end(inode, &lock_cookie);
>  return congested;
>  }

> diff --git a/include/linux/backing-dev-defs.h
b/include/linux/backing-dev-defs.h
> index 140c29635069..a307c37c2e6c 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -191,6 +191,11 @@ static inline void set_bdi_congested(struct
backing_dev_info *bdi, int sync)
>  set

[PATCH v4] writeback: safer lock nesting

2018-04-11 Thread Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end


end_page_writeback
test_clear_page_writeback
lock_page_memcg

unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
echo $BASHPID > a/cgroup.procs
while true; do
  dd if=/dev/zero of=/mnt/big bs=1M count=256
done
  ) &
  while true; do
sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's
any reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.
For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and 
use it for stat updates")
Cc: sta...@vger.kernel.org # v4.2+
Reported-by: Wang Long 
Signed-off-by: Greg Thelen 
Acked-by: Michal Hocko 
Acked-by: Wang Long 
---

Changelog since v3:
- initialize wb_lock_cookie wiht {} rather than {0}.
- comment grammar fix
- commit log footer cleanup (-Change-Id, +Fixes, +Acks, +stable), though patch
  does not cleanly apply to 4.4.  I'll post a 4.4-stable specific patch.

Changelog since v2:
- explicitly initialize wb_lock_cookie to silence compiler warnings.

Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c|  7 ---
 include/linux/backing-dev-defs.h |  5 +
 include/linux/backing-dev.h  | 31 +--
 mm/page-writeback.c  | 18 +-
 4 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1280f915079b..b1178acfcb08 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 */
if (inode && inode_to_wb_is_valid(inode)) {
struct bdi_writeback *wb;
-   bool locked, congested;
+   struct wb_lock_cookie lock_cookie = {};
+   bool congested;
 
-   wb = unlocked_inode_to_wb_begin(inode, &locked);
+   wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
congested = wb_congested(wb, cong_bits);
-   unlocked_inode_to_wb_end(inode, locked);
+   unlocked_inode_to_wb_end(inode, &lock_cookie);
return congested;
}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct 
backing_dev_info *bdi, int sync)
set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+   bool locked;
+   unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..96f4a3ddfb81 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline

[PATCH for-4.4] writeback: safer lock nesting

2018-04-11 Thread Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end


end_page_writeback
test_clear_page_writeback
lock_page_memcg

unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
echo $BASHPID > a/cgroup.procs
while true; do
  dd if=/dev/zero of=/mnt/big bs=1M count=256
done
  ) &
  while true; do
sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's
any reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.

[ This patch is only for 4.4 stable.  Newer stable kernels should use be able to
  cherry pick the upstream "writeback: safer lock nesting" patch. ]

Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and 
use it for stat updates")
Cc: sta...@vger.kernel.org # v4.2+
Reported-by: Wang Long 
Signed-off-by: Greg Thelen 
Acked-by: Michal Hocko 
Acked-by: Wang Long 
---
 fs/fs-writeback.c|  7 ---
 include/linux/backing-dev-defs.h |  5 +
 include/linux/backing-dev.h  | 31 +--
 mm/page-writeback.c  | 18 +-
 4 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 22b30249fbcb..0fe667875852 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -747,11 +747,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 */
if (inode && inode_to_wb_is_valid(inode)) {
struct bdi_writeback *wb;
-   bool locked, congested;
+   struct wb_lock_cookie lock_cookie = {};
+   bool congested;
 
-   wb = unlocked_inode_to_wb_begin(inode, &locked);
+   wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
congested = wb_congested(wb, cong_bits);
-   unlocked_inode_to_wb_end(inode, locked);
+   unlocked_inode_to_wb_end(inode, &lock_cookie);
return congested;
}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 140c29635069..a307c37c2e6c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,11 @@ static inline void set_bdi_congested(struct 
backing_dev_info *bdi, int sync)
set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+   bool locked;
+   unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 89d3de3e096b..361274ce5815 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -366,7 +366,7 @@ static inline struct bdi_writeback *inode_to_wb(struct 
inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree

Re: [PATCH] writeback: safer lock nesting

2018-04-10 Thread Greg Thelen

On Tue, Apr 10, 2018 at 7:44 PM Wang Long  wrote:

> > Hi,
> >
> > [This is an automated email]
> >
> > This commit has been processed by the -stable helper bot and determined
> > to be a high probability candidate for -stable trees. (score: 44.5575)
> >
> > The bot has tested the following trees: v4.16.1, v4.15.16, v4.14.33,
v4.9.93, v4.4.127.
> >
> > v4.16.1: Build OK!
> > v4.15.16: Build OK!
> > v4.14.33: Build OK!
> > v4.9.93: Build OK!
> > v4.4.127: Failed to apply! Possible dependencies:
> >  62cccb8c8e7a ("mm: simplify lock_page_memcg()")
> Hi Sasha,

> I test the memory cgroup in lts v4.4, for this issue, 62cccb8c8e7a ("mm:
> simplify lock_page_memcg()")
> need to adjust and there are several other places that need to be fixed.

> I will make the patch for lts v4.4 if no one did.

I'm testing a 4.4-stable patch right now.  ETA is a few hours.

Re: [PATCH v3] writeback: safer lock nesting

2018-04-10 Thread Greg Thelen

On Tue, Apr 10, 2018 at 1:38 PM Andrew Morton 
wrote:

> On Mon,  9 Apr 2018 17:59:08 -0700 Greg Thelen  wrote:

> > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> > the page's memcg is undergoing move accounting, which occurs when a
> > process leaves its memcg for a new one that has
> > memory.move_charge_at_immigrate set.
> >
> > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the
> > given inode is switching writeback domains.  Switches occur when enough
> > writes are issued from a new domain.
> >
> > This existing pattern is thus suspicious:
> > lock_page_memcg(page);
> > unlocked_inode_to_wb_begin(inode, &locked);
> > ...
> > unlocked_inode_to_wb_end(inode, locked);
> > unlock_page_memcg(page);
> >
> > If both inode switch and process memcg migration are both in-flight then
> > unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> > still holding the lock_page_memcg() irq spinlock.  This suggests the
> > possibility of deadlock if an interrupt occurs before
> > unlock_page_memcg().
> >
> > truncate
> > __cancel_dirty_page
> > lock_page_memcg
> > unlocked_inode_to_wb_begin
> > unlocked_inode_to_wb_end
> > 
> > 
> > end_page_writeback
> > test_clear_page_writeback
> > lock_page_memcg
> > 
> > unlock_page_memcg
> >
> > Due to configuration limitations this deadlock is not currently possible
> > because we don't mix cgroup writeback (a cgroupv2 feature) and
> > memory.move_charge_at_immigrate (a cgroupv1 feature).
> >
> > If the kernel is hacked to always claim inode switching and memcg
> > moving_account, then this script triggers lockup in less than a minute:
> >   cd /mnt/cgroup/memory
> >   mkdir a b
> >   echo 1 > a/memory.move_charge_at_immigrate
> >   echo 1 > b/memory.move_charge_at_immigrate
> >   (
> > echo $BASHPID > a/cgroup.procs
> > while true; do
> >   dd if=/dev/zero of=/mnt/big bs=1M count=256
> > done
> >   ) &
> >   while true; do
> > sync
> >   done &
> >   sleep 1h &
> >   SLEEP=$!
> >   while true; do
> > echo $SLEEP > a/cgroup.procs
> > echo $SLEEP > b/cgroup.procs
> >   done
> >
> > Given the deadlock is not currently possible, it's debatable if there's
> > any reason to modify the kernel.  I suggest we should to prevent future
> > surprises.
> >
> > ...
> >
> > Changelog since v2:
> > - explicitly initialize wb_lock_cookie to silence compiler warnings.

> But only in some places.  What's up with that?

I annotated it in places where my compiler was complaining about.  But
you're right.  It's better to init all 4.

> >
> > ...
> >
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -346,7 +346,7 @@ static inline struct bdi_writeback
*inode_to_wb(const struct inode *inode)
> >  /**
> >   * unlocked_inode_to_wb_begin - begin unlocked inode wb access
transaction
> >   * @inode: target inode
> > - * @lockedp: temp bool output param, to be passed to the end function
> > + * @cookie: output param, to be passed to the end function
> >   *
> >   * The caller wants to access the wb associated with @inode but isn't
> >   * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
> > @@ -354,12 +354,11 @@ static inline struct bdi_writeback
*inode_to_wb(const struct inode *inode)
> >   * association doesn't change until the transaction is finished with
> >   * unlocked_inode_to_wb_end().
> >   *
> > - * The caller must call unlocked_inode_to_wb_end() with *@lockdep
> > - * afterwards and can't sleep during transaction.  IRQ may or may not
be
> > - * disabled on return.
> > + * The caller must call unlocked_inode_to_wb_end() with *@cookie
afterwards and
> > + * can't sleep during transaction.  IRQ may or may not be disabled on
return.
> >   */

> Grammar is a bit awkward here,

> >
> > ...
> >
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2501,13 +2501,13 @@ void account_page_redirty(struct page *page)
> >   if (mapping && mapping_cap_account_dirty(mapping)) {
> >   struct inode *inode =

Re: [PATCH v3] writeback: safer lock nesting

2018-04-10 Thread Greg Thelen

On Tue, Apr 10, 2018 at 1:15 AM Wang Long  wrote:

> > lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
> > the page's memcg is undergoing move accounting, which occurs when a
> > process leaves its memcg for a new one that has
> > memory.move_charge_at_immigrate set.
> >
> > unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the
> > given inode is switching writeback domains.  Switches occur when enough
> > writes are issued from a new domain.
> >
> > This existing pattern is thus suspicious:
> >  lock_page_memcg(page);
> >  unlocked_inode_to_wb_begin(inode, &locked);
> >  ...
> >  unlocked_inode_to_wb_end(inode, locked);
> >  unlock_page_memcg(page);
> >
> > If both inode switch and process memcg migration are both in-flight then
> > unlocked_inode_to_wb_end() will unconditionally enable interrupts while
> > still holding the lock_page_memcg() irq spinlock.  This suggests the
> > possibility of deadlock if an interrupt occurs before
> > unlock_page_memcg().
> >
> >  truncate
> >  __cancel_dirty_page
> >  lock_page_memcg
> >  unlocked_inode_to_wb_begin
> >  unlocked_inode_to_wb_end
> >  
> >  
> >  end_page_writeback
> >  test_clear_page_writeback
> >  lock_page_memcg
> >  
> >  unlock_page_memcg
> >
> > Due to configuration limitations this deadlock is not currently possible
> > because we don't mix cgroup writeback (a cgroupv2 feature) and
> > memory.move_charge_at_immigrate (a cgroupv1 feature).
> >
> > If the kernel is hacked to always claim inode switching and memcg
> > moving_account, then this script triggers lockup in less than a minute:
> >cd /mnt/cgroup/memory
> >mkdir a b
> >echo 1 > a/memory.move_charge_at_immigrate
> >echo 1 > b/memory.move_charge_at_immigrate
> >(
> >  echo $BASHPID > a/cgroup.procs
> >  while true; do
> >dd if=/dev/zero of=/mnt/big bs=1M count=256
> >  done
> >) &
> >while true; do
> >  sync
> >done &
> >sleep 1h &
> >SLEEP=$!
> >while true; do
> >  echo $SLEEP > a/cgroup.procs
> >  echo $SLEEP > b/cgroup.procs
> >done
> >
> > Given the deadlock is not currently possible, it's debatable if there's
> > any reason to modify the kernel.  I suggest we should to prevent future
> > surprises.
> This deadlock occurs three times in our environment，

> this deadlock occurs three times in our environment. It is better to cc
stable kernel and
> backport it.

That's interesting.  Are you using cgroup v1 or v2?  Do you enable
memory.move_charge_at_immigrate?
I assume you've been using 4.4 stable.  I'll look closer at it at a 4.4
stable backport.

[PATCH v3] writeback: safer lock nesting

2018-04-09 Thread Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Switches occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end


end_page_writeback
test_clear_page_writeback
lock_page_memcg

unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
echo $BASHPID > a/cgroup.procs
while true; do
  dd if=/dev/zero of=/mnt/big bs=1M count=256
done
  ) &
  while true; do
sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
  done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel.  I suggest we should to prevent future
surprises.

Reported-by: Wang Long 
Signed-off-by: Greg Thelen 
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
---
Changelog since v2:
- explicitly initialize wb_lock_cookie to silence compiler warnings.

Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c|  7 ---
 include/linux/backing-dev-defs.h |  5 +
 include/linux/backing-dev.h  | 30 --
 mm/page-writeback.c  | 18 +-
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1280f915079b..f4b2f6625913 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 */
if (inode && inode_to_wb_is_valid(inode)) {
struct bdi_writeback *wb;
-   bool locked, congested;
+   struct wb_lock_cookie lock_cookie;
+   bool congested;
 
-   wb = unlocked_inode_to_wb_begin(inode, &locked);
+   wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
congested = wb_congested(wb, cong_bits);
-   unlocked_inode_to_wb_end(inode, locked);
+   unlocked_inode_to_wb_end(inode, &lock_cookie);
return congested;
}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct 
backing_dev_info *bdi, int sync)
set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+   bool locked;
+   unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..1d744c61d996 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const 
struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const 
struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_e

[PATCH v2] writeback: safer lock nesting

2018-04-06 Thread Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Swithces occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end


end_page_writeback
test_clear_page_writeback
lock_page_memcg

unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
echo $BASHPID > a/cgroup.procs
while true; do
  dd if=/dev/zero of=/mnt/big bs=1M count=256
done
  ) &
  while true; do
sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
  done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel.  I suggest we should to prevent future
surprises.

Reported-by: Wang Long 
Signed-off-by: Greg Thelen 
---
Changelog since v1:
- add wb_lock_cookie to record lock context.

 fs/fs-writeback.c|  7 ---
 include/linux/backing-dev-defs.h |  5 +
 include/linux/backing-dev.h  | 30 --
 mm/page-writeback.c  | 18 +-
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fee568a..518f72754a77 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -745,11 +745,12 @@ int inode_congested(struct inode *inode, int cong_bits)
 */
if (inode && inode_to_wb_is_valid(inode)) {
struct bdi_writeback *wb;
-   bool locked, congested;
+   struct wb_lock_cookie lock_cookie;
+   bool congested;
 
-   wb = unlocked_inode_to_wb_begin(inode, &locked);
+   wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
congested = wb_congested(wb, cong_bits);
-   unlocked_inode_to_wb_end(inode, locked);
+   unlocked_inode_to_wb_end(inode, &lock_cookie);
return congested;
}
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..0bd432a4d7bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -223,6 +223,11 @@ static inline void set_bdi_congested(struct 
backing_dev_info *bdi, int sync)
set_wb_congested(bdi->wb.congested, sync);
 }
 
+struct wb_lock_cookie {
+   bool locked;
+   unsigned long flags;
+};
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..1d744c61d996 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -346,7 +346,7 @@ static inline struct bdi_writeback *inode_to_wb(const 
struct inode *inode)
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
- * @lockedp: temp bool output param, to be passed to the end function
+ * @cookie: output param, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -354,12 +354,11 @@ static inline struct bdi_writeback *inode_to_wb(const 
struct inode *inode)
  * association doesn't change until the transaction is finished with
  * unlocked_inode_to_wb_end().
  *
- * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction.  IRQ may or may not be
- * disabled on return.
+ * The caller must cal

Re: [PATCH] writeback: safer lock nesting

2018-04-06 Thread Greg Thelen

On Fri, Apr 6, 2018 at 1:07 AM Michal Hocko  wrote:

> On Fri 06-04-18 01:03:24, Greg Thelen wrote:
> [...]
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index d4d04fee568a..d51bae5a53e2 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -746,10 +746,11 @@ int inode_congested(struct inode *inode, int
cong_bits)
> >   if (inode && inode_to_wb_is_valid(inode)) {
> >   struct bdi_writeback *wb;
> >   bool locked, congested;
> > + unsigned long flags;
> >
> > - wb = unlocked_inode_to_wb_begin(inode, &locked);
> > + wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);

> Wouldn't it be better to have a cookie (struct) rather than 2 parameters
> and let unlocked_inode_to_wb_end DTRT?

Nod.  I'll post a V2 patch with that change.

> >   congested = wb_congested(wb, cong_bits);
> > - unlocked_inode_to_wb_end(inode, locked);
> > + unlocked_inode_to_wb_end(inode, locked, flags);
> >   return congested;
> >   }
> --
> Michal Hocko
> SUSE Labs

[PATCH] writeback: safer lock nesting

2018-04-06 Thread Greg Thelen

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the
given inode is switching writeback domains.  Swithces occur when enough
writes are issued from a new domain.

This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before
unlock_page_memcg().

truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end


end_page_writeback
test_clear_page_writeback
lock_page_memcg

unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
echo $BASHPID > a/cgroup.procs
while true; do
  dd if=/dev/zero of=/mnt/big bs=1M count=256
done
  ) &
  while true; do
sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
  done

Given the deadlock is not currently possible, it's debatable if there's
any reason to modify the kernel.  I suggest we should to prevent future
surprises.

Reported-by: Wang Long 
Signed-off-by: Greg Thelen 
---
 fs/fs-writeback.c   |  5 +++--
 include/linux/backing-dev.h | 18 --
 mm/page-writeback.c | 15 +--
 3 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fee568a..d51bae5a53e2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -746,10 +746,11 @@ int inode_congested(struct inode *inode, int cong_bits)
if (inode && inode_to_wb_is_valid(inode)) {
struct bdi_writeback *wb;
bool locked, congested;
+   unsigned long flags;
 
-   wb = unlocked_inode_to_wb_begin(inode, &locked);
+   wb = unlocked_inode_to_wb_begin(inode, &locked, &flags);
congested = wb_congested(wb, cong_bits);
-   unlocked_inode_to_wb_end(inode, locked);
+   unlocked_inode_to_wb_end(inode, locked, flags);
return congested;
}
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3e4ce54d84ab..6c74b64d6f56 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -347,6 +347,7 @@ static inline struct bdi_writeback *inode_to_wb(const 
struct inode *inode)
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
  * @lockedp: temp bool output param, to be passed to the end function
+ * @flags: saved irq flags, to be passed to the end function
  *
  * The caller wants to access the wb associated with @inode but isn't
  * holding inode->i_lock, mapping->tree_lock or wb->list_lock.  This
@@ -359,7 +360,8 @@ static inline struct bdi_writeback *inode_to_wb(const 
struct inode *inode)
  * disabled on return.
  */
 static inline struct bdi_writeback *
-unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
+unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp,
+  unsigned long *flags)
 {
rcu_read_lock();
 
@@ -370,7 +372,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool 
*lockedp)
*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
if (unlikely(*lockedp))
-   spin_lock_irq(&inode->i_mapping->tree_lock);
+   spin_lock_irqsave(&inode->i_mapping->tree_lock, *flags);
 
/*
 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
@@ -383,11 +385,13 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool 
*lockedp)
  * unlocked_inode_to_wb_end - end inode wb access transaction
  * @inode: target inode
  * @locked: *@lockedp from unlocked_inode_to_wb_begin()
+ * @flags: *@flags from unlocked_in

Re: [RFC] Is it correctly that the usage for spin_{lock|unlock}_irq in clear_page_dirty_for_io

2018-04-03 Thread Greg Thelen

On Tue, Apr 3, 2018 at 5:03 AM Michal Hocko  wrote:

> On Mon 02-04-18 19:50:50, Wang Long wrote:
> >
> > Hi,  Johannes Weiner and Tejun Heo
> >
> > I use linux-4.4.y to test the new cgroup controller io and the current
> > stable kernel linux-4.4.y has the follow logic
> >
> >
> > int clear_page_dirty_for_io(struct page *page){
> > ...
> > ...
> > memcg = mem_cgroup_begin_page_stat(page); --(a)
> > wb = unlocked_inode_to_wb_begin(inode, &locked);
-(b)
> > if (TestClearPageDirty(page)) {
> > mem_cgroup_dec_page_stat(memcg,
MEM_CGROUP_STAT_DIRTY);
> > dec_zone_page_state(page, NR_FILE_DIRTY);
> > dec_wb_stat(wb, WB_RECLAIMABLE);
> > ret =1;
> > }
> > unlocked_inode_to_wb_end(inode, locked); ---(c)
> > mem_cgroup_end_page_stat(memcg); -(d)
> > return ret;
> > ...
> > ...
> > }
> >
> >
> > when memcg is moving, and I_WB_SWITCH flags for inode is set. the logic
> > is the following:
> >
> >
> > spin_lock_irqsave(&memcg->move_lock, flags); -(a)
> > spin_lock_irq(&inode->i_mapping->tree_lock); (b)
> > spin_unlock_irq(&inode->i_mapping->tree_lock); ---(c)
> > spin_unlock_irqrestore(&memcg->move_lock, flags); ---(d)
> >
> >
> > after (c) , the local irq is enabled. I think it is not correct.
> >
> > We get a deadlock backtrace after (c), the cpu get an softirq and in the
> > irq it also call mem_cgroup_begin_page_stat to lock the same
> > memcg->move_lock.
> >
> > Since the conditions are too harsh, this scenario is difficult to
> > reproduce.  But it really exists.
> >
> > So how about change (b) (c) to spin_lock_irqsave/spin_lock_irqrestore?

> Yes, it seems we really need this even for the current tree. Please note
> that At least clear_page_dirty_for_io doesn't lock memcg anymore.
> __cancel_dirty_page still uses lock_page_memcg though (former
> mem_cgroup_begin_page_stat).
> --
> Michal Hocko
> SUSE Labs

I agree the issue looks real in 4.4 stable and upstream.  It seems like
unlocked_inode_to_wb_begin/_end should use spin_lock_irqsave/restore.

I'm testing a little patch now.

Re: [PATCH] RDMA/ucma: reject AF_IB ip multicast requests

2018-03-30 Thread Greg Thelen

On Thu, Mar 29, 2018 at 9:24 PM, Greg Thelen  wrote:
> syzbot discovered that ucma_join_ip_multicast() mishandles AF_IB request
> addresses.  If an RDMA_USER_CM_CMD_JOIN_IP_MCAST request has
> cmd.addr.sa_family=AF_IB then ucma_join_ip_multicast() reads beyond the
> end of its cmd.addr.
>
> Reject non IP RDMA_USER_CM_CMD_JOIN_IP_MCAST requests.
> RDMA_USER_CM_CMD_JOIN_MCAST is interface for AF_IB multicast.
>
> And add a buffer length safety check.
>
> Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when 
> joining multicast")
> Signed-off-by: Greg Thelen 
> ---
>  drivers/infiniband/core/ucma.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)

My patch is no longer needed.  linus/master recently picked up
84652aefb347 ("RDMA/ucma: Introduce safer rdma_addr_size() variants")
which fixes the same issue.

> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> index e5a1e7d81326..e410e03940ff 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> @@ -1423,11 +1423,19 @@ static ssize_t ucma_join_ip_multicast(struct 
> ucma_file *file,
> if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
> return -EFAULT;
>
> +   switch (cmd.addr.sin6_family) {
> +   case AF_INET:
> +   case AF_INET6:
> +   break;
> +   default:
> +   return -EINVAL;
> +   }
> +
> join_cmd.response = cmd.response;
> join_cmd.uid = cmd.uid;
> join_cmd.id = cmd.id;
> join_cmd.addr_size = rdma_addr_size((struct sockaddr *) &cmd.addr);
> -   if (!join_cmd.addr_size)
> +   if (!join_cmd.addr_size || join_cmd.addr_size > sizeof(cmd.addr))
> return -EINVAL;
>
> join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER;
> --
> 2.17.0.rc1.321.gba9d0f2565-goog
>

[PATCH] RDMA/ucma: reject AF_IB ip multicast requests

2018-03-29 Thread Greg Thelen

syzbot discovered that ucma_join_ip_multicast() mishandles AF_IB request
addresses.  If an RDMA_USER_CM_CMD_JOIN_IP_MCAST request has
cmd.addr.sa_family=AF_IB then ucma_join_ip_multicast() reads beyond the
end of its cmd.addr.

Reject non IP RDMA_USER_CM_CMD_JOIN_IP_MCAST requests.
RDMA_USER_CM_CMD_JOIN_MCAST is interface for AF_IB multicast.

And add a buffer length safety check.

Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when joining 
multicast")
Signed-off-by: Greg Thelen 
---
 drivers/infiniband/core/ucma.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index e5a1e7d81326..e410e03940ff 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1423,11 +1423,19 @@ static ssize_t ucma_join_ip_multicast(struct ucma_file 
*file,
if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
return -EFAULT;
 
+   switch (cmd.addr.sin6_family) {
+   case AF_INET:
+   case AF_INET6:
+   break;
+   default:
+   return -EINVAL;
+   }
+
join_cmd.response = cmd.response;
join_cmd.uid = cmd.uid;
join_cmd.id = cmd.id;
join_cmd.addr_size = rdma_addr_size((struct sockaddr *) &cmd.addr);
-   if (!join_cmd.addr_size)
+   if (!join_cmd.addr_size || join_cmd.addr_size > sizeof(cmd.addr))
return -EINVAL;
 
join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER;
-- 
2.17.0.rc1.321.gba9d0f2565-goog

Re: [PATCH v2] mm, shrinker: make shrinker_list lockless

2017-11-08 Thread Greg Thelen

(off list)

Shakeel Butt  wrote:

> In our production, we have observed that the job loader gets stuck for
> 10s of seconds while doing mount operation. It turns out that it was
> stuck in register_shrinker() and some unrelated job was under memory
> pressure and spending time in shrink_slab(). Our machines have a lot
> of shrinkers registered and jobs under memory pressure has to traverse
> all of those memcg-aware shrinkers and do affect unrelated jobs which
> want to register their own shrinkers.
>
> This patch has made the shrinker_list traversal lockless and shrinker
> register remain fast. For the shrinker unregister, atomic counter
> has been introduced to avoid synchronize_rcu() call. The fields of
> struct shrinker has been rearraged to make sure that the size does
> not increase for x86_64.
>
> The shrinker functions are allowed to reschedule() and thus can not
> be called with rcu read lock. One way to resolve that is to use
> srcu read lock but then ifdefs has to be used as SRCU is behind
> CONFIG_SRCU. Another way is to just release the rcu read lock before
> calling the shrinker and reacquire on the return. The atomic counter
> will make sure that the shrinker entry will not be freed under us.
>
> Signed-off-by: Shakeel Butt 
> ---
> Changelog since v1:
> - release and reacquire rcu lock across shrinker call.
>
>  include/linux/shrinker.h |  4 +++-
>  mm/vmscan.c  | 54 
> ++--
>  2 files changed, 37 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 388ff2936a87..434b76ef9367 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -60,14 +60,16 @@ struct shrinker {
>   unsigned long (*scan_objects)(struct shrinker *,
> struct shrink_control *sc);
>  
> + unsigned int flags;
>   int seeks;  /* seeks to recreate an obj */
>   long batch; /* reclaim batch size, 0 = default */
> - unsigned long flags;
>  
>   /* These are for internal use */
>   struct list_head list;
>   /* objs pending delete, per node */
>   atomic_long_t *nr_deferred;
> + /* Number of active do_shrink_slab calls to this shrinker */
> + atomic_t nr_active;
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eb2f0315b8c0..6cec46ac6d95 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -157,7 +157,7 @@ int vm_swappiness = 60;
>  unsigned long vm_total_pages;
>  
>  static LIST_HEAD(shrinker_list);
> -static DECLARE_RWSEM(shrinker_rwsem);
> +static DEFINE_SPINLOCK(shrinker_lock);
>  
>  #ifdef CONFIG_MEMCG
>  static bool global_reclaim(struct scan_control *sc)
> @@ -285,21 +285,42 @@ int register_shrinker(struct shrinker *shrinker)
>   if (!shrinker->nr_deferred)
>   return -ENOMEM;
>  
> - down_write(&shrinker_rwsem);
> - list_add_tail(&shrinker->list, &shrinker_list);
> - up_write(&shrinker_rwsem);
> + atomic_set(&shrinker->nr_active, 0);
> + spin_lock(&shrinker_lock);
> + list_add_tail_rcu(&shrinker->list, &shrinker_list);
> + spin_unlock(&shrinker_lock);
>   return 0;
>  }
>  EXPORT_SYMBOL(register_shrinker);
>  
> +static void get_shrinker(struct shrinker *shrinker)
> +{
> + atomic_inc(&shrinker->nr_active);
> + rcu_read_unlock();
> +}
> +
> +static void put_shrinker(struct shrinker *shrinker)
> +{
> + rcu_read_lock();
> + if (!atomic_dec_return(&shrinker->nr_active))
> + wake_up_atomic_t(&shrinker->nr_active);
> +}
> +
> +static int shrinker_wait_atomic_t(atomic_t *p)
> +{
> + schedule();
> + return 0;
> +}
>  /*
>   * Remove one
>   */
>  void unregister_shrinker(struct shrinker *shrinker)
>  {
> - down_write(&shrinker_rwsem);
> - list_del(&shrinker->list);
> - up_write(&shrinker_rwsem);
> + spin_lock(&shrinker_lock);
> + list_del_rcu(&shrinker->list);
> + spin_unlock(&shrinker_lock);
> + wait_on_atomic_t(&shrinker->nr_active, shrinker_wait_atomic_t,
> +  TASK_UNINTERRUPTIBLE);

What keeps us from returning to the caller which could kfree the
shrinker before shrink_slab() uses it for list iteration?

>   kfree(shrinker->nr_deferred);
>  }
>  EXPORT_SYMBOL(unregister_shrinker);
> @@ -468,18 +489,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>   if (nr_scanned == 0)
>   nr_scanned = SWAP_CLUSTER_MAX;
>  
> - if (!down_read_trylock(&shrinker_rwsem)) {
> - /*
> -  * If we would return 0, our callers would understand that we
> -  * have nothing else to shrink and give up trying. By returning
> -  * 1 we keep it going and assume we'll be able to shrink next
> -  * time.
> -  */
> - freed = 1;
> - goto out;
> - }
> + rcu_read_lock();
>  
> - list_for_each_

Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-26 Thread Greg Thelen

Michal Hocko wrote:
> Greg Thelen wrote:
> > So a force charge fallback might be a needed even with oom killer successful
> > invocations.  Or we'll need to teach out_of_memory() to return three values
> > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on
> > NEW_VICTIM.
> 
> No we, really want to wait for the oom victim to do its job. The only thing we
> should be worried about is when out_of_memory doesn't invoke the reaper. There
> is only one case like that AFAIK - GFP_NOFS request. I have to think about
> this case some more. We currently fail in that case the request.

Nod, but I think only wait a short time (more below).  The
schedule_timeout_killable(1) in out_of_memory() seems ok to me.  I don't
think there's a problem overcharging a little bit to expedite oom
killing.

Johannes Weiner wrote:
> True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum,
> even if the OOM killer indicates a kill has been issued. What you propose
> makes sense too.

Sounds good.

It looks like the oom reaper will wait 1 second
(MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP,
which enables the oom killer to select another victim.  Repeated
try_charge() => out_of_memory() calls will return true while there's a
pending victim.  After the first call, out_of_memory() doesn't appear to
sleep.  So I assume try_charge() would quickly burn through
MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to
overcharging.  IMO, this is fine because:
1) it's possible the victim wants locks held by try_charge caller.  So
   waiting for the oom reaper to timeout and out_of_memory to select
   additional victims would kill more than required.
2) waiting 1 sec to detect a livelock between try_charge() and pending
   oom victim seems unfortunate.

Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-25 Thread Greg Thelen

Johannes Weiner  wrote:

> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
>> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
>> > "Safe" is a vague term, and it doesn't make much sense to me in this
>> > situation. The OOM behavior should be predictable and consistent.
>> > 
>> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
>> > don't have to do that in memcg because we're not physically limited.
>> 
>> OK, so here seems to be the biggest disconnect. Being physically or
>> artificially constrained shouldn't make much difference IMHO. In both
>> cases the resource is simply limited for the consumer. And once all the
>> attempts to fit within the limit fail then the request for the resource
>> has to fail.
>
> It's a huge difference. In the global case, we have to make trade-offs
> to not deadlock the kernel. In the memcg case, we have to make a trade
> off between desirable OOM behavior and desirable meaning of memory.max.
>
> If we can borrow a resource temporarily from the ether to resolve the
> OOM situation, I don't see why we shouldn't. We're only briefly
> ignoring the limit to make sure the allocating task isn't preventing
> the OOM victim from exiting or the OOM reaper from reaping. It's more
> of an implementation detail than interface.
>
> The only scenario you brought up where this might be the permanent
> overrun is the single, oom-disabled task. And I explained why that is
> a silly argument, why that's the least problematic consequence of
> oom-disabling, and why it probably shouldn't even be configurable.
>
> The idea that memory.max must never be breached is an extreme and
> narrow view. As Greg points out, there are allocations we do not even
> track. There are other scenarios that force allocations. They may
> violate the limit on paper, but they're not notably weakening the goal
> of memory.max - isolating workloads from each other.
>
> Let's look at it this way.
>
> There are two deadlock relationships the OOM killer needs to solve
> between the triggerer and the potential OOM victim:
>
>   #1 Memory. The triggerer needs memory that the victim has,
>   but the victim needs some additional memory to release it.
>
>   #2 Locks. The triggerer needs memory that the victim has, but
>   the victim needs a lock the triggerer holds to release it.
>
> We have no qualms letting the victim temporarily (until the victim's
> exit) ignore memory.max to resolve the memory deadlock #1.
>
> I don't understand why it's such a stretch to let the triggerer
> temporarily (until the victim's exit) ignore memory.max to resolve the
> locks deadlock #2. [1]
>
> We need both for the OOM killer to function correctly.
>
> We've solved #1 both for memcg and globally. But we haven't solved #2.
> Global can still deadlock, and memcg copped out and returns -ENOMEM.
>
> Adding speculative OOM killing before the -ENOMEM makes things more
> muddy and unpredictable. It doesn't actually solve deadlock #2.
>
> [1] And arguably that's what we should be doing in the global case
> too: give the triggerer access to reserves. If you recall this
> thread here: https://patchwork.kernel.org/patch/6088511/
>
>> > > So the only change I am really proposing is to keep retrying as long
>> > > as the oom killer makes a forward progress and ENOMEM otherwise.
>> > 
>> > That's the behavior change I'm against.
>> 
>> So just to make it clear you would be OK with the retry on successful
>> OOM killer invocation and force charge on oom failure, right?
>
> Yeah, that sounds reasonable to me.

Assuming we're talking about retrying within try_charge(), then there's
a detail to iron out...

If there is a pending oom victim blocked on a lock held by try_charge() caller
(the "#2 Locks" case), then I think repeated calls to out_of_memory() will
return true until the victim either gets MMF_OOM_SKIP or disappears.  So a force
charge fallback might be a needed even with oom killer successful invocations.
Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.

Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-24 Thread Greg Thelen

Michal Hocko  wrote:

> On Tue 24-10-17 14:58:54, Johannes Weiner wrote:
>> On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
>> > On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
>> > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
>> > [...]
>> > > > What would prevent a runaway in case the only process in the memcg is
>> > > > oom unkillable then?
>> > > 
>> > > In such a scenario, the page fault handler would busy-loop right now.
>> > > 
>> > > Disabling oom kills is a privileged operation with dire consequences
>> > > if used incorrectly. You can panic the kernel with it. Why should the
>> > > cgroup OOM killer implement protective semantics around this setting?
>> > > Breaching the limit in such a setup is entirely acceptable.
>> > > 
>> > > Really, I think it's an enormous mistake to start modeling semantics
>> > > based on the most contrived and non-sensical edge case configurations.
>> > > Start the discussion with what is sane and what most users should
>> > > optimally experience, and keep the cornercases simple.
>> > 
>> > I am not really seeing your concern about the semantic. The most
>> > important property of the hard limit is to protect from runaways and
>> > stop them if they happen. Users can use the softer variant (high limit)
>> > if they are not afraid of those scenarios. It is not so insane to
>> > imagine that a master task (which I can easily imagine would be oom
>> > disabled) has a leak and runaway as a result.
>> 
>> Then you're screwed either way. Where do you return -ENOMEM in a page
>> fault path that cannot OOM kill anything? Your choice is between
>> maintaining the hard limit semantics or going into an infinite loop.
>
> in the PF path yes. And I would argue that this is a reasonable
> compromise to provide the gurantee the hard limit is giving us (and
> the resulting isolation which is the whole point). Btw. we are already
> having that behavior. All we are talking about is the non-PF path which
> ENOMEMs right now and the meta-patch tried to handle it more gracefully
> and only ENOMEM when there is no other option.
>
>> I fail to see how this setup has any impact on the semantics we pick
>> here. And even if it were real, it's really not what most users do.
>
> sure, such a scenario is really on the edge but my main point was that
> the hard limit is an enforcement of an isolation guarantee (as much as
> possible of course).
>
>> > We are not talking only about the page fault path. There are other
>> > allocation paths to consume a lot of memory and spill over and break
>> > the isolation restriction. So it makes much more sense to me to fail
>> > the allocation in such a situation rather than allow the runaway to
>> > continue. Just consider that such a situation shouldn't happen in
>> > the first place because there should always be an eligible task to
>> > kill - who would own all the memory otherwise?
>> 
>> Okay, then let's just stick to the current behavior.
>
> I am definitely not pushing that thing right now. It is good to discuss
> it, though. The more kernel allocations we will track the more careful we
> will have to be. So maybe we will have to reconsider the current
> approach. I am not sure we need it _right now_ but I feel we will
> eventually have to reconsider it.

The kernel already attempts to charge radix_tree_nodes.  If they fail
then we fallback to unaccounted memory.  So the memcg limit already
isn't an air tight constraint.

I agree that unchecked overcharging could be bad, but wonder if we could
overcharge kmem so long as there is a pending oom kill victim.  If
current is the victim or no victim, then fail allocations (as is
currently done).  The current thread can loop in syscall exit until
usage is reconciled (either via reclaim or kill).  This seems consistent
with pagefault oom handling and compatible with overcommit use case.

Here's an example of an overcommit case we've found quite useful.  Memcg A has
memory which is shared between children B and C.  B is more important the C.
B and C are unprivileged, neither has the authority to kill the other.

/A(limit=100MB) - B(limit=80MB,prio=high)
 \ C(limit=80MB,prio=low)

If memcg charge drives B.usage+C.usage>=A.limit, then C should be killed due to
its low priority.  B pagefault can kill, but if a syscall returns ENOMEM then B
can't do anything useful with it.

I know there are related oom killer victim selections discussions afoot.
Even with classic oom_score_adj killing it's possible to heavily bias
oom killer to select C over B.

Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-12 Thread Greg Thelen

Johannes Weiner  wrote:

> On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
>> On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
>> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
>> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
>> > > > It's consistent in the sense that only page faults enable the memcg
>> > > > OOM killer. It's not the type of memory that decides, it's whether the
>> > > > allocation context has a channel to communicate an error to userspace.
>> > > > 
>> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
>> > > > concern at the time this patch was merged, although there haven't been
>> > > > any reports so far,
>> > > 
>> > > Well, I remember reports about MAP_POPULATE breaking or at least having
>> > > an unexpected behavior.
>> > 
>> > Hm, that slipped past me. Did we do something about these? Or did they
>> > fix userspace?
>> 
>> Well it was mostly LTP complaining. I have tried to fix that but Linus
>> was against so we just documented that this is possible and MAP_POPULATE
>> is not a guarantee.
>
> Okay, makes sense. I wouldn't really count that as a regression.
>
>> > > Well, we should be able to do that with the oom_reaper. At least for v2
>> > > which doesn't have synchronous userspace oom killing.
>> > 
>> > I don't see how the OOM reaper is a guarantee as long as we have this:
>> > 
>> >if (!down_read_trylock(&mm->mmap_sem)) {
>> >ret = false;
>> >trace_skip_task_reaping(tsk->pid);
>> >goto unlock_oom;
>> >}
>> 
>> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
>> killer if we fail to get the mmap_sem after several attempts. This will
>> allow to find a new victim. So we shouldn't deadlock.
>
> It's less likely to deadlock, but not exactly deadlock-free. There
> might not BE any other mm's holding significant amounts of memory.
>
>> > What do you mean by 'v2'?
>> 
>> cgroup v2 because the legacy memcg allowed sync wait for the oom killer
>> and that would be a bigger problem from a deep callchains for obevious
>> reasons.
>
> Actually, the async oom killing code isn't dependent on cgroup
> version. cgroup1 doesn't wait inside the charge context, either.
>
>> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit 
>> > > > > checker,
>> > > > >which will oom kill if needed.
>> > > > 
>> > > > This makes the most sense to me. Architecturally, I imagine this would
>> > > > look like b), with an OOM handler at the point of return to userspace,
>> > > > except that we'd overcharge instead of retrying the syscall.
>> > > 
>> > > I do not think we should break the hard limit semantic if possible. We
>> > > can currently allow that for allocations which are very short term (oom
>> > > victims) or too important to fail but allowing that for kmem charges in
>> > > general sounds like too easy to runaway.
>> > 
>> > I'm not sure there is a convenient way out of this.
>> > 
>> > If we want to respect the hard limit AND guarantee allocation success,
>> > the OOM killer has to free memory reliably - which it doesn't. But if
>> > it did, we could also break the limit temporarily and have the OOM
>> > killer replenish the pool before that userspace app can continue. The
>> > allocation wouldn't have to be short-lived, since memory is fungible.
>> 
>> If we can guarantee the oom killer is started then we can allow temporal
>> access to reserves which is already implemented even for memcg. The
>> thing is we do not invoke the oom killer...
>
> You lost me here. Which reserves?
>
> All I'm saying is that, when the syscall-context fails to charge, we
> should do mem_cgroup_oom() to set up the async OOM killer, let the
> charge succeed over the hard limit - since the OOM killer will most
> likely get us back below the limit - then mem_cgroup_oom_synchronize()
> before the syscall returns to userspace.
>
> That would avoid returning -ENOMEM from syscalls without the risk of
> the hard limit deadlocking - at the risk of sometimes overrunning the
> hard limit, but that seems like the least problematic behavior out of
> the three.

Overcharging kmem with deferred reconciliation sounds good to me.

A few comments (not reasons to avoid this):

1) If a task is moved between memcg it seems possible to overcharge
   multiple oom memcg for different kmem/user allocations.
   mem_cgroup_oom_synchronize() would see at most one oom memcg in
   current->memcg_in_oom.  Thus it'd only reconcile a single memcg.  But
   that seems pretty rare and the next charge to any of the other memcg
   would reconcile them.

2) if a kernel thread charges kmem on behalf of a client mm then there
   is no good place to call mem_cgroup_oom_synchronize(), short of
   launching a work item in mem_cgroup_oom().  I don't we have anything
   like that yet.  So nothing to worry about.

3) it's debatable if mem_cgroup_oom_synchronize() should first attempt
   reclaim before killing.  But

Re: [PATCH] fs, mm: account filp and names caches to kmemcg

2017-10-09 Thread Greg Thelen

Michal Hocko  wrote:

> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>   names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> - SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>> 
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>
>> 
>> >>   filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> - SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> + SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, 
>> >> NULL);
>> >>   percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>> 
>> Yes, there is a system limit of maximum number of open files. However
>> this limit is shared between different users on the system and one
>> user can hog this resource. To cater that, we set the maximum limit
>> very high and let the memory limit of each user limit the number of
>> files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?
>
> -- 
> Michal Hocko
> SUSE Labs

Even before this patch I find memcg oom handling inconsistent.  Page
cache pages trigger oom killer and may allow caller to succeed once the
kernel retries.  But kmem allocations don't call oom killer.  They
surface errors to user space.  This makes memcg hard to use for memory
overcommit because it's desirable for a high priority task to
transparently kill a lower priority task using the memcg oom killer.

A few ideas on how to make it more flexible:

a) Go back to memcg oom killing within memcg charging.  This runs risk
   of oom killing while caller holds locks which oom victim selection or
   oom victim termination may need.  Google's been running this way for
   a while.

b) Have every syscall return do something similar to page fault handler:
   kmem allocations in oom memcg mark the current task as needing an oom
   check return NULL.  If marked oom, syscall exit would use
   mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
   risky.  I doubt every syscall is compatible with such a restart.

c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
   which will oom kill if needed.

Comments?

Demo program which eventually gets ENOSPC from mkdir.

$ cat /tmp/t
while umount /tmp/mnt; do true; done
mkdir -p /tmp/mnt
mount -t tmpfs nodev /tmp/mnt
cd /dev/cgroup/memory
rmdir t
mkdir t
echo 32M > t/memory.limit_in_bytes
(echo $BASHPID > t/cgroup.procs && cd /tmp/mnt && exec /tmp/mkdirs)

$ cat /tmp/mkdirs.c
#include 
#include 
#include 
#include 
#include 

int main()
{
int i;
char name[32];

if (mlockall(MCL_CURRENT|MCL_FUTURE))
err(1, "mlockall");
for (i = 0; i < (1<<20); i++) {
sprintf(name, "%d", i);
if (mkdir(name, 0700))
err(1, "mkdir");
}
printf("done\n");
return 0;
}

[PATCH] block: tolerate tracing of NULL bio

2017-09-07 Thread Greg Thelen

__get_request() can call trace_block_getrq() with bio=NULL which causes
block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.

Syzkaller fuzzer panics with
linux-next (1d53d908b79d7870d89063062584eead4cf83448):
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault:  [#1] SMP KASAN
  Modules linked in:
  CPU: 0 PID: 2983 Comm: syzkaller40 Not tainted 4.13.0-rc7-next-20170901+ 
#13
  task: 8801cf1da000 task.stack: 8801ce44
  RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
  RSP: 0018:8801ce4473f0 EFLAGS: 00010246
  RAX: 8801cf1da000 RBX: 110039c88e84 RCX: 1d184d27
  RDX: dc01 RSI: 11003b643e7a RDI: e8c26938
  RBP: 8801ce447530 R08: 11003b643e6c R09: e8c26964
  R10: 0002 R11: f9184d2d R12: e8c1f890
  R13: e8c26930 R14: 85cad9e0 R15: 
  FS:  02641880() GS:8801db20() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 0043e670 CR3: 0001d1d7a000 CR4: 001406f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
trace_block_getrq include/trace/events/block.h:423 [inline]
__get_request block/blk-core.c:1283 [inline]
get_request+0x1518/0x23b0 block/blk-core.c:1355
blk_old_get_request block/blk-core.c:1402 [inline]
blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
entry_SYSCALL_64_fastpath+0x1f/0xbe

block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
Only one of them is NULL tolerant.  Favor the NULL tolerant one.

Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and 
partitions index")
Signed-off-by: Greg Thelen 
---
 include/trace/events/block.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index f815aaaef755..1fd7ff1a46f7 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -397,7 +397,6 @@ DECLARE_EVENT_CLASS(block_get_rq,
 
TP_fast_assign(
__entry->dev= bio ? bio_dev(bio) : 0;
-   __entry->dev= bio_dev(bio);
__entry->sector = bio ? bio->bi_iter.bi_sector : 0;
__entry->nr_sector  = bio ? bio_sectors(bio) : 0;
blk_fill_rwbs(__entry->rwbs,
@@ -414,7 +413,7 @@ DECLARE_EVENT_CLASS(block_get_rq,
 /**
  * block_getrq - get a free request entry in queue for block IO operations
  * @q: queue for operations
- * @bio: pending block IO operation
+ * @bio: pending block IO operation (can be %NULL)
  * @rw: low bit indicates a read (%0) or a write (%1)
  *
  * A request struct for queue @q has been allocated to handle the
@@ -430,7 +429,7 @@ DEFINE_EVENT(block_get_rq, block_getrq,
 /**
  * block_sleeprq - waiting to get a free request entry in queue for block IO 
operation
  * @q: queue for operation
- * @bio: pending block IO operation
+ * @bio: pending block IO operation (can be %NULL)
  * @rw: low bit indicates a read (%0) or a write (%1)
  *
  * In the case where a request struct cannot be provided for queue @q
-- 
2.14.1.581.gf28d330327-goog

Re: [PATCH] net/mlx4: suppress 'may be used uninitialized' warning

2017-04-18 Thread Greg Thelen

Leon Romanovsky  wrote:

> [ Unknown signature status ]
> On Mon, Apr 17, 2017 at 11:21:35PM -0700, Greg Thelen wrote:
>> gcc 4.8.4 complains that mlx4_SW2HW_MPT_wrapper() uses an uninitialized
>> 'mpt' variable:
>>   drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 
>> 'mlx4_SW2HW_MPT_wrapper':
>>   drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2802:12: warning: 
>> 'mpt' may be used uninitialized in this function [-Wmaybe-uninitialized]
>>  mpt->mtt = mtt;
>>
>> I think this warning is a false complaint.  mpt is only used when
>> mr_res_start_move_to() return zero, and in all such cases it initializes
>> mpt.
>> But apparently gcc cannot see that.
>>
>> Initialize mpt to avoid the warning.
>>
>> Signed-off-by: Greg Thelen 
>> ---
>
> It looks like other callers of mr_res_start_move_to() have the same
> "uninitialized" variable.
>
> Thanks

The above is the only mellanox warning I see.  So gcc is able to better
analyze the other callsites.

[PATCH] net/mlx4: suppress 'may be used uninitialized' warning

2017-04-17 Thread Greg Thelen

gcc 4.8.4 complains that mlx4_SW2HW_MPT_wrapper() uses an uninitialized
'mpt' variable:
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 
'mlx4_SW2HW_MPT_wrapper':
  drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:2802:12: warning: 'mpt' 
may be used uninitialized in this function [-Wmaybe-uninitialized]
 mpt->mtt = mtt;

I think this warning is a false complaint.  mpt is only used when
mr_res_start_move_to() return zero, and in all such cases it initializes
mpt.  But apparently gcc cannot see that.

Initialize mpt to avoid the warning.

Signed-off-by: Greg Thelen 
---
 drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index d8d5d161b8c7..4aa29ee93013 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -2749,7 +2749,7 @@ int mlx4_SW2HW_MPT_wrapper(struct mlx4_dev *dev, int 
slave,
int err;
int index = vhcr->in_modifier;
struct res_mtt *mtt;
-   struct res_mpt *mpt;
+   struct res_mpt *mpt = NULL;
int mtt_base = mr_get_mtt_addr(inbox->buf) / dev->caps.mtt_entry_sz;
int phys;
int id;
-- 
2.12.2.762.g0e3151a226-goog

[PATCH] slab: avoid IPIs when creating kmem caches

2017-04-16 Thread Greg Thelen

Each slab kmem cache has per cpu array caches.  The array caches are
created when the kmem_cache is created, either via kmem_cache_create()
or lazily when the first object is allocated in context of a kmem
enabled memcg.  Array caches are replaced by writing to /proc/slabinfo.

Array caches are protected by holding slab_mutex or disabling
interrupts.  Array cache allocation and replacement is done by
__do_tune_cpucache() which holds slab_mutex and calls
kick_all_cpus_sync() to interrupt all remote processors which confirms
there are no references to the old array caches.

IPIs are needed when replacing array caches.  But when creating a new
array cache, there's no need to send IPIs because there cannot be any
references to the new cache.  Outside of memcg kmem accounting these
IPIs occur at boot time, so they're not a problem.  But with memcg kmem
accounting each container can create kmem caches, so the IPIs are
wasteful.

Avoid unnecessary IPIs when creating array caches.

Test which reports the IPI count of allocating slab in 1 memcg:
import os

def ipi_count():
with open("/proc/interrupts") as f:
for l in f:
if 'Function call interrupts' in l:
return int(l.split()[1])

def echo(val, path):
with open(path, "w") as f:
f.write(val)

n = 1
os.chdir("/mnt/cgroup/memory")
pid = str(os.getpid())
a = ipi_count()
for i in range(n):
os.mkdir(str(i))
echo("1G\n", "%d/memory.limit_in_bytes" % i)
echo("1G\n", "%d/memory.kmem.limit_in_bytes" % i)
echo(pid, "%d/cgroup.procs" % i)
open("/tmp/x", "w").close()
os.unlink("/tmp/x")
b = ipi_count()
print "%d loops: %d => %d (+%d ipis)" % (n, a, b, b-a)
echo(pid, "cgroup.procs")
for i in range(n):
        os.rmdir(str(i))

patched:   1 loops: 1069 => 1170 (+101 ipis)
unpatched: 1 loops: 1192 => 48933 (+47741 ipis)

Signed-off-by: Greg Thelen 
---
 mm/slab.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/slab.c b/mm/slab.c
index 807d86c76908..1880d482a0cb 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3879,7 +3879,12 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, 
int limit,
 
prev = cachep->cpu_cache;
cachep->cpu_cache = cpu_cache;
-   kick_all_cpus_sync();
+   /*
+* Without a previous cpu_cache there's no need to synchronize remote
+* cpus, so skip the IPIs.
+*/
+   if (prev)
+   kick_all_cpus_sync();
 
check_irq_on();
cachep->batchcount = batchcount;
-- 
2.12.2.762.g0e3151a226-goog

[PATCH 2/2] kasan: add memcg kmem_cache test

2016-12-20 Thread Greg Thelen

Make a kasan test which uses a SLAB_ACCOUNT slab cache.  If the test is
run within a non default memcg, then it uncovers the bug fixed by
"kasan: drain quarantine of memcg slab objects"[1].

If run without fix [1] it shows "Slab cache still has objects", and the
kmem_cache structure is leaked.
Here's an unpatched kernel test:
$ dmesg -c > /dev/null
$ mkdir /sys/fs/cgroup/memory/test
$ echo $$ > /sys/fs/cgroup/memory/test/tasks
$ modprobe test_kasan 2> /dev/null
$ dmesg | grep -B1 still
[ 123.456789] kasan test: memcg_accounted_kmem_cache allocate memcg accounted 
object
[ 124.456789] kmem_cache_destroy test_cache: Slab cache still has objects

Kernels with fix [1] don't have the "Slab cache still has objects"
warning or the underlying leak.

The new test runs and passes in the default (root) memcg, though in the
root memcg it won't uncover the problem fixed by [1].

Signed-off-by: Greg Thelen 
---
 lib/test_kasan.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index fbdf87920093..0b1d3140fbb8 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -11,6 +11,7 @@
 
 #define pr_fmt(fmt) "kasan test: %s " fmt, __func__
 
+#include 
 #include 
 #include 
 #include 
@@ -331,6 +332,38 @@ static noinline void __init kmem_cache_oob(void)
kmem_cache_destroy(cache);
 }
 
+static noinline void __init memcg_accounted_kmem_cache(void)
+{
+   int i;
+   char *p;
+   size_t size = 200;
+   struct kmem_cache *cache;
+
+   cache = kmem_cache_create("test_cache", size, 0, SLAB_ACCOUNT, NULL);
+   if (!cache) {
+   pr_err("Cache allocation failed\n");
+   return;
+   }
+
+   pr_info("allocate memcg accounted object\n");
+   /*
+* Several allocations with a delay to allow for lazy per memcg kmem
+* cache creation.
+*/
+   for (i = 0; i < 5; i++) {
+   p = kmem_cache_alloc(cache, GFP_KERNEL);
+   if (!p) {
+   pr_err("Allocation failed\n");
+   goto free_cache;
+   }
+   kmem_cache_free(cache, p);
+   msleep(100);
+   }
+
+free_cache:
+   kmem_cache_destroy(cache);
+}
+
 static char global_array[10];
 
 static noinline void __init kasan_global_oob(void)
@@ -460,6 +493,7 @@ static int __init kmalloc_tests_init(void)
kmalloc_uaf_memset();
kmalloc_uaf2();
kmem_cache_oob();
+   memcg_accounted_kmem_cache();
kasan_stack_oob();
kasan_global_oob();
ksize_unpoisons_memory();
-- 
2.8.0.rc3.226.g39d4020

[PATCH 1/2] kasan: drain quarantine of memcg slab objects

2016-12-20 Thread Greg Thelen

Per memcg slab accounting and kasan have a problem with kmem_cache
destruction.
- kmem_cache_create() allocates a kmem_cache, which is used for
  allocations from processes running in root (top) memcg.
- Processes running in non root memcg and allocating with either
  __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg kmem_cache.
- Kasan catches use-after-free by having kfree() and kmem_cache_free()
  defer freeing of objects.  Objects are placed in a quarantine.
- kmem_cache_destroy() destroys root and non root kmem_caches.  It takes
  care to drain the quarantine of objects from the root memcg's
  kmem_cache, but ignores objects associated with non root memcg.  This
  causes leaks because quarantined per memcg objects refer to per memcg
  kmem cache being destroyed.

To see the problem:
1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
2) from non root memcg, allocate and free a few objects from cache
3) dispose of the cache with kmem_cache_destroy()
kmem_cache_destroy() will trigger a "Slab cache still has objects"
warning indicating that the per memcg kmem_cache structure was leaked.

Fix the leak by draining kasan quarantined objects allocated from non
root memcg.

Racing memcg deletion is tricky, but handled.  kmem_cache_destroy() =>
shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
flushes per memcg quarantined objects, even if that memcg has been
rmdir'd and gone through memcg_deactivate_kmem_caches().

This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
enabled.  So I don't think it's worth patching stable kernels.

Signed-off-by: Greg Thelen 
---
 include/linux/kasan.h | 4 ++--
 mm/kasan/kasan.c  | 2 +-
 mm/kasan/quarantine.c | 1 +
 mm/slab_common.c  | 4 +++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 820c0ad54a01..c908b25bf5a5 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -52,7 +52,7 @@ void kasan_free_pages(struct page *page, unsigned int order);
 void kasan_cache_create(struct kmem_cache *cache, size_t *size,
unsigned long *flags);
 void kasan_cache_shrink(struct kmem_cache *cache);
-void kasan_cache_destroy(struct kmem_cache *cache);
+void kasan_cache_shutdown(struct kmem_cache *cache);
 
 void kasan_poison_slab(struct page *page);
 void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -98,7 +98,7 @@ static inline void kasan_cache_create(struct kmem_cache 
*cache,
  size_t *size,
  unsigned long *flags) {}
 static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
-static inline void kasan_cache_destroy(struct kmem_cache *cache) {}
+static inline void kasan_cache_shutdown(struct kmem_cache *cache) {}
 
 static inline void kasan_poison_slab(struct page *page) {}
 static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 0e9505f66ec1..8d020ad5b74a 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -428,7 +428,7 @@ void kasan_cache_shrink(struct kmem_cache *cache)
quarantine_remove_cache(cache);
 }
 
-void kasan_cache_destroy(struct kmem_cache *cache)
+void kasan_cache_shutdown(struct kmem_cache *cache)
 {
quarantine_remove_cache(cache);
 }
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index baabaad4a4aa..fb362cb19157 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -273,6 +273,7 @@ static void per_cpu_remove_cache(void *arg)
qlist_free_all(&to_free, cache);
 }
 
+/* Free all quarantined objects belonging to cache. */
 void quarantine_remove_cache(struct kmem_cache *cache)
 {
unsigned long flags;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 329b03843863..d3c8602dea5d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -455,6 +455,9 @@ EXPORT_SYMBOL(kmem_cache_create);
 static int shutdown_cache(struct kmem_cache *s,
struct list_head *release, bool *need_rcu_barrier)
 {
+   /* free asan quarantined objects */
+   kasan_cache_shutdown(s);
+
if (__kmem_cache_shutdown(s) != 0)
return -EBUSY;
 
@@ -715,7 +718,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
get_online_cpus();
get_online_mems();
 
-   kasan_cache_destroy(s);
mutex_lock(&slab_mutex);
 
s->refcount--;
-- 
2.8.0.rc3.226.g39d4020

[PATCH] memcg: fix stale mem_cgroup_force_empty() comment

2016-05-06 Thread Greg Thelen

commit f61c42a7d911 ("memcg: remove tasks/children test from
mem_cgroup_force_empty()") removed memory reparenting from the function.

Fix the function's comment.

Signed-off-by: Greg Thelen 
---
 mm/memcontrol.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe787f5c41bd..19fd76168a05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2636,8 +2636,7 @@ static inline bool memcg_has_children(struct mem_cgroup 
*memcg)
 }
 
 /*
- * Reclaims as many pages from the given memcg as possible and moves
- * the rest to the parent.
+ * Reclaims as many pages from the given memcg as possible.
  *
  * Caller is responsible for holding css reference for memcg.
  */
-- 
2.8.0.rc3.226.g39d4020

Re: [GIT PULL] ext4 bug fixes for 4.6

2016-04-09 Thread Greg Thelen


Theodore Ts'o wrote:

> The following changes since commit 243d50678583100855862bc084b8b307eea67f68:
>
>   Merge branch 'overlayfs-linus' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs (2016-03-22 
> 13:11:15 -0700)
>
n> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
> tags/ext4_for_linus_stable
>
> for you to fetch changes up to c325a67c72903e1cc30e990a15ce745bda0dbfde:
>
>   ext4: ignore quota mount options if the quota feature is enabled 
> (2016-04-03 17:03:37 -0400)
>
> 
> These changes contains a fix for overlayfs interacting with some
> (badly behaved) dentry code in various file systems.  These have been
> reviewed by Al and the respective file system mtinainers and are going
> through the ext4 tree for convenience.
>
> This also has a few ext4 encryption bug fixes that were discovered in
> Android testing (yes, we will need to get these sync'ed up with the
> fs/crypto code; I'll take care of that).  It also has some bug fixes
> and a change to ignore the legacy quota options to allow for xfstests
> regression testing of ext4's internal quota feature and to be more
> consistent with how xfs handles this case.
>
> 
> Dan Carpenter (1):
>   ext4 crypto: fix some error handling
>
> Filipe Manana (1):
>   btrfs: fix crash/invalid memory access on fsync when using overlayfs
>
> Jan Kara (1):
>   ext4: retry block allocation for failed DIO and DAX writes
>
> Miklos Szeredi (4):
>   fs: add file_dentry()
>   nfs: use file_dentry()
>   ext4: use dget_parent() in ext4_file_open()
>   ext4: use file_dentry()
>
> Theodore Ts'o (7):
>   ext4: check if in-inode xattr is corrupted in 
> ext4_expand_extra_isize_ea()
>   ext4 crypto: don't let data integrity writebacks fail with ENOMEM
>   ext4 crypto: use dget_parent() in ext4_d_revalidate()
>   ext4: allow readdir()'s of large empty directories to be interrupted

Ted,

I've been testing 5b5b7fd185e9 (linus/master) and seeing that
interrupted readdir() now returns duplicate dirents.

Reverting commit 1028b55bafb7 ("ext4: allow readdir()'s of large empty
directories to be interrupted") avoids duplicates.

On 5b5b7fd185e9 a SIGPROF to the test program below occasionally causes
"already seen name" error.  On older kernels, it runs indefinitely
without error.

/*
  mkdir /tmp/foo
  cd /tmp/foo
  for i in $(seq 0 599); do touch $i; done
  /tmp/scanner &
  kill -PROF %1
 */

#include 
#include 
#include 
#include 
#include 
#include 

static void handler(int unused)
{
}

int main() {
enum { count = 600 };
char seen[count];
struct dirent *ent;
DIR *dir;
struct sigaction sa = {0};

sa.sa_handler = handler;
if (sigemptyset(&sa.sa_mask))
err(1, "sigemptyset");
sa.sa_flags = SA_RESTART;
if (sigaction(SIGPROF, &sa, NULL))
err(1, "sigaction");

for (;;) {
memset(seen, 0, sizeof(seen));
dir = opendir(".");
if (dir == NULL)
err(1, "opendir(.)");
while ((ent = readdir(dir)) != NULL) {
int idx;

if ((strcmp(ent->d_name, ".") == 0) ||
(strcmp(ent->d_name, "..") == 0)) {
continue;
}
idx = atoi(ent->d_name);
if (idx >= count) {
errx(1, "bogus name %s index %d", ent->d_name, 
idx);
}
if (seen[idx]) {
errx(1, "already seen name %s index %d", 
ent->d_name, idx);
}
seen[idx] = 1;
}
if (closedir(dir))
err(1, "closedir(.)");
}
}

>   ext4: add lockdep annotations for i_data_sem
>   ext4: avoid calling dquot_get_next_id() if quota is not enabled
>   ext4: ignore quota mount options if the quota feature is enabled
>
>  fs/btrfs/file.c|  2 +-
>  fs/dcache.c|  5 -
>  fs/ext4/crypto.c   | 49 +
>  fs/ext4/dir.c  |  5 +
>  fs/ext4/ext4.h | 29 +++--
>  fs/ext4/file.c | 12 
>  fs/ext4/inode.c| 58 
> --
>  fs/ext4/move_extent.c  | 11 +--
>  fs/ext4/namei.c|  5 +
>  fs/ext4/page-io.c  | 14 +-
>  fs/ext4/readpage.c |  2 +-
>  fs/ext4/super.c| 61 
> +++--
>  fs/ext4/xattr.c| 32 
>  fs/nfs/dir.c   |  6

Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path

2015-09-30 Thread Greg Thelen


Vladimir Davydov wrote:

> Currently, to charge a page to kmemcg one should use alloc_kmem_pages
> helper. When the page is not needed anymore it must be freed with
> free_kmem_pages helper, which will uncharge the page before freeing it.
> Such a design is acceptable for thread info pages and kmalloc large
> allocations, which are currently the only users of alloc_kmem_pages, but
> it gets extremely inconvenient if one wants to make use of batched free
> (e.g. to charge page tables - see release_pages) or page reference
> counter (pipe buffers - see anon_pipe_buf_release).
>
> To overcome this limitation, this patch moves kmemcg uncharge code to
> the generic free path and zaps free_kmem_pages helper. To distinguish
> kmem pages from other page types, it makes alloc_kmem_pages initialize
> page->_mapcount to a special value and introduces a new PageKmem helper,
> which returns true if it sees this value.
>
> Signed-off-by: Vladimir Davydov 
> ---
>  include/linux/gfp.h|  3 ---
>  include/linux/page-flags.h | 22 ++
>  kernel/fork.c  |  2 +-
>  mm/page_alloc.c| 26 --
>  mm/slub.c  |  2 +-
>  mm/swap.c  |  3 ++-
>  6 files changed, 34 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index f92cbd2f4450..b46147c45966 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -384,9 +384,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
>  
> -extern void __free_kmem_pages(struct page *page, unsigned int order);
> -extern void free_kmem_pages(unsigned long addr, unsigned int order);
> -
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 416509e26d6d..a190719c2f46 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -594,6 +594,28 @@ static inline void __ClearPageBalloon(struct page *page)
>  }
>  
>  /*
> + * PageKmem() returns true if the page was allocated with alloc_kmem_pages().
> + */
> +#define PAGE_KMEM_MAPCOUNT_VALUE (-512)
> +
> +static inline int PageKmem(struct page *page)
> +{
> + return atomic_read(&page->_mapcount) == PAGE_KMEM_MAPCOUNT_VALUE;
> +}
> +
> +static inline void __SetPageKmem(struct page *page)
> +{
> + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
> + atomic_set(&page->_mapcount, PAGE_KMEM_MAPCOUNT_VALUE);
> +}

What do you think about several special mapcount values for various
types of kmem?

It's helps user and administrators break down memory usage.

A nice equation is:
  memory.usage_in_bytes = memory.stat[file + anon + unevictable + kmem]

Next, it's helpful to be able to breakdown kmem into:
  kmem = stack + pgtable + slab + ...

On one hand (and the kernel I use internally) we can use separate per
memcg counters for each kmem type.  Then reconstitute memory.kmem as
needed by adding them together.  But using keeping a single kernel kmem
counter is workable if there is a way to breakdown the memory charge to
a container (e.g. by walking /proc/kpageflags-ish or per memcg
memory.kpageflags-ish file).

> +static inline void __ClearPageKmem(struct page *page)
> +{
> + VM_BUG_ON_PAGE(!PageKmem(page), page);
> + atomic_set(&page->_mapcount, -1);
> +}
> +
> +/*
>   * If network-based swap is enabled, sl*b must keep track of whether pages
>   * were allocated from pfmemalloc reserves.
>   */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2845623fb582..c23f8a17e99e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -169,7 +169,7 @@ static struct thread_info *alloc_thread_info_node(struct 
> task_struct *tsk,
>  
>  static inline void free_thread_info(struct thread_info *ti)
>  {
> - free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
> + free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_info_cache;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48aaf7b9f253..88d85367c81e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -942,6 +942,10 @@ static bool free_pages_prepare(struct page *page, 
> unsigned int order)
>  
>   if (PageAnon(page))
>   page->mapping = NULL;
> + if (PageKmem(page)) {
> + memcg_kmem_uncharge_pages(page, order);
> + __ClearPageKmem(page);
> + }
>   bad += free_pages_check(page);
>   for (i = 1; i < (1 << order); i++) {
>   if (compound)
> @@ -3434,6 +3438,8 @@ struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned 
> int order)
>   return NULL;
>   page = alloc_pages(gfp_mask, order);
>   memcg_kmem_commit_charge(page, memcg, order);
> + if (page)
> + __SetPageKmem(page);
>   return page;
>  }
>

Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-25 Thread Greg Thelen


Michal Hocko wrote:

> On Tue 22-09-15 15:16:32, Greg Thelen wrote:
>> mem_cgroup_read_stat() returns a page count by summing per cpu page
>> counters.  The summing is racy wrt. updates, so a transient negative sum
>> is possible.  Callers don't want negative values:
>> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
>
> OK, this can confuse dirty throttling AFAIU
>
>> - oom reports and memory.stat shouldn't show confusing negative usage.
>
> I guess this is not earth shattering.
>
>> - tree_usage() already avoids negatives.
>> 
>> Avoid returning negative page counts from mem_cgroup_read_stat() and
>> convert it to unsigned.
>> 
>> Signed-off-by: Greg Thelen 
>
> I guess we want that for stable 4.2 because of the dirty throttling
> part. Longterm we should use generic per-cpu counter.
>
> Acked-by: Michal Hocko 
>
> Thanks!

Correct, this is not an earth shattering patch.  The patch only filters
out negative memcg stat values from mem_cgroup_read_stat() callers.
Negative values should only be temporary due to stat update races.  So
I'm not sure it's worth sending it to stable.  I've heard no reports of
it troubling anyone.  The worst case without this patch is that memcg
temporarily burps up a negative dirty and/or writeback count which
causes balance_dirty_pages() to sleep for a (at most) 200ms nap
(MAX_PAUSE).  Ccing Tejun in case there are more serious consequences to
balance_dirty_pages() occasionally seeing a massive (underflowed) dirty
or writeback count.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] fs, seqfile: always allow oom killer

2015-09-25 Thread Greg Thelen

Since commit 5cec38ac866b ("fs, seq_file: fallback to vmalloc instead of
oom kill processes") seq_buf_alloc() avoids calling the oom killer for
PAGE_SIZE or smaller allocations; but larger allocations can use the oom
killer via vmalloc().  Thus reads of small files can return ENOMEM, but
larger files use the oom killer to avoid ENOMEM.

Memory overcommit requires use of the oom killer to select a victim
regardless of file size.

Enable oom killer for small seq_buf_alloc() allocations.

Signed-off-by: David Rientjes 
Signed-off-by: Greg Thelen 
---
 fs/seq_file.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 225586e141ca..a8e288755f24 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -25,12 +25,17 @@ static void seq_set_overflow(struct seq_file *m)
 static void *seq_buf_alloc(unsigned long size)
 {
void *buf;
+   gfp_t gfp = GFP_KERNEL;
 
/*
-* __GFP_NORETRY to avoid oom-killings with high-order allocations -
-* it's better to fall back to vmalloc() than to kill things.
+* For high order allocations, use __GFP_NORETRY to avoid oom-killing -
+* it's better to fall back to vmalloc() than to kill things.  For small
+* allocations, just use GFP_KERNEL which will oom kill, thus no need
+* for vmalloc fallback.
 */
-   buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+   if (size > PAGE_SIZE)
+   gfp |= __GFP_NORETRY | __GFP_NOWARN;
+   buf = kmalloc(size, gfp);
if (!buf && size > PAGE_SIZE)
buf = vmalloc(size);
return buf;
-- 
2.6.0.rc2.230.g3dd15c0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-23 Thread Greg Thelen


Andrew Morton wrote:

> On Tue, 22 Sep 2015 17:42:13 -0700 Greg Thelen  wrote:
>
>> Andrew Morton wrote:
>> 
>> > On Tue, 22 Sep 2015 15:16:32 -0700 Greg Thelen  wrote:
>> >
>> >> mem_cgroup_read_stat() returns a page count by summing per cpu page
>> >> counters.  The summing is racy wrt. updates, so a transient negative sum
>> >> is possible.  Callers don't want negative values:
>> >> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
>> >> - oom reports and memory.stat shouldn't show confusing negative usage.
>> >> - tree_usage() already avoids negatives.
>> >>
>> >> Avoid returning negative page counts from mem_cgroup_read_stat() and
>> >> convert it to unsigned.
>> >
>> > Someone please remind me why this code doesn't use the existing
>> > percpu_counter library which solved this problem years ago.
>> >
>> >>   for_each_possible_cpu(cpu)
>> >
>> > and which doesn't iterate across offlined CPUs.
>> 
>> I found [1] and [2] discussing memory layout differences between:
>> a) existing memcg hand rolled per cpu arrays of counters
>> vs
>> b) array of generic percpu_counter
>> The current approach was claimed to have lower memory overhead and
>> better cache behavior.
>> 
>> I assume it's pretty straightforward to create generic
>> percpu_counter_array routines which memcg could use.  Possibly something
>> like this could be made general enough could be created to satisfy
>> vmstat, but less clear.
>> 
>> [1] http://www.spinics.net/lists/cgroups/msg06216.html
>> [2] https://lkml.org/lkml/2014/9/11/1057
>
> That all sounds rather bogus to me.  __percpu_counter_add() doesn't
> modify struct percpu_counter at all except for when the cpu-local
> counter overflows the configured batch size.  And for the memcg
> application I suspect we can set the batch size to INT_MAX...

Nod.  The memory usage will be a bit larger, but the code reuse is
attractive.  I dusted off Vladimir's
https://lkml.org/lkml/2014/9/11/710.  Next step is to benchmark it
before posting.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] memcg: remove pcp_counter_lock

2015-09-22 Thread Greg Thelen

Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate
possible cpus instead of online") removed the last use of the per memcg
pcp_counter_lock but forgot to remove the variable.

Kill the vestigial variable.

Signed-off-by: Greg Thelen 
---
 include/linux/memcontrol.h | 1 -
 mm/memcontrol.c| 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e62cb7a..6452ff4c463f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -242,7 +242,6 @@ struct mem_cgroup {
 * percpu counter.
 */
struct mem_cgroup_stat_cpu __percpu *stat;
-   spinlock_t pcp_counter_lock;
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct cg_proto tcp_mem;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba34e09..da21143550c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4179,7 +4179,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg_wb_domain_init(memcg, GFP_KERNEL))
goto out_free_stat;
 
-   spin_lock_init(&memcg->pcp_counter_lock);
return memcg;
 
 out_free_stat:
-- 
2.6.0.rc0.131.gf624c3d

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned

2015-09-22 Thread Greg Thelen

Andrew Morton wrote:

> On Tue, 22 Sep 2015 15:16:32 -0700 Greg Thelen  wrote:
>
>> mem_cgroup_read_stat() returns a page count by summing per cpu page
>> counters.  The summing is racy wrt. updates, so a transient negative sum
>> is possible.  Callers don't want negative values:
>> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
>> - oom reports and memory.stat shouldn't show confusing negative usage.
>> - tree_usage() already avoids negatives.
>>
>> Avoid returning negative page counts from mem_cgroup_read_stat() and
>> convert it to unsigned.
>
> Someone please remind me why this code doesn't use the existing
> percpu_counter library which solved this problem years ago.
>
>>   for_each_possible_cpu(cpu)
>
> and which doesn't iterate across offlined CPUs.

I found [1] and [2] discussing memory layout differences between:
a) existing memcg hand rolled per cpu arrays of counters
vs
b) array of generic percpu_counter
The current approach was claimed to have lower memory overhead and
better cache behavior.

I assume it's pretty straightforward to create generic
percpu_counter_array routines which memcg could use.  Possibly something
like this could be made general enough could be created to satisfy
vmstat, but less clear.

[1] http://www.spinics.net/lists/cgroups/msg06216.html
[2] https://lkml.org/lkml/2014/9/11/1057
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 >

1 - 100 of 209 matches

Mail list logo