Re: [PATCH] shmem, memcg: enable memcg aware shrinker
Kirill Tkhai wrote: > Hi, Greg, > > good finding. See comments below. > > On 01.06.2020 06:22, Greg Thelen wrote: >> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged >> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only >> called when the per-memcg per-node shrinker_map indicates that the >> shrinker may have objects to release to the memcg and node. >> >> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs >> shrinker which advertises per memcg and numa awareness. The shmem >> shrinker releases memory by splitting hugepages that extend beyond >> i_size. >> >> Shmem does not currently set bits in shrinker_map. So, starting with >> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under >> pressure. This leads to undeserved memcg OOM kills. >> Example that reliably sees memcg OOM kill in unpatched kernel: >> FS=/tmp/fs >> CONTAINER=/cgroup/memory/tmpfs_shrinker >> mkdir -p $FS >> mount -t tmpfs -o huge=always nodev $FS >> # Create 1000 MB container, which shouldn't suffer OOM. >> mkdir $CONTAINER >> echo 1000M > $CONTAINER/memory.limit_in_bytes >> echo $BASHPID >> $CONTAINER/cgroup.procs >> # Create 4000 files. Ideally each file uses 4k data page + a little >> # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily >> # fit within container's 1000 MB. But if data pages use 2MB >> # hugepages (due to aggressive huge=always) then files consume 8GB, >> # which hits memcg 1000 MB limit. >> for i in {1..4000}; do >> echo . > $FS/$i >> done >> >> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg >> aware") maintains the per-node per-memcg shrinker bitmap for THP >> shrinker. But there's no such logic in shmem. Make shmem set the >> per-memcg per-node shrinker bits when it modifies inodes to have >> shrinkable pages. >> >> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers >> during memcg shrink_slab()") >> Cc: # 4.19+ >> Signed-off-by: Greg Thelen >> --- >> mm/shmem.c | 61 +++--- >> 1 file changed, 35 insertions(+), 26 deletions(-) >> >> diff --git a/mm/shmem.c b/mm/shmem.c >> index bd8840082c94..e11090f78cb5 100644 >> --- a/mm/shmem.c >> +++ b/mm/shmem.c >> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, >> struct kstat *stat, >> return 0; >> } >> >> +/* >> + * Expose inode and optional page to shrinker as having a possibly >> splittable >> + * hugepage that reaches beyond i_size. >> + */ >> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo, >> + struct inode *inode, struct page *page) >> +{ >> +struct shmem_inode_info *info = SHMEM_I(inode); >> + >> +spin_lock(&sbinfo->shrinklist_lock); >> +/* >> + * _careful to defend against unlocked access to ->shrink_list in >> + * shmem_unused_huge_shrink() >> + */ >> +if (list_empty_careful(&info->shrinklist)) { >> +list_add_tail(&info->shrinklist, &sbinfo->shrinklist); >> +sbinfo->shrinklist_len++; >> +} >> +spin_unlock(&sbinfo->shrinklist_lock); >> + >> +#ifdef CONFIG_MEMCG >> +if (page && PageTransHuge(page)) >> +memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page), >> + inode->i_sb->s_shrink.id); >> +#endif >> +} >> + >> static int shmem_setattr(struct dentry *dentry, struct iattr *attr) >> { >> struct inode *inode = d_inode(dentry); >> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, >> struct iattr *attr) >> * to shrink under memory pressure. 
>> */ >> if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { >> -spin_lock(&sbinfo->shrinklist_lock); >> -/* >> - * _careful to defend against unlocked access to >> - * ->shrink_list in shmem_unused_huge_shrink() >> - */ >> -if (list_empty_careful(&info->shrinklist)) { >> -list_add_tail(&info->shrinkl
Re: [PATCH] selftests: more general make nesting support
On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen wrote: > > selftests can be built from the toplevel kernel makefile (e.g. make > kselftest-all) or directly (make -C tools/testing/selftests all). > > The toplevel kernel makefile explicitly disables implicit rules with > "MAKEFLAGS += -rR", which is passed to tools/testing/selftests. Some > selftest makefiles require implicit make rules, which is why > commit 67d8712dcc70 ("selftests: Fix build failures when invoked from > kselftest target") reenables implicit rules by clearing MAKEFLAGS if > MAKELEVEL=1. > > So far so good. However, if the toplevel makefile is called from an > outer makefile then MAKELEVEL will be elevated, which breaks the > MAKELEVEL equality test. > Example wrapped makefile error: > $ cat ~/Makefile > all: > $(MAKE) defconfig > $(MAKE) kselftest-all > $ make -sf ~/Makefile > futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h > /src/tools/testing/selftests/kselftest.h ../include/futextest.h > ../include/atomic.h ../include/logging.h -lpthread -lrt -o > /src/tools/testing/selftests/futex/functional/futex_wait_timeout > make[4]: futex_wait_timeout.c: Command not found > > Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more > direct side effect of "make -R". This enables arbitrary makefile > nesting. > > Signed-off-by: Greg Thelen > --- > tools/testing/selftests/Makefile | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/tools/testing/selftests/Makefile > b/tools/testing/selftests/Makefile > index 1195bd85af38..289a2e4b3f6f 100644 > --- a/tools/testing/selftests/Makefile > +++ b/tools/testing/selftests/Makefile > @@ -84,10 +84,10 @@ endif > # of the targets gets built. > FORCE_TARGETS ?= > > -# Clear LDFLAGS and MAKEFLAGS if called from main > -# Makefile to avoid test build failures when test > -# Makefile doesn't have explicit build rules. > -ifeq (1,$(MAKELEVEL)) > +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing. This provides > +# implicit rules to sub-test Makefiles which avoids build failures in test > +# Makefile that don't have explicit build rules. > +ifeq (,$(LINK.c)) > override LDFLAGS = > override MAKEFLAGS = > endif > -- > 2.28.0.rc0.142.g3c755180ce-goog Is there any feedback on this patch? It's not high priority but something that will help me make more use of selftests.
Re: [PATCH] selftests: more general make nesting support
Shuah Khan wrote: > On 8/5/20 1:36 PM, Greg Thelen wrote: >> On Tue, Jul 28, 2020 at 12:32 AM Greg Thelen wrote: >>> >>> selftests can be built from the toplevel kernel makefile (e.g. make >>> kselftest-all) or directly (make -C tools/testing/selftests all). >>> >>> The toplevel kernel makefile explicitly disables implicit rules with >>> "MAKEFLAGS += -rR", which is passed to tools/testing/selftests. Some >>> selftest makefiles require implicit make rules, which is why >>> commit 67d8712dcc70 ("selftests: Fix build failures when invoked from >>> kselftest target") reenables implicit rules by clearing MAKEFLAGS if >>> MAKELEVEL=1. >>> >>> So far so good. However, if the toplevel makefile is called from an >>> outer makefile then MAKELEVEL will be elevated, which breaks the >>> MAKELEVEL equality test. >>> Example wrapped makefile error: >>>$ cat ~/Makefile >>>all: >>> $(MAKE) defconfig >>> $(MAKE) kselftest-all >>>$ make -sf ~/Makefile >>> futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h >>> /src/tools/testing/selftests/kselftest.h ../include/futextest.h >>> ../include/atomic.h ../include/logging.h -lpthread -lrt -o >>> /src/tools/testing/selftests/futex/functional/futex_wait_timeout >>>make[4]: futex_wait_timeout.c: Command not found >>> >>> Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more >>> direct side effect of "make -R". This enables arbitrary makefile >>> nesting. >>> >>> Signed-off-by: Greg Thelen >>> --- >>> tools/testing/selftests/Makefile | 8 >>> 1 file changed, 4 insertions(+), 4 deletions(-) >>> >>> diff --git a/tools/testing/selftests/Makefile >>> b/tools/testing/selftests/Makefile >>> index 1195bd85af38..289a2e4b3f6f 100644 >>> --- a/tools/testing/selftests/Makefile >>> +++ b/tools/testing/selftests/Makefile >>> @@ -84,10 +84,10 @@ endif >>> # of the targets gets built. >>> FORCE_TARGETS ?= >>> >>> -# Clear LDFLAGS and MAKEFLAGS if called from main >>> -# Makefile to avoid test build failures when test >>> -# Makefile doesn't have explicit build rules. >>> -ifeq (1,$(MAKELEVEL)) >>> +# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing. This >>> provides >>> +# implicit rules to sub-test Makefiles which avoids build failures in test >>> +# Makefile that don't have explicit build rules. >>> +ifeq (,$(LINK.c)) >>> override LDFLAGS = >>> override MAKEFLAGS = >>> endif >>> -- >>> 2.28.0.rc0.142.g3c755180ce-goog >> >> Is there any feedback on this patch? It's not high priority but something >> that >> will help me make more use of selftests. >> > > No issues with the patch. I will apply it once the merge window ends. > > thanks, > -- Shuah Great. Thank you.
Re: + revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch added to -mm tree
a...@linux-foundation.org wrote: > The patch titled > Subject: revert "mm/filemap: add static for function > __add_to_page_cache_locked" > has been added to the -mm tree. Its filename is > > revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > > This patch should soon appear at > > https://ozlabs.org/~akpm/mmots/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > and later at > > https://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > > Before you just go and hit "reply", please: >a) Consider who else should be cc'ed >b) Prefer to cc a suitable mailing list as well >c) Ideally: find the original patch on the mailing list and do a > reply-to-all to that, adding suitable additional cc's > > *** Remember to use Documentation/process/submit-checklist.rst when testing > your code *** > > The -mm tree is included into linux-next and is updated > there every 3-4 working days > > -- > From: Andrew Morton > Subject: revert "mm/filemap: add static for function > __add_to_page_cache_locked" > > Revert 3351b16af494 ("mm/filemap: add static for function > __add_to_page_cache_locked") due to build issues with > CONFIG_DEBUG_INFO_BTF=y. > > Link: > https://lkml.kernel.org/r/caadnvqj6tmzbxvtrobueh6qa0h+q7yaskxrvvvxhqr3kbzd...@mail.gmail.com > Cc: Michal Kubecek > Cc: Justin Forbes > Cc: Alex Shi > Cc: Souptick Joarder > Cc: Alexei Starovoitov > Cc: Daniel Borkmann > Cc: Josef Bacik > Signed-off-by: Andrew Morton Thanks. Tested-by: Greg Thelen > --- > > mm/filemap.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- > a/mm/filemap.c~revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked > +++ a/mm/filemap.c > @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page > } > EXPORT_SYMBOL_GPL(replace_page_cache_page); > > -static noinline int __add_to_page_cache_locked(struct page *page, > +noinline int __add_to_page_cache_locked(struct page *page, > struct address_space *mapping, > pgoff_t offset, gfp_t gfp, > void **shadowp) > _ > > Patches currently in -mm which might be from a...@linux-foundation.org are > > revert-mm-filemap-add-static-for-function-__add_to_page_cache_locked.patch > kthread_worker-document-cpu-hotplug-handling-fix.patch > mm.patch > mm-remove-the-unuseful-else-after-a-return-checkpatch-fixes.patch > mm-prevent-gup_fast-from-racing-with-cow-during-fork-checkpatch-fixes.patch > mm-swap_state-skip-meaningless-swap-cache-readahead-when-ra_infowin-==-0-fix.patch > mm-vmallocc-__vmalloc_area_node-avoid-32-bit-overflow.patch > mm-cma-improve-pr_debug-log-in-cma_release-fix.patch > mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch > lib-cmdline_kunit-add-a-new-test-suite-for-cmdline-api-fix.patch > ilog2-improve-ilog2-for-constant-arguments-checkpatch-fixes.patch > lib-test_bitmapc-add-for_each_set_clump-test-cases-checkpatch-fixes.patch > checkpatch-fix-typo_spelling-check-for-words-with-apostrophe-fix.patch > resource-fix-kernel-doc-markups-checkpatch-fixes.patch > linux-next-rejects.patch > kmap-stupid-hacks-to-make-it-compile.patch > init-kconfig-dont-assume-scripts-lld-versionsh-is-executable.patch > set_memory-allow-set_direct_map__noflush-for-multiple-pages-fix.patch > arch-mm-wire-up-memfd_secret-system-call-were-relevant-fix.patch > kernel-forkc-export-kernel_thread-to-modules.patch
Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages
Dave Hansen wrote: > From: Dave Hansen > > Anonymous pages are kept on their own LRU(s). These lists could > theoretically always be scanned and maintained. But, without swap, > there is currently nothing the kernel can *do* with the results of a > scanned, sorted LRU for anonymous pages. > > A check for '!total_swap_pages' currently serves as a valid check as > to whether anonymous LRUs should be maintained. However, another > method will be added shortly: page demotion. > > Abstract out the 'total_swap_pages' checks into a helper, give it a > logically significant name, and check for the possibility of page > demotion. Reviewed-by: Greg Thelen > Signed-off-by: Dave Hansen > Cc: David Rientjes > Cc: Huang Ying > Cc: Dan Williams > Cc: David Hildenbrand > Cc: osalvador > --- > > b/mm/vmscan.c | 28 +--- > 1 file changed, 25 insertions(+), 3 deletions(-) > > diff -puN mm/vmscan.c~mm-vmscan-anon-can-be-aged mm/vmscan.c > --- a/mm/vmscan.c~mm-vmscan-anon-can-be-aged 2021-03-04 15:35:58.935806422 > -0800 > +++ b/mm/vmscan.c 2021-03-04 15:35:58.942806422 -0800 > @@ -2517,6 +2517,26 @@ out: > } > } > > +/* > + * Anonymous LRU management is a waste if there is > + * ultimately no way to reclaim the memory. > + */ > +bool anon_should_be_aged(struct lruvec *lruvec) Should this be static? > +{ > + struct pglist_data *pgdat = lruvec_pgdat(lruvec); > + > + /* Aging the anon LRU is valuable if swap is present: */ > + if (total_swap_pages > 0) > + return true; > + > + /* Also valuable if anon pages can be demoted: */ > + if (next_demotion_node(pgdat->node_id) >= 0) > + return true; > + > + /* No way to reclaim anon pages. Should not age anon LRUs: */ > + return false; > +} > + > static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) > { > unsigned long nr[NR_LRU_LISTS]; > @@ -2626,7 +2646,8 @@ static void shrink_lruvec(struct lruvec >* Even if we did not try to evict anon pages at all, we want to >* rebalance the anon lru active/inactive ratio. >*/ > - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) > + if (anon_should_be_aged(lruvec) && > + inactive_is_low(lruvec, LRU_INACTIVE_ANON)) > shrink_active_list(SWAP_CLUSTER_MAX, lruvec, > sc, LRU_ACTIVE_ANON); > } > @@ -3455,10 +3476,11 @@ static void age_active_anon(struct pglis > struct mem_cgroup *memcg; > struct lruvec *lruvec; > > - if (!total_swap_pages) > + lruvec = mem_cgroup_lruvec(NULL, pgdat); > + > + if (!anon_should_be_aged(lruvec)) > return; > > - lruvec = mem_cgroup_lruvec(NULL, pgdat); > if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON)) > return; > > _
Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked
Alex Shi wrote: > 在 2020/11/11 上午3:50, Andrew Morton 写道: >> On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder >> wrote: >> >>> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi wrote: Otherwise it cause gcc warning: ^~~ ../mm/filemap.c:830:14: warning: no previous prototype for ‘__add_to_page_cache_locked’ [-Wmissing-prototypes] noinline int __add_to_page_cache_locked(struct page *page, ^~ >>> >>> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ? >> >> hm, yes. > > When the config enabled, compiling looks good untill pahole tool > used to get BTF info, but I still failed on a right version pahole >> 1.16. Sorry. I'm seeing an issue with this patch. My build system has pahole v1.17, but I don't think the pahole version is key. $ git checkout 3351b16af494 # recently added to linus/master $ make defconfig $ make menuconfig # set CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF $ make V=1 + ./tools/bpf/resolve_btfids/resolve_btfids vmlinux FAILED unresolved symbol __add_to_page_cache_locked Reverting 3351b16af494 ("mm/filemap: add static for function __add_to_page_cache_locked") fixes the issue. I don't see the warning which motivated this patch, but maybe it requires particular a .config or gcc version. Perhaps adding a __add_to_page_cache_locked() prototype would meet avoid it. But I haven't studied enough on BTF to know if there's a better answer. Signed-off-by: Alex Shi Cc: Andrew Morton Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index d90614f501da..249cf489f5df 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) } EXPORT_SYMBOL_GPL(replace_page_cache_page); -noinline int __add_to_page_cache_locked(struct page *page, +static noinline int __add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp, void **shadowp) >> >> It's unclear to me whether BTF_ID() requires that the target symbol be >> non-static. It doesn't actually reference the symbol: >> >> #define BTF_ID(prefix, name) \ >> __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__)) >> > > The above usage make me thought BTF don't require the symbol, though > the symbol still exist in vmlinux with 'static'. > > So any comments of this, Alexei? > >> Alexei, can you please comment? >>
Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints
Axel Rasmussen wrote: > On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt wrote: >> >> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen >> wrote: >> > >> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug >> > is that an ongoing trace event might race with the tracepoint being >> > disabled (and therefore the _unreg() callback being called). Consider >> > this ordering: >> > >> > T1: trace event fires, get_mm_memcg_path() is called >> > T1: get_memcg_path_buf() returns a buffer pointer >> > T2: trace_mmap_lock_unreg() is called, buffers are freed >> > T1: cgroup_path() is called with the now-freed buffer >> >> Any reason to use the cgroup_path instead of the cgroup_ino? There are >> other examples of trace points using cgroup_ino and no need to >> allocate buffers. Also cgroup namespace might complicate the path >> usage. > > Hmm, so in general I would love to use a numeric identifier instead of a > string. > > I did some reading, and it looks like the cgroup_ino() mainly has to > do with writeback, instead of being just a general identifier? > https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > There is cgroup_id() which I think is almost what I'd want, but there > are a couple problems with it: > > - I don't know of a way for userspace to translate IDs -> paths, to > make them human readable? The id => name map can be built from user space with a tree walk. Example: $ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main] 20387 init.scope 31 system.slice > - Also I think the ID implementation we use for this is "dense", > meaning if a cgroup is removed, its ID is likely to be quickly reused. > >> >> > >> > The solution in this commit is to modify trace_mmap_lock_unreg() to >> > first stop new buffers from being handed out, and then to wait (spin) >> > until any existing buffer references are dropped (i.e., those trace >> > events complete). >> > >> > I have a simple reproducer program which spins up two pools of threads, >> > doing the following in a tight loop: >> > >> > Pool 1: >> > mmap(NULL, 4096, PROT_READ | PROT_WRITE, >> >MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) >> > munmap() >> > >> > Pool 2: >> > echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable >> > echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable >> > >> > This triggers the use-after-free very quickly. With this patch, I let it >> > run for an hour without any BUGs. >> > >> > While fixing this, I also noticed and fixed a css ref leak. Previously >> > we called get_mem_cgroup_from_mm(), but we never called css_put() to >> > release that reference. get_mm_memcg_path() now does this properly. >> > >> > [1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a >> > >> > Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock >> > acquisition") >> >> The original patch is in mm tree, so the SHA1 is not stabilized. >> Usually Andrew squash the fixes into the original patches. > > Ah, I added this because it also shows up in linux-next, under the > next-20201130 tag. I'll remove it in v2, squashing is fine. 
:) > >> >> > Signed-off-by: Axel Rasmussen >> > --- >> > mm/mmap_lock.c | 100 + >> > 1 file changed, 85 insertions(+), 15 deletions(-) >> > >> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c >> > index 12af8f1b8a14..be38dc58278b 100644 >> > --- a/mm/mmap_lock.c >> > +++ b/mm/mmap_lock.c >> > @@ -3,6 +3,7 @@ >> > #include >> > >> > #include >> > +#include >> > #include >> > #include >> > #include >> > @@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released); >> > #ifdef CONFIG_MEMCG >> > >> > /* >> > - * Our various events all share the same buffer (because we don't want or >> > need >> > - * to allocate a set of buffers *per event type*), so we need to protect >> > against >> > - * concurrent _reg() and _unreg() calls, and count how many _reg() calls >> > have >> > - * been made. >> > + * This is unfortunately complicated... _reg() and _unreg() may be called >> > + * in parallel, separately for each of our three event types. To save >> > memory, >> > + * all of the event types share the same buffers. Furthermore, trace >> > events >> > + * might happen in parallel with _unreg(); we need to ensure we don't >> > free the >> > + * buffers before all inflights have finished. Because these events happen >> > + * "frequently", we also want to prevent new inflights from starting once >> > the >> > + * _unreg() process begins. And, for performance reasons, we want to >> > avoid any >> > + * locking in the trace event path. >> > + * >> > + * So: >> > + * >> > + * - Use a spinlock to serialize _reg() and _unreg() calls. >> > + * - Keep track of nested _reg() calls with a lock-protected counter. >> > + * - Define a flag indicating whether or not unregistration
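As an aside for readers following the synchronization scheme described above, here is a minimal userspace illustration of the "stop new users, then wait for inflight users to drain" idea. This is my own sketch using C11 atomics rather than the spinlock and counters the actual patch uses, and all names are invented for the example:

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool buffers_registered = true;
static atomic_int  inflight;

/* Trace-event side: publish ourselves as inflight *before* checking the
 * flag, so the unregister path either waits for us or we back off without
 * ever touching the buffers. */
static bool get_buf_ref(void)
{
	atomic_fetch_add(&inflight, 1);
	if (!atomic_load(&buffers_registered)) {
		atomic_fetch_sub(&inflight, 1);	/* raced with _unreg() */
		return false;
	}
	return true;	/* buffer may be used until put_buf_ref() */
}

static void put_buf_ref(void)
{
	atomic_fetch_sub(&inflight, 1);
}

/* _unreg() side: stop handing out buffers first, then spin until every
 * existing user has dropped its reference; only then free the buffers. */
static void unreg_and_wait(void)
{
	atomic_store(&buffers_registered, false);
	while (atomic_load(&inflight) > 0)
		sched_yield();
	/* safe to free the buffers here */
}

int main(void)
{
	if (get_buf_ref()) {
		/* fill the buffer and emit the trace event here */
		put_buf_ref();
	}
	unreg_and_wait();
	return 0;
}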
Re: [PATCH v2 1/3] percpu: add test module for various percpu operations
On Mon, Nov 04 2013, Andrew Morton wrote: > On Sun, 27 Oct 2013 10:30:15 -0700 Greg Thelen wrote: > >> Tests various percpu operations. > > Could you please take a look at the 32-bit build (this is i386): > > lib/percpu_test.c: In function 'percpu_test_init': > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:61: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:70: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:89: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:97: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type > lib/percpu_test.c:112: warning: integer constant is too large for 'long' type I was using gcc 4.6 which apparently adds LL suffix as needed. Though there were some other code problems with 32 bit beyond missing suffixes. Fixed version below tested with both gcc 4.4 and gcc 4.6 on 32 and 64 bit x86. ---8<--- >From a95bb1ce42b4492644fa10c7c80fd9bbd7bf23b9 Mon Sep 17 00:00:00 2001 In-Reply-To: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org> References: <20131104160918.0c571b410cf165e9c4b4a...@linux-foundation.org> From: Greg Thelen Date: Sun, 27 Oct 2013 10:30:15 -0700 Subject: [PATCH v2] percpu: add test module for various percpu operations Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. 
Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- Changelog since v1: - use %lld/x which allows for less casting - fix 32 bit build by casting large constants lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 094f3152ec2b..1891eb271adf 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate "Per cpu operations test" + depends on m && DEBUG_KERNEL + help + Enable this option to build test module which validates per-cpu + operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool "Perform an atomic64_t self-test at boot" help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb98adf..bb016e116ba4 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
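For anyone who has not run into these warnings, a standalone example (mine, not taken from the test module) of the 32-bit problem and the two fixes named in the v2 changelog:

#include <stdio.h>

int main(void)
{
	/*
	 * On a 32-bit build with an older gcc (e.g. 4.4), an unsuffixed
	 * constant that does not fit in a 32-bit long produces:
	 *     warning: integer constant is too large for 'long' type
	 */
	long long no_suffix = 0x100000000;

	/* An explicit LL suffix (or a cast) avoids the warning; this is
	 * the "casting large constants" part of v2. */
	long long with_suffix = 0x100000000LL;

	/* Printing via %lld with long long values needs no per-arch
	 * casting; this is the "use %lld/x" part of v2. */
	printf("%lld %lld\n", no_suffix, with_suffix);
	return 0;
}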
Re: [PATCH -mm v2 2/2] mm: get rid of __GFP_KMEMCG
On Tue, Apr 01 2014, Vladimir Davydov wrote: > Currently to allocate a page that should be charged to kmemcg (e.g. > threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page > allocated is then to be freed by free_memcg_kmem_pages. Apart from > looking asymmetrical, this also requires intrusion to the general > allocation path. So let's introduce separate functions that will > alloc/free pages charged to kmemcg. > > The new functions are called alloc_kmem_pages and free_kmem_pages. They > should be used when the caller actually would like to use kmalloc, but > has to fall back to the page allocator for the allocation is large. They > only differ from alloc_pages and free_pages in that besides allocating > or freeing pages they also charge them to the kmem resource counter of > the current memory cgroup. > > Signed-off-by: Vladimir Davydov > --- > include/linux/gfp.h | 10 --- > include/linux/memcontrol.h |2 +- > include/linux/slab.h| 11 > include/linux/thread_info.h |2 -- > include/trace/events/gfpflags.h |1 - > kernel/fork.c |6 ++--- > mm/page_alloc.c | 56 > --- > mm/slab_common.c| 12 + > mm/slub.c |6 ++--- > 9 files changed, 60 insertions(+), 46 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 39b81dc7d01a..d382db71e300 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -31,7 +31,6 @@ struct vm_area_struct; > #define ___GFP_HARDWALL 0x2u > #define ___GFP_THISNODE 0x4u > #define ___GFP_RECLAIMABLE 0x8u > -#define ___GFP_KMEMCG0x10u > #define ___GFP_NOTRACK 0x20u > #define ___GFP_NO_KSWAPD 0x40u > #define ___GFP_OTHER_NODE0x80u > @@ -91,7 +90,6 @@ struct vm_area_struct; > > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of > other node */ > -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from > a memcg-accounted resource */ > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to > dirty page */ > > /* > @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int > order, > #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ > alloc_pages_vma(gfp_mask, 0, vma, addr, node) > > +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); > +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, > + unsigned int order); > + > extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); > extern unsigned long get_zeroed_page(gfp_t gfp_mask); > > @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int > order); > extern void free_hot_cold_page(struct page *page, int cold); > extern void free_hot_cold_page_list(struct list_head *list, int cold); > > -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); > -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); > +extern void __free_kmem_pages(struct page *page, unsigned int order); > +extern void free_kmem_pages(unsigned long addr, unsigned int order); > > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 29068dd26c3d..13acdb5259f5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup > **memcg, int order) >* res_counter_charge_nofail, but we hope those allocations are rare, >* and won't be worth the trouble. 
>*/ > - if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL)) > + if (gfp & __GFP_NOFAIL) > return true; > if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD)) > return true; > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 3dd389aa91c7..6d6959292e00 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -358,17 +358,6 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s, > #include > #endif > > -static __always_inline void * > -kmalloc_order(size_t size, gfp_t flags, unsigned int order) > -{ > - void *ret; > - > - flags |= (__GFP_COMP | __GFP_KMEMCG); > - ret = (void *) __get_free_pages(flags, order); > - kmemleak_alloc(ret, size, 1, flags); > - return ret; > -} > - Removing this from the header file breaks builds without CONFIG_TRACING. Example: % make allnoconfig && make -j4 mm/ [...] include/linux/slab.h: In function ‘kmalloc_order_trace’: include/linux/slab.h:367:2: error: imp
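For context on the breakage: when CONFIG_TRACING=n, slab.h wraps kmalloc_order() in a static inline kmalloc_order_trace(). Roughly (a reconstruction of that era's header, not a verbatim quote):

#ifdef CONFIG_TRACING
extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order);
#else
static __always_inline void *
kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
{
	/* With the inline kmalloc_order() definition removed and no
	 * prototype left behind, this call has no declaration in scope on
	 * CONFIG_TRACING=n builds, which is presumably the truncated
	 * implicit-declaration error quoted above. */
	return kmalloc_order(size, flags, order);
}
#endif

Keeping a forward declaration of kmalloc_order() in the header is enough to fix it, which is what the v2.1 repost further down does ("add missing kmalloc_order forward declaration").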
Re: [PATCH] ipc,shm: increase default size for shmmax
On Tue, Apr 01 2014, Davidlohr Bueso wrote: > On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote: >> >> > Ah-hah, that's interesting info. >> >> > >> >> > Let's make the default 64GB? >> >> >> >> 64GB is infinity at that time, but it no longer near infinity today. I >> >> like >> >> very large or total memory proportional number. >> > >> > So I still like 0 for unlimited. Nice, clean and much easier to look at >> > than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really >> > don't see any disadvantages, as opposed to some other arbitrary value. >> > Furthermore it wouldn't break userspace: any existing sysctl would >> > continue to work, and if not set, the user never has to worry about this >> > tunable again. >> > >> > Please let me know if you all agree with this... >> >> Surething. Why not. :) > > *sigh* actually, the plot thickens a bit with SHMALL (total size of shm > segments system wide, in pages). Currently by default: > > #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16)) > > This deals with physical memory, at least admins are recommended to set > it to some large percentage of ram / pagesize. So I think that if we > loose control over the default value, users can potentially DoS the > system, or at least cause excessive swapping if not manually set, but > then again the same goes for anon mem... so do we care? At least when there's an egregious anon leak the oom killer has the power to free the memory by killing until the memory is unreferenced. This isn't true for shm or tmpfs. So shm is more effective than anon at crushing a machine.
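For concreteness, plugging the then-current defaults into that formula (SHMMAX = 32 MB = 33554432 bytes, SHMMNI = 4096, 4 KiB pages; my arithmetic, not from the thread):

	SHMALL = (SHMMAX / PAGE_SIZE) * (SHMMNI / 16)
	       = (33554432 / 4096) * (4096 / 16)
	       = 8192 * 256
	       = 2097152 pages, i.e. 8 GB worth of 4 KiB pages

Note that the default depends only on SHMMAX, the page size and SHMMNI, not on installed RAM.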
Re: [PATCH -mm v2 1/2] sl[au]b: charge slabs to kmemcg explicitly
On Tue, Apr 01 2014, Vladimir Davydov wrote: > We have only a few places where we actually want to charge kmem so > instead of intruding into the general page allocation path with > __GFP_KMEMCG it's better to explictly charge kmem there. All kmem > charges will be easier to follow that way. > > This is a step towards removing __GFP_KMEMCG. It removes __GFP_KMEMCG > from memcg caches' allocflags. Instead it makes slab allocation path > call memcg_charge_kmem directly getting memcg to charge from the cache's > memcg params. > > This also eliminates any possibility of misaccounting an allocation > going from one memcg's cache to another memcg, because now we always > charge slabs against the memcg the cache belongs to. That's why this > patch removes the big comment to memcg_kmem_get_cache. > > Signed-off-by: Vladimir Davydov Acked-by: Greg Thelen
Re: [PATCH] ipc,shm: increase default size for shmmax
On Tue, Apr 01 2014, Kamezawa Hiroyuki wrote: >> On Tue, Apr 01 2014, Davidlohr Bueso wrote: >> >>> On Tue, 2014-04-01 at 19:56 -0400, KOSAKI Motohiro wrote: >>>>>>> Ah-hah, that's interesting info. >>>>>>> >>>>>>> Let's make the default 64GB? >>>>>> >>>>>> 64GB is infinity at that time, but it no longer near infinity today. I >>>>>> like >>>>>> very large or total memory proportional number. >>>>> >>>>> So I still like 0 for unlimited. Nice, clean and much easier to look at >>>>> than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really >>>>> don't see any disadvantages, as opposed to some other arbitrary value. >>>>> Furthermore it wouldn't break userspace: any existing sysctl would >>>>> continue to work, and if not set, the user never has to worry about this >>>>> tunable again. >>>>> >>>>> Please let me know if you all agree with this... >>>> >>>> Surething. Why not. :) >>> >>> *sigh* actually, the plot thickens a bit with SHMALL (total size of shm >>> segments system wide, in pages). Currently by default: >>> >>> #define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16)) >>> >>> This deals with physical memory, at least admins are recommended to set >>> it to some large percentage of ram / pagesize. So I think that if we >>> loose control over the default value, users can potentially DoS the >>> system, or at least cause excessive swapping if not manually set, but >>> then again the same goes for anon mem... so do we care? >> > (2014/04/02 10:08), Greg Thelen wrote: >> >> At least when there's an egregious anon leak the oom killer has the >> power to free the memory by killing until the memory is unreferenced. >> This isn't true for shm or tmpfs. So shm is more effective than anon at >> crushing a machine. > > Hm..sysctl.kernel.shm_rmid_forced won't work with oom-killer ? > > http://www.openwall.com/lists/kernel-hardening/2011/07/26/7 > > I like to handle this kind of issue under memcg but hmm..tmpfs's limit is half > of memory at default. Ah, yes. I forgot about shm_rmid_forced. Thanks. It would give the oom killer ability to cleanup shm (as it does with anon) when shm_rmid_forced=1. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v2.1] mm: get rid of __GFP_KMEMCG
On Tue, Apr 01 2014, Vladimir Davydov wrote: > Currently to allocate a page that should be charged to kmemcg (e.g. > threadinfo), we pass __GFP_KMEMCG flag to the page allocator. The page > allocated is then to be freed by free_memcg_kmem_pages. Apart from > looking asymmetrical, this also requires intrusion to the general > allocation path. So let's introduce separate functions that will > alloc/free pages charged to kmemcg. > > The new functions are called alloc_kmem_pages and free_kmem_pages. They > should be used when the caller actually would like to use kmalloc, but > has to fall back to the page allocator for the allocation is large. They > only differ from alloc_pages and free_pages in that besides allocating > or freeing pages they also charge them to the kmem resource counter of > the current memory cgroup. > > Signed-off-by: Vladimir Davydov One comment nit below, otherwise looks good to me. Acked-by: Greg Thelen > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Glauber Costa > Cc: Christoph Lameter > Cc: Pekka Enberg > --- > Changes in v2.1: > - add missing kmalloc_order forward declaration; lacking it caused >compilation breakage with CONFIG_TRACING=n > > include/linux/gfp.h | 10 --- > include/linux/memcontrol.h |2 +- > include/linux/slab.h| 11 +--- > include/linux/thread_info.h |2 -- > include/trace/events/gfpflags.h |1 - > kernel/fork.c |6 ++--- > mm/page_alloc.c | 56 > --- > mm/slab_common.c| 12 + > mm/slub.c |6 ++--- > 9 files changed, 61 insertions(+), 45 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 39b81dc7d01a..d382db71e300 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -31,7 +31,6 @@ struct vm_area_struct; > #define ___GFP_HARDWALL 0x2u > #define ___GFP_THISNODE 0x4u > #define ___GFP_RECLAIMABLE 0x8u > -#define ___GFP_KMEMCG0x10u > #define ___GFP_NOTRACK 0x20u > #define ___GFP_NO_KSWAPD 0x40u > #define ___GFP_OTHER_NODE0x80u > @@ -91,7 +90,6 @@ struct vm_area_struct; > > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of > other node */ > -#define __GFP_KMEMCG ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from > a memcg-accounted resource */ > #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to > dirty page */ > > /* > @@ -353,6 +351,10 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int > order, > #define alloc_page_vma_node(gfp_mask, vma, addr, node) \ > alloc_pages_vma(gfp_mask, 0, vma, addr, node) > > +extern struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned int order); > +extern struct page *alloc_kmem_pages_node(int nid, gfp_t gfp_mask, > + unsigned int order); > + > extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order); > extern unsigned long get_zeroed_page(gfp_t gfp_mask); > > @@ -372,8 +374,8 @@ extern void free_pages(unsigned long addr, unsigned int > order); > extern void free_hot_cold_page(struct page *page, int cold); > extern void free_hot_cold_page_list(struct list_head *list, int cold); > > -extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); > -extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); > +extern void __free_kmem_pages(struct page *page, unsigned int order); > +extern void free_kmem_pages(unsigned long addr, unsigned int order); > > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > 
index 29068dd26c3d..13acdb5259f5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -543,7 +543,7 @@ memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup > **memcg, int order) >* res_counter_charge_nofail, but we hope those allocations are rare, >* and won't be worth the trouble. >*/ Just a few lines higher in first memcg_kmem_newpage_charge() comment, there is a leftover reference to GFP_KMEMCG which should be removed. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] dm bufio: fully initialize shrinker
1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to struct shrinker assuming that all shrinkers were zero filled. The dm bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data in flags. So far the only defined flags bit is SHRINKER_NUMA_AWARE. But there are proposed patches which add other bits to shrinker.flags (e.g. memcg awareness). Rather than simply initializing the shrinker, this patch uses kzalloc() when allocating the dm_bufio_client to ensure that the embedded shrinker and any other similar structures are zeroed. This fixes theoretical over aggressive shrinking of dm bufio objects. If the uninitialized dm_bufio_client.shrinker.flags contains SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for each numa node rather than just once. This has been broken since 3.12. Signed-off-by: Greg Thelen --- drivers/md/dm-bufio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index 4e84095833db..d724459860d9 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1541,7 +1541,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign BUG_ON(block_size < 1 << SECTOR_SHIFT || (block_size & (block_size - 1))); - c = kmalloc(sizeof(*c), GFP_KERNEL); + c = kzalloc(sizeof(*c), GFP_KERNEL); if (!c) { r = -ENOMEM; goto bad_client; -- 2.0.0.526.g5318336
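A userspace analogue of the bug class being fixed (illustrative only; the struct names below are invented):

#include <stdio.h>
#include <stdlib.h>

struct shrinker_like { unsigned long flags; };		/* stands in for struct shrinker */
struct client_like { struct shrinker_like shrinker; };	/* stands in for struct dm_bufio_client */

int main(void)
{
	/* kmalloc()-style: the embedded flags field inherits whatever was
	 * in the allocation, so stray bits like SHRINKER_NUMA_AWARE can
	 * appear set by accident. */
	struct client_like *a = malloc(sizeof(*a));

	/* kzalloc()-style: every field, including ones added to embedded
	 * structures after this caller was written, starts out zero. */
	struct client_like *b = calloc(1, sizeof(*b));

	printf("zero-filled flags: %lu\n", b->shrinker.flags);
	free(a);
	free(b);
	return 0;
}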
Re: [RFC] memory cgroup: weak points of kmem accounting design
On Tue, Sep 16 2014, Vladimir Davydov wrote: > Hi Suleiman, > > On Mon, Sep 15, 2014 at 12:13:33PM -0700, Suleiman Souhlal wrote: >> On Mon, Sep 15, 2014 at 3:44 AM, Vladimir Davydov >> wrote: >> > Hi, >> > >> > I'd like to discuss downsides of the kmem accounting part of the memory >> > cgroup controller and a possible way to fix them. I'd really appreciate >> > if you could share your thoughts on it. >> > >> > The idea lying behind the kmem accounting design is to provide each >> > memory cgroup with its private copy of every kmem_cache and list_lru >> > it's going to use. This is implemented by bundling these structures with >> > arrays storing per-memcg copies. The arrays are referenced by css id. >> > When a process in a cgroup tries to allocate an object from a kmem cache >> > we first find out which cgroup the process resides in, then look up the >> > cache copy corresponding to the cgroup, and finally allocate a new >> > object from the private cache. Similarly, on addition/deletion of an >> > object from a list_lru, we first obtain the kmem cache the object was >> > allocated from, then look up the memory cgroup which the cache belongs >> > to, and finally add/remove the object from the private copy of the >> > list_lru corresponding to the cgroup. >> > >> > Though simple it looks from the first glance, it has a number of serious >> > weaknesses: >> > >> > - Providing each memory cgroup with its own kmem cache increases >> >external fragmentation. >> >> I haven't seen any evidence of this being a problem (but that doesn't >> mean it doesn't exist). > > Actually, it's rather speculative. For example, if we have say a hundred > of extra objects per cache (fragmented or in per-cpu stocks) of size 256 > bytes, then for one cache the overhead would be 25K, which is > negligible. Now if there are thousand cgroups using the cache, we have > to pay 25M, which is noticeable. Anyway, to estimate this exactly, one > needs to run a typical workload inside a cgroup. > >> >> > - SLAB isn't ready to deal with thousands of caches: its algorithm >> >walks over all system caches and shrinks them periodically, which may >> >be really costly if we have thousands active memory cgroups. >> >> This could be throttled. > > It could be, but then we'd have more objects in per-cpu stocks, which > means more memory overhead. > >> >> > >> > - Caches may now be created/destroyed frequently and from various >> >places: on system cache destruction, on cgroup offline, from a work >> >struct scheduled by kmalloc. Synchronizing them properly is really >> >difficult. I've fixed some places, but it's still desperately buggy. >> >> Agreed. >> >> > - It's hard to determine when we should destroy a cache that belongs to >> >a dead memory cgroup. The point is both SLAB and SLUB implementations >> >always keep some pages in stock for performance reasons, so just >> >scheduling cache destruction work from kfree once the last slab page >> >is freed isn't enough - it will normally never happen for SLUB and >> >may take really long for SLAB. Of course, we can forbid SL[AU]B >> >algorithm to stock pages in dead caches, but it looks ugly and has >> >negative impact on performance (I did this, but finally decided to >> >revert). Another approach could be scanning dead caches periodically >> >or on memory pressure, but that would be ugly too. >> >> Not sure about slub, but for SLAB doesn't cache_reap take care of that? > > It is, but it takes some time. If we decide to throttle it, then it'll > take even longer. 
Anyway, SLUB has nothing like that, therefore we'd > have to handle different algorithms in different ways, which I > particularly dislike. > >> >> > >> > - The arrays for storing per-memcg copies can get really large, >> >especially if we finally decide to leave dead memory cgroups hanging >> >until memory pressure reaps objects assigned to them and let them >> >free. How can we deal with an array of, say, 20K elements? Simply >> >allocating them with kmal^W vmalloc will result in memory wastes. It >> >will be particularly funny if the user wants to provide each cgroup >> >with a separate mount point: each super block will have a list_lru >> >for every memory cgroup, but only one of them will be really used. >> >That said we need a kind of dynamic reclaimable arrays. Radix trees >> >would fit, but they are way slower than plain arrays, which is a >> >no-go, because we want to look up on each kmalloc, list_lru_add/del, >> >which are fast paths. >> >> The initial design we had was to have an array indexed by "cache id" >> in struct memcg, instead of the current array indexed by "css id" in >> struct kmem_cache. >> The initial design doesn't have the problem you're describing here, as >> far as I can tell. > > It is indexed by "cache id", not "css id", but it doesn't matter > actually. Suppose, when a cgroup is taken offli
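To make the two array layouts being compared concrete, a rough sketch (mine, with invented sizes; not kernel code):

struct kmem_cache;	/* opaque for the purposes of this sketch */

/* Current scheme: every root kmem_cache carries an array of per-memcg
 * clones, indexed by css/memcg id and consulted on each allocation. */
struct kmem_cache_with_clones {
	struct kmem_cache *memcg_caches[20000];	/* "an array of, say, 20K elements" */
};

/* Alternative raised above: each memcg carries an array indexed by cache
 * id, so the per-memcg entries live and die with the memcg itself. */
struct mem_cgroup_with_caches {
	struct kmem_cache *caches[2048];
};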
Re: [patch] mm: memcontrol: support transparent huge pages under pressure
On Fri, Sep 19 2014, Johannes Weiner wrote: > In a memcg with even just moderate cache pressure, success rates for > transparent huge page allocations drop to zero, wasting a lot of > effort that the allocator puts into assembling these pages. > > The reason for this is that the memcg reclaim code was never designed > for higher-order charges. It reclaims in small batches until there is > room for at least one page. Huge pages charges only succeed when > these batches add up over a series of huge faults, which is unlikely > under any significant load involving order-0 allocations in the group. > > Remove that loop on the memcg side in favor of passing the actual > reclaim goal to direct reclaim, which is already set up and optimized > to meet higher-order goals efficiently. > > This brings memcg's THP policy in line with the system policy: if the > allocator painstakingly assembles a hugepage, memcg will at least make > an honest effort to charge it. As a result, transparent hugepage > allocation rates amid cache activity are drastically improved: > > vanilla patched > pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%) > pgfault 491370.60 ( +0.00%)225477.40 ( -54.11%) > pgmajfault2.00 ( +0.00%) 1.80 ( -6.67%) > thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%) > thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%) > > [ Note: this may in turn increase memory consumption from internal > fragmentation, which is an inherent risk of transparent hugepages. > Some setups may have to adjust the memcg limits accordingly to > accomodate this - or, if the machine is already packed to capacity, > disable the transparent huge page feature. ] We're using an earlier version of this patch, so I approve of the general direction. But I have some feedback. The memsw aspect of this change seems somewhat separate. Can it be split into a different patch? The memsw aspect of this patch seems to change behavior. Is this intended? If so, a mention of it in the commit log would assuage the reader. I'll explain... Assume a machine with swap enabled and res.limit==memsw.limit, thus memsw_is_minimum is true. My understanding is that memsw.usage represents sum(ram_usage, swap_usage). So when memsw_is_minimum=true, then both swap_usage=0 and memsw.usage==res.usage. In this condition, if res usage is at limit then there's no point in swapping because memsw.usage is already maximal. Prior to this patch I think the kernel did the right thing, but not afterwards. Before this patch: if res.usage == res.limit, try_charge() indirectly calls try_to_free_mem_cgroup_pages(noswap=true) After this patch: if res.usage == res.limit, try_charge() calls try_to_free_mem_cgroup_pages(may_swap=true) Notice the inverted swap-is-allowed value. I haven't had time to look at your other outstanding memcg patches. These comments were made with this patch in isolation. 
> Signed-off-by: Johannes Weiner > --- > include/linux/swap.h | 6 ++-- > mm/memcontrol.c | 86 > +++- > mm/vmscan.c | 7 +++-- > 3 files changed, 25 insertions(+), 74 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index ea4f926e6b9b..37a585beef5c 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct > page *page, > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > gfp_t gfp_mask, nodemask_t *mask); > extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, > - gfp_t gfp_mask, bool noswap); > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, > + unsigned long nr_pages, > + gfp_t gfp_mask, > + bool may_swap); > extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, > gfp_t gfp_mask, bool noswap, > struct zone *zone, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 9431024e490c..e2def11f1ec1 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -315,9 +315,6 @@ struct mem_cgroup { > /* OOM-Killer disable */ > int oom_kill_disable; > > - /* set when res.limit == memsw.limit */ > - boolmemsw_is_minimum; > - > /* protect arrays of thresholds */ > struct mutex thresholds_lock; > > @@ -481,14 +478,6 @@ enum res_type { > #define OOM_CONTROL (0) > > /* > - * Reclaim f
Re: [patch] mm: memcontrol: support transparent huge pages under pressure
On Tue, Sep 23 2014, Johannes Weiner wrote: > On Mon, Sep 22, 2014 at 10:52:50PM -0700, Greg Thelen wrote: >> >> On Fri, Sep 19 2014, Johannes Weiner wrote: >> >> > In a memcg with even just moderate cache pressure, success rates for >> > transparent huge page allocations drop to zero, wasting a lot of >> > effort that the allocator puts into assembling these pages. >> > >> > The reason for this is that the memcg reclaim code was never designed >> > for higher-order charges. It reclaims in small batches until there is >> > room for at least one page. Huge pages charges only succeed when >> > these batches add up over a series of huge faults, which is unlikely >> > under any significant load involving order-0 allocations in the group. >> > >> > Remove that loop on the memcg side in favor of passing the actual >> > reclaim goal to direct reclaim, which is already set up and optimized >> > to meet higher-order goals efficiently. >> > >> > This brings memcg's THP policy in line with the system policy: if the >> > allocator painstakingly assembles a hugepage, memcg will at least make >> > an honest effort to charge it. As a result, transparent hugepage >> > allocation rates amid cache activity are drastically improved: >> > >> > vanilla patched >> > pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%) >> > pgfault 491370.60 ( +0.00%)225477.40 ( -54.11%) >> > pgmajfault2.00 ( +0.00%) 1.80 ( -6.67%) >> > thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%) >> > thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%) >> > >> > [ Note: this may in turn increase memory consumption from internal >> > fragmentation, which is an inherent risk of transparent hugepages. >> > Some setups may have to adjust the memcg limits accordingly to >> > accomodate this - or, if the machine is already packed to capacity, >> > disable the transparent huge page feature. ] >> >> We're using an earlier version of this patch, so I approve of the >> general direction. But I have some feedback. >> >> The memsw aspect of this change seems somewhat separate. Can it be >> split into a different patch? >> >> The memsw aspect of this patch seems to change behavior. Is this >> intended? If so, a mention of it in the commit log would assuage the >> reader. I'll explain... Assume a machine with swap enabled and >> res.limit==memsw.limit, thus memsw_is_minimum is true. My understanding >> is that memsw.usage represents sum(ram_usage, swap_usage). So when >> memsw_is_minimum=true, then both swap_usage=0 and >> memsw.usage==res.usage. In this condition, if res usage is at limit >> then there's no point in swapping because memsw.usage is already >> maximal. Prior to this patch I think the kernel did the right thing, >> but not afterwards. >> >> Before this patch: >> if res.usage == res.limit, try_charge() indirectly calls >> try_to_free_mem_cgroup_pages(noswap=true) >> >> After this patch: >> if res.usage == res.limit, try_charge() calls >> try_to_free_mem_cgroup_pages(may_swap=true) >> >> Notice the inverted swap-is-allowed value. > > For some reason I had myself convinced that this is dead code due to a > change in callsites a long time ago, but you are right that currently > try_charge() relies on it, thanks for pointing it out. > > However, memsw is always equal to or bigger than the memory limit - so > instead of keeping a separate state variable to track when memory > failure implies memsw failure, couldn't we just charge memsw first? > > How about the following? But yeah, I'd split this into a separate > patch now. Looks good to me. 
Thanks. Acked-by: Greg Thelen > --- > mm/memcontrol.c | 15 --- > 1 file changed, 8 insertions(+), 7 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e2def11f1ec1..7c9a8971d0f4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2497,16 +2497,17 @@ retry: > goto done; > > size = batch * PAGE_SIZE; > - if (!res_counter_charge(&memcg->res, size, &fail_res)) { > - if (!do_swap_account) > + if (!do_swap_account || > + !res_counter_charge(&memcg->memsw, size, &fail_res)) { > + if (!res_counter_charge(&
Re: [PATCH v2 0/4] memcg: Low-limit reclaim
On Mon, Apr 28 2014, Roman Gushchin wrote: > 28.04.2014, 16:27, "Michal Hocko" : >> The series is based on top of the current mmotm tree. Once the series >> gets accepted I will post a patch which will mark the soft limit as >> deprecated with a note that it will be eventually dropped. Let me know >> if you would prefer to have such a patch a part of the series. >> >> Thoughts? > > > Looks good to me. > > The only question is: are there any ideas how the hierarchy support > will be used in this case in practice? > Will someone set low limit for non-leaf cgroups? Why? > > Thanks, > Roman I imagine that a hosting service may want to give X MB to a top level memcg (/a) with sub-jobs (/a/b, /a/c) which may(not) have their own low-limits. Examples: case_1) only set low limit on /a. /a/b and /a/c may overcommit /a's memory (b.limit_in_bytes + c.limit_in_bytes > a.limit_in_bytes). case_2) low limits on all memcg. But not overcommitting low_limits (b.low_limit_in_bytes + c.low_limit_in_bytes <= a.low_limit_in_bytes).
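Putting example numbers on those two cases (my own illustration, using the low_limit_in_bytes naming from the text above):

	case_1: a.low_limit_in_bytes = 800M, no low limits on b or c;
	        b.limit_in_bytes = 700M and c.limit_in_bytes = 700M, so
	        700M + 700M > a.limit_in_bytes = 1000M: the children
	        overcommit /a, yet /a as a whole is still protected up to 800M.

	case_2: a.low_limit_in_bytes = 800M, b.low_limit_in_bytes = 500M,
	        c.low_limit_in_bytes = 300M; 500M + 300M <= 800M, so the
	        children's combined protection never exceeds the parent's.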
Re: [PATCH] memcg: deprecate memory.force_empty knob
On Tue, May 13 2014, Michal Hocko wrote: > force_empty has been introduced primarily to drop memory before it gets > reparented on the group removal. This alone doesn't sound fully > justified because reparented pages which are not in use can be reclaimed > also later when there is a memory pressure on the parent level. > > Mark the knob CFTYPE_INSANE which tells the cgroup core that it > shouldn't create the knob with the experimental sane_behavior. Other > users will get informed about the deprecation and asked to tell us more > because I do not expect most users will use sane_behavior cgroups mode > very soon. > Anyway I expect that most users will be simply cgroup remove handlers > which do that since ever without having any good reason for it. > > If somebody really cares because reparented pages, which would be > dropped otherwise, push out more important ones then we should fix the > reparenting code and put pages to the tail. I should mention a case where I've needed to use memory.force_empty: to synchronously flush stats from child to parent. Without force_empty memory.stat is temporarily inconsistent until async css_offline reparents charges. Here is an example on v3.14 showing that parent/memory.stat contents are in-flux immediately after rmdir of parent/child. $ cat /test #!/bin/bash # Create parent and child. Add some non-reclaimable anon rss to child, # then move running task to parent. mkdir p p/c (echo $BASHPID > p/c/cgroup.procs && exec sleep 1d) & pid=$! sleep 1 echo $pid > p/cgroup.procs grep 'rss ' {p,p/c}/memory.stat if [[ $1 == force ]]; then echo 1 > p/c/memory.force_empty fi rmdir p/c echo 'For a small time the p/c memory has not been reparented to p.' grep 'rss ' {p,p/c}/memory.stat sleep 1 echo 'After waiting all memory has been reparented' grep 'rss ' {p,p/c}/memory.stat kill $pid rmdir p -- First, demonstrate that just rmdir, without memory.force_empty, temporarily hides reparented child memory stats. $ /test p/memory.stat:rss 0 p/memory.stat:total_rss 69632 p/c/memory.stat:rss 69632 p/c/memory.stat:total_rss 69632 For a small time the p/c memory has not been reparented to p. p/memory.stat:rss 0 p/memory.stat:total_rss 0 grep: p/c/memory.stat: No such file or directory After waiting all memory has been reparented p/memory.stat:rss 69632 p/memory.stat:total_rss 69632 grep: p/c/memory.stat: No such file or directory /test: Terminated ( echo $BASHPID > p/c/cgroup.procs && exec sleep 1d ) -- Demonstrate that using memory.force_empty before rmdir, behaves more sensibly. Stats for reparented child memory are not hidden. $ /test force p/memory.stat:rss 0 p/memory.stat:total_rss 69632 p/c/memory.stat:rss 69632 p/c/memory.stat:total_rss 69632 For a small time the p/c memory has not been reparented to p. p/memory.stat:rss 69632 p/memory.stat:total_rss 69632 grep: p/c/memory.stat: No such file or directory After waiting all memory has been reparented p/memory.stat:rss 69632 p/memory.stat:total_rss 69632 grep: p/c/memory.stat: No such file or directory /test: Terminated ( echo $BASHPID > p/c/cgroup.procs && exec sleep 1d ) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
On Wed, Mar 26 2014, Vladimir Davydov wrote: > We don't track any random page allocation, so we shouldn't track kmalloc > that falls back to the page allocator. This seems like a change which will leads to confusing (and arguably improper) kernel behavior. I prefer the behavior prior to this patch. Before this change both of the following allocations are charged to memcg (assuming kmem accounting is enabled): a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL) b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL) After this change only 'a' is charged; 'b' goes directly to page allocator which no longer does accounting. > Signed-off-by: Vladimir Davydov > Cc: Johannes Weiner > Cc: Michal Hocko > Cc: Glauber Costa > Cc: Christoph Lameter > Cc: Pekka Enberg > --- > include/linux/slab.h |2 +- > mm/memcontrol.c | 27 +-- > mm/slub.c|4 ++-- > 3 files changed, 4 insertions(+), 29 deletions(-) > > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 3dd389aa91c7..8a928ff71d93 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -363,7 +363,7 @@ kmalloc_order(size_t size, gfp_t flags, unsigned int > order) > { > void *ret; > > - flags |= (__GFP_COMP | __GFP_KMEMCG); > + flags |= __GFP_COMP; > ret = (void *) __get_free_pages(flags, order); > kmemleak_alloc(ret, size, 1, flags); > return ret; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index b4b6aef562fa..81a162d01d4d 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3528,35 +3528,10 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct > mem_cgroup **_memcg, int order) > > *_memcg = NULL; > > - /* > - * Disabling accounting is only relevant for some specific memcg > - * internal allocations. Therefore we would initially not have such > - * check here, since direct calls to the page allocator that are marked > - * with GFP_KMEMCG only happen outside memcg core. We are mostly > - * concerned with cache allocations, and by having this test at > - * memcg_kmem_get_cache, we are already able to relay the allocation to > - * the root cache and bypass the memcg cache altogether. > - * > - * There is one exception, though: the SLUB allocator does not create > - * large order caches, but rather service large kmallocs directly from > - * the page allocator. Therefore, the following sequence when backed by > - * the SLUB allocator: > - * > - * memcg_stop_kmem_account(); > - * kmalloc() > - * memcg_resume_kmem_account(); > - * > - * would effectively ignore the fact that we should skip accounting, > - * since it will drive us directly to this function without passing > - * through the cache selector memcg_kmem_get_cache. Such large > - * allocations are extremely rare but can happen, for instance, for the > - * cache arrays. We bring this test here. 
> - */ > - if (!current->mm || current->memcg_kmem_skip_account) > + if (!current->mm) > return true; > > memcg = get_mem_cgroup_from_mm(current->mm); > - > if (!memcg_can_account_kmem(memcg)) { > css_put(&memcg->css); > return true; > diff --git a/mm/slub.c b/mm/slub.c > index 5e234f1f8853..c2e58a787443 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -3325,7 +3325,7 @@ static void *kmalloc_large_node(size_t size, gfp_t > flags, int node) > struct page *page; > void *ptr = NULL; > > - flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG; > + flags |= __GFP_COMP | __GFP_NOTRACK; > page = alloc_pages_node(node, flags, get_order(size)); > if (page) > ptr = page_address(page); > @@ -3395,7 +3395,7 @@ void kfree(const void *x) > if (unlikely(!PageSlab(page))) { > BUG_ON(!PageCompound(page)); > kfree_hook(x); > - __free_memcg_kmem_pages(page, compound_order(page)); > + __free_pages(page, compound_order(page)); > return; > } > slab_free(page->slab_cache, page, object, _RET_IP_); > -- > 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] memcg: remove lookup_cgroup_page() prototype
6b208e3f6e35 ("mm: memcg: remove unused node/section info from pc->flags") deleted the lookup_cgroup_page() function but left a prototype for it. Kill the vestigial prototype. Signed-off-by: Greg Thelen --- include/linux/page_cgroup.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 777a524716db..0ff470de3c12 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -42,7 +42,6 @@ static inline void __init page_cgroup_init(void) #endif struct page_cgroup *lookup_page_cgroup(struct page *page); -struct page *lookup_cgroup_page(struct page_cgroup *pc); #define TESTPCGFLAG(uname, lname) \ static inline int PageCgroup##uname(struct page_cgroup *pc)\ -- 2.0.0.526.g5318336 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/5] Virtual Memory Resource Controller for cgroups
On Wed, Jul 9, 2014 at 9:36 AM, Vladimir Davydov wrote: > Hi Tim, > > On Wed, Jul 09, 2014 at 08:08:07AM -0700, Tim Hockin wrote: >> How is this different from RLIMIT_AS? You specifically mentioned it >> earlier but you don't explain how this is different. > > The main difference is that RLIMIT_AS is per process while this > controller is per cgroup. RLIMIT_AS doesn't allow us to limit VSIZE for > a group of unrelated or cooperating through shmem processes. > > Also RLIMIT_AS accounts for total VM usage (including file mappings), > while this only charges private writable and shared mappings, whose > faulted-in pages always occupy mem+swap and therefore cannot be just > synced and dropped like file pages. In other words, this controller > works exactly as the global overcommit control. > >> From my perspective, this is pointless. There's plenty of perfectly >> correct software that mmaps files without concern for VSIZE, because >> they never fault most of those pages in. > > But there's also software that correctly handles ENOMEM returned by > mmap. For example, mongodb keeps growing its buffers until mmap fails. > Therefore, if there's no overcommit control, it will be OOM-killed > sooner or later, which may be pretty annoying. And we did have customers > complaining about that. Is mongodb's buffer growth causing the oom kills? If yes, I wonder if apps, like mongodb, that want ENOMEM should (1) use MAP_POPULATE and (2) we change vm_map_pgoff() to propagate mm_populate() ENOMEM failures back to mmap()? >> From my observations it is not generally possible to predict an >> average VSIZE limit that would satisfy your concerns *and* not kill >> lots of valid apps. > > Yes, it's difficult. Actually, we can only guess. Nevertheless, we > predict and set the VSIZE limit system-wide by default. > >> It sounds like what you want is to limit or even disable swap usage. > > I want to avoid OOM kill if it's possible to return ENOMEM. OOM can be > painful. It can kill lots of innocent processes. Of course, the user can > protect some processes by setting oom_score_adj, but this is difficult > and requires time and expertise, so an average user won't do that. > >> Given your example, your hypothetical user would probably be better of >> getting an OOM kill early so she can fix her job spec to request more >> memory. > > In my example the user won't get OOM kill *early*... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
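To make the per-process vs. per-cgroup distinction above concrete, a small sketch; ./worker is just a stand-in for any memory-mapping program:

# RLIMIT_AS caps each process separately: the two workers below may map
# 1 GB each, 2 GB combined, no matter how much they share via shmem.
( ulimit -v 1048576; exec ./worker ) &    # ulimit -v takes KiB
( ulimit -v 1048576; exec ./worker ) &
# A per-cgroup VM controller would instead bound the sum of the group's
# private writable and shared mappings with a single limit.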
Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
On Fri, Jun 06 2014, Michal Hocko wrote: > Some users (e.g. Google) would like to have stronger semantic than low > limit offers currently. The fallback mode is not desirable and they > prefer hitting OOM killer rather than ignoring low limit for protected > groups. There are other possible usecases which can benefit from hard > guarantees. I can imagine workloads where setting low_limit to the same > value as hard_limit to prevent from any reclaim at all makes a lot of > sense because reclaim is much more disrupting than restart of the load. > > This patch adds a new per memcg memory.reclaim_strategy knob which > tells what to do in a situation when memory reclaim cannot do any > progress because all groups in the reclaimed hierarchy are within their > low_limit. There are two options available: > - low_limit_best_effort - the current mode when reclaim falls > back to the even reclaim of all groups in the reclaimed > hierarchy > - low_limit_guarantee - groups within low_limit are never > reclaimed and OOM killer is triggered instead. OOM message > will mention the fact that the OOM was triggered due to > low_limit reclaim protection. To (a) be consistent with existing hard and soft limits APIs and (b) allow use of both best effort and guarantee memory limits, I wonder if it's best to offer three per memcg limits, rather than two limits (hard, low_limit) and a related reclaim_strategy knob. The three limits I'm thinking about are: 1) hard_limit (aka the existing limit_in_bytes cgroupfs file). No change needed here. This is an upper bound on a memcg hierarchy's memory consumption (assuming use_hierarchy=1). 2) best_effort_limit (aka desired working set). This allow an application or administrator to provide a hint to the kernel about desired working set size. Before oom'ing the kernel is allowed to reclaim below this limit. I think the current soft_limit_in_bytes claims to provide this. If we prefer to deprecate soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a hopefully better named) API seems reasonable. 3) low_limit_guarantee which is a lower bound of memory usage. A memcg would prefer to be oom killed rather than operate below this threshold. Default value is zero to preserve compatibility with existing apps. Logically hard_limit >= best_effort_limit >= low_limit_guarantee. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
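A sketch of how the three limits proposed above might look in cgroupfs; only memory.limit_in_bytes exists today, the other two file names are hypothetical, and the sizes are illustrative:

cd /sys/fs/cgroup/memory/job
# 1) hard limit: existing upper bound on the hierarchy's usage.
echo 1000M > memory.limit_in_bytes
# 2) best-effort limit / desired working set: reclaim below this is allowed
#    before resorting to OOM.
echo 800M > memory.desired_working_set_in_bytes
# 3) low limit guarantee: prefer an OOM kill over reclaiming below this.
echo 500M > memory.low_limit_guarantee_in_bytes

As noted above, such a configuration only makes sense with hard_limit >= best_effort_limit >= low_limit_guarantee.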
Re: [PATCH 2/2] memcg: Allow hard guarantee mode for low limit reclaim
On Tue, Jun 10 2014, Johannes Weiner wrote: > On Mon, Jun 09, 2014 at 03:52:51PM -0700, Greg Thelen wrote: >> >> On Fri, Jun 06 2014, Michal Hocko wrote: >> >> > Some users (e.g. Google) would like to have stronger semantic than low >> > limit offers currently. The fallback mode is not desirable and they >> > prefer hitting OOM killer rather than ignoring low limit for protected >> > groups. There are other possible usecases which can benefit from hard >> > guarantees. I can imagine workloads where setting low_limit to the same >> > value as hard_limit to prevent from any reclaim at all makes a lot of >> > sense because reclaim is much more disrupting than restart of the load. >> > >> > This patch adds a new per memcg memory.reclaim_strategy knob which >> > tells what to do in a situation when memory reclaim cannot do any >> > progress because all groups in the reclaimed hierarchy are within their >> > low_limit. There are two options available: >> >- low_limit_best_effort - the current mode when reclaim falls >> > back to the even reclaim of all groups in the reclaimed >> > hierarchy >> >- low_limit_guarantee - groups within low_limit are never >> > reclaimed and OOM killer is triggered instead. OOM message >> > will mention the fact that the OOM was triggered due to >> > low_limit reclaim protection. >> >> To (a) be consistent with existing hard and soft limits APIs and (b) >> allow use of both best effort and guarantee memory limits, I wonder if >> it's best to offer three per memcg limits, rather than two limits (hard, >> low_limit) and a related reclaim_strategy knob. The three limits I'm >> thinking about are: >> >> 1) hard_limit (aka the existing limit_in_bytes cgroupfs file). No >>change needed here. This is an upper bound on a memcg hierarchy's >>memory consumption (assuming use_hierarchy=1). > > This creates internal pressure. Outside reclaim is not affected by > it, but internal charges can not exceed this limit. This is set to > hard limit the maximum memory consumption of a group (max). > >> 2) best_effort_limit (aka desired working set). This allow an >>application or administrator to provide a hint to the kernel about >>desired working set size. Before oom'ing the kernel is allowed to >>reclaim below this limit. I think the current soft_limit_in_bytes >>claims to provide this. If we prefer to deprecate >>soft_limit_in_bytes, then a new desired_working_set_in_bytes (or a >>hopefully better named) API seems reasonable. > > This controls how external pressure applies to the group. > > But it's conceivable that we'd like to have the equivalent of such a > soft limit for *internal* pressure. Set below the hard limit, this > internal soft limit would have charges trigger direct reclaim in the > memcg but allow them to continue to the hard limit. This would create > a situation wherein the allocating tasks are not killed, but throttled > under reclaim, which gives the administrator a window to detect the > situation with vmpressure and possibly intervene. Because as it > stands, once the current hard limit is hit things can go down pretty > fast and the window for reacting to vmpressure readings is often too > small. This would offer a more gradual deterioration. It would be > set to the upper end of the working set size range (high). > > I think for many users such an internal soft limit would actually be > preferred over the current hard limit, as they'd rather have some > reclaim throttling than an OOM kill when the group reaches its upper > bound. 
The current hard limit would be reserved for more advanced or > paid cases, where the admin would rather see a memcg get OOM killed > than exceed a certain size. > > Then, as you proposed, we'd have the soft limit for external pressure, > where the kernel only reclaims groups within that limit in order to > avoid OOM kills. It would be set to the estimated lower end of the > working set size range (low). > >> 3) low_limit_guarantee which is a lower bound of memory usage. A memcg >>would prefer to be oom killed rather than operate below this >>threshold. Default value is zero to preserve compatibility with >>existing apps. > > And this would be the external pressure hard limit, which would be set > to the absolute minimum requirement of the group (min). > > Either because it would be hopelessly thrashing without it, or because > this guaranteed memory is actually paid for. Again, I
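Putting the four values discussed in this mail side by side as a sketch; the file names below are just shorthand for the min/low/high/max roles described above, not an existing knob set, and the sizes are illustrative:

echo 400M > memory.min     # external hard guarantee: OOM rather than reclaim below this
echo 600M > memory.low     # external soft guarantee: reclaimed below this only to avoid OOM
echo 900M > memory.high    # internal soft limit: charges above this throttle into direct reclaim
echo 1000M > memory.max    # internal hard limit: charges above this OOM the group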
Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov wrote: > Hi Greg, > > On 03/27/2014 08:31 AM, Greg Thelen wrote: >> On Wed, Mar 26 2014, Vladimir Davydov wrote: >> >>> We don't track any random page allocation, so we shouldn't track kmalloc >>> that falls back to the page allocator. >> This seems like a change which will leads to confusing (and arguably >> improper) kernel behavior. I prefer the behavior prior to this patch. >> >> Before this change both of the following allocations are charged to >> memcg (assuming kmem accounting is enabled): >> a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL) >> b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL) >> >> After this change only 'a' is charged; 'b' goes directly to page >> allocator which no longer does accounting. > > Why do we need to charge 'b' in the first place? Can the userspace > trigger such allocations massively? If there can only be one or two such > allocations from a cgroup, is there any point in charging them? Of the top of my head I don't know of any >8KIB kmalloc()s so I can't say if they're directly triggerable by user space en masse. But we recently ran into some order:3 allocations in networking. The networking allocations used a non-generic kmem_cache (rather than kmalloc which started this discussion). For details, see ed98df3361f0 ("net: use __GFP_NORETRY for high order allocations"). I can't say if such allocations exist in device drivers, but given the networking example, it's conceivable that they may (or will) exist. With slab this isn't a problem because sla has kmalloc kmem_caches for all supported allocation sizes. However, slub shows this issue for any kmalloc() allocations larger than 8KIB (at least on x86_64). It seems like a strange directly to take kmem accounting to say that kmalloc allocations are kmem limited, but only if they are either less than a threshold size or done with slab. Simply increasing the size of a data structure doesn't seem like it should automatically cause the memory to become exempt from kmem limits. > In fact, do we actually need to charge every random kmem allocation? I > guess not. For instance, filesystems often allocate data shared among > all the FS users. It's wrong to charge such allocations to a particular > memcg, IMO. That said the next step is going to be adding a per kmem > cache flag specifying if allocations from this cache should be charged > so that accounting will work only for those caches that are marked so > explicitly. It's a question of what direction to approach kmem slab accounting from: either opt-out (as the code currently is), or opt-in (with per kmem_cache flags as you suggest). I agree that some structures end up being shared (e.g. filesystem block bit map structures). In an opt-out system these are charged to a memcg initially and remain charged there until the memcg is deleted at which point the shared objects are reparented to a shared location. While this isn't perfect, it's unclear if it's better or worse than analyzing each class of allocation and deciding if they should be opt'd-in. One could (though I'm not) make the case that even dentries are easily shareable between containers and thus shouldn't be accounted to a single memcg. But given user space's ability to DoS a machine with dentires, they should be accounted. > There is one more argument for removing kmalloc_large accounting - we > don't have an easy way to track such allocations, which prevents us from > reparenting kmemcg charges on css offline. 
Of course, we could link > kmalloc_large pages in some sort of per-memcg list which would allow us > to find them on css offline, but I don't think such a complication is > justified. I assume that reparenting of such non kmem_cache allocations (e.g. large kmalloc) is difficult because such pages refer to the memcg, which we're trying to delete and the memcg has no index of such pages. If such zombie memcg are undesirable, then an alternative to indexing the pages is to define a kmem context object which such large pages point to. The kmem context would be reparented without needing to adjust the individual large pages. But there are plenty of options. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 0/4] memcg: Low-limit reclaim
On Wed, May 28 2014, Johannes Weiner wrote: > On Wed, May 28, 2014 at 04:21:44PM +0200, Michal Hocko wrote: >> On Wed 28-05-14 09:49:05, Johannes Weiner wrote: >> > On Wed, May 28, 2014 at 02:10:23PM +0200, Michal Hocko wrote: >> > > Hi Andrew, Johannes, >> > > >> > > On Mon 28-04-14 14:26:41, Michal Hocko wrote: >> > > > This patchset introduces such low limit that is functionally similar >> > > > to a minimum guarantee. Memcgs which are under their lowlimit are not >> > > > considered eligible for the reclaim (both global and hardlimit) unless >> > > > all groups under the reclaimed hierarchy are below the low limit when >> > > > all of them are considered eligible. >> > > > >> > > > The previous version of the patchset posted as a RFC >> > > > (http://marc.info/?l=linux-mm&m=138677140628677&w=2) suggested a >> > > > hard guarantee without any fallback. More discussions led me to >> > > > reconsidering the default behavior and come up a more relaxed one. The >> > > > hard requirement can be added later based on a use case which really >> > > > requires. It would be controlled by memory.reclaim_flags knob which >> > > > would specify whether to OOM or fallback (default) when all groups are >> > > > bellow low limit. >> > > >> > > It seems that we are not in a full agreement about the default behavior >> > > yet. Johannes seems to be more for hard guarantee while I would like to >> > > see the weaker approach first and move to the stronger model later. >> > > Johannes, is this absolutely no-go for you? Do you think it is seriously >> > > handicapping the semantic of the new knob? >> > >> > Well we certainly can't start OOMing where we previously didn't, >> > that's called a regression and automatically limits our options. >> > >> > Any unexpected OOMs will be much more acceptable from a new feature >> > than from configuration that previously "worked" and then stopped. >> >> Yes and we are not talking about regressions, are we? >> >> > > My main motivation for the weaker model is that it is hard to see all >> > > the corner case right now and once we hit them I would like to see a >> > > graceful fallback rather than fatal action like OOM killer. Besides that >> > > the usaceses I am mostly interested in are OK with fallback when the >> > > alternative would be OOM killer. I also feel that introducing a knob >> > > with a weaker semantic which can be made stronger later is a sensible >> > > way to go. >> > >> > We can't make it stronger, but we can make it weaker. >> >> Why cannot we make it stronger by a knob/configuration option? > > Why can't we make it weaker by a knob? Why should we design the > default for unforeseeable cornercases rather than make the default > make sense for existing cases and give cornercases a fallback once > they show up? My 2c... The following works for my use cases: 1) introduce memory.low_limit_in_bytes (default=0 thus no default change from older kernels) 2) interested users will set low_limit_in_bytes to non-zero value. Memory protected by low limit should be as migratable/reclaimable as mlock memory. If a zone full of mlock memory causes oom kills, then so should the low limit. If we find corner cases where low_limit_in_bytes is too strict, then we could discuss a new knob to relax it. But I think we should start with a strict low-limit. If the oom killer gets tied in knots due to low limit, then I'd like to explore fixing the oom killer before relaxing low limit. Disclaimer: new use cases will certainly appear with various requirements. 
But an oom-killing low_limit_in_bytes seems like a generic opt-in feature, so I think it's worthwhile. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
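A minimal opt-in sketch, again assuming the proposed memory.low_limit_in_bytes knob; the job name and size are illustrative:

# The default low limit is 0, so existing setups see no behavior change.
mkdir /sys/fs/cgroup/memory/job
echo 500M > /sys/fs/cgroup/memory/job/memory.low_limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/job/cgroup.procs
# The job's first 500M is now protected from external reclaim; if that
# protection corners the system, the OOM killer runs rather than the limit
# being silently ignored.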
Re: [patch v3 2/6] mm, compaction: return failed migration target pages back to freelist
On Wed, May 07 2014, Andrew Morton wrote: > On Tue, 6 May 2014 19:22:43 -0700 (PDT) David Rientjes > wrote: > >> Memory compaction works by having a "freeing scanner" scan from one end of a >> zone which isolates pages as migration targets while another "migrating >> scanner" >> scans from the other end of the same zone which isolates pages for migration. >> >> When page migration fails for an isolated page, the target page is returned >> to >> the system rather than the freelist built by the freeing scanner. This may >> require the freeing scanner to continue scanning memory after suitable >> migration >> targets have already been returned to the system needlessly. >> >> This patch returns destination pages to the freeing scanner freelist when >> page >> migration fails. This prevents unnecessary work done by the freeing scanner >> but >> also encourages memory to be as compacted as possible at the end of the zone. >> >> Reported-by: Greg Thelen > > What did Greg actually report? IOW, what if any observable problem is > being fixed here? I detected the problem at runtime seeing that ext4 metadata pages (esp the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were constantly visited by compaction calls of migrate_pages(). These pages had a non-zero b_count which caused fallback_migrate_page() -> try_to_release_page() -> try_to_free_buffers() to fail. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 0/4] memcg: Low-limit reclaim
On Wed, Dec 11 2013, Michal Hocko wrote: > Hi, > previous discussions have shown that soft limits cannot be reformed > (http://lwn.net/Articles/555249/). This series introduces an alternative > approach to protecting memory allocated to processes executing within > a memory cgroup controller. It is based on a new tunable that was > discussed with Johannes and Tejun held during the last kernel summit. > > This patchset introduces such low limit that is functionally similar to a > minimum guarantee. Memcgs which are under their lowlimit are not considered > eligible for the reclaim (both global and hardlimit). The default value of > the limit is 0 so all groups are eligible by default and an interested > party has to explicitly set the limit. > > The primary use case is to protect an amount of memory allocated to a > workload without it being reclaimed by an unrelated activity. In some > cases this requirement can be fulfilled by mlock but it is not suitable > for many loads and generally requires application awareness. Such > application awareness can be complex. It effectively forbids the > use of memory overcommit as the application must explicitly manage > memory residency. > With low limits, such workloads can be placed in a memcg with a low > limit that protects the estimated working set. > > Another use case might be unreclaimable groups. Some loads might be so > sensitive to reclaim that it is better to kill and start it again (or > since checkpoint) rather than trash. This would be trivial with low > limit set to unlimited and the OOM killer will handle the situation as > required (e.g. kill and restart). > > The hierarchical behavior of the lowlimit is described in the first > patch. It is followed by a direct reclaim fix which is necessary to > handle situation when a no group is eligible because all groups are > below low limit. This is not a big deal for hardlimit reclaim because > we simply retry the reclaim few times and then trigger memcg OOM killer > path. It would blow up in the global case when we would loop without > doing any progress or trigger OOM killer. I would consider configuration > leading to this state invalid but we should handle that gracefully. > > The third patch finally allows setting the lowlimit. > > The last patch tries expedites OOM if it is clear that no group is > eligible for reclaim. It basically breaks out of loops in the direct > reclaim and lets kswapd sleep because it wouldn't do any progress anyway. > > Thoughts? > > Short log says: > Michal Hocko (4): > memcg, mm: introduce lowlimit reclaim > mm, memcg: allow OOM if no memcg is eligible during direct reclaim > memcg: Allow setting low_limit > mm, memcg: expedite OOM if no memcg is reclaimable > > And a diffstat > include/linux/memcontrol.h | 14 +++ > include/linux/res_counter.h | 40 ++ > kernel/res_counter.c| 2 ++ > mm/memcontrol.c | 60 > - > mm/vmscan.c | 59 +--- > 5 files changed, 170 insertions(+), 5 deletions(-) The series looks useful. We (Google) have been using something similar. In practice such a low_limit (or memory guarantee), doesn't nest very well. Example: - parent_memcg: limit 500, low_limit 500, usage 500 1 privately charged non-reclaimable page (e.g. mlock, slab) - child_memcg: limit 500, low_limit 500, usage 499 If a streaming file cache workload (e.g. sha1sum) starts gobbling up page cache it will lead to an oom kill instead of reclaiming. One could argue that this is working as intended because child_memcg was promised 500 but can only get 499. 
So child_memcg is oom killed rather than being forced to operate below its promised low limit. This has led to various internal workarounds like: - don't charge any memory to interior tree nodes (e.g. parent_memcg); only charge memory to cgroup leafs. This gets tricky when dealing with reparented memory inherited to parent from child during cgroup deletion. - don't set low_limit on non leafs (e.g. do not set low limit on parent_memcg). This constrains the cgroup layout a bit. Some customers want to purchase $MEM and setup their workload with a few child cgroups. A system daemon hands out $MEM by setting low_limit for top-level containers (e.g. parent_memcg). Thereafter such customers are able to partition their workload with sub memcg below child_memcg. Example: parent_memcg \ child_memcg / \ server backup Thereafter customers often want some weak isolation between server and backup. To avoid undesired oom kills the server/backup isolation is provided with a softer memory guarantee (e.g. soft_limit). The soft limit acts like the low_limit until priority becomes desperate. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a mess
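A sketch of the configuration described above, assuming the proposed memory.low_limit_in_bytes knob; the sizes and the data file are illustrative:

cd /sys/fs/cgroup/memory
mkdir parent_memcg
echo 1 > parent_memcg/memory.use_hierarchy
mkdir parent_memcg/child_memcg
echo 500M > parent_memcg/memory.limit_in_bytes
echo 500M > parent_memcg/memory.low_limit_in_bytes
echo 500M > parent_memcg/child_memcg/memory.limit_in_bytes
echo 500M > parent_memcg/child_memcg/memory.low_limit_in_bytes
echo $$ > parent_memcg/child_memcg/cgroup.procs

# With the parent's usage at its limit and every page protected by a low
# limit, a streaming read cannot reclaim page cache, so it OOM kills:
sha1sum /data/big-file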
Re: [RFC 0/4] memcg: Low-limit reclaim
On Thu, Jan 30 2014, Michal Hocko wrote: > On Wed 29-01-14 11:08:46, Greg Thelen wrote: > [...] >> The series looks useful. We (Google) have been using something similar. >> In practice such a low_limit (or memory guarantee), doesn't nest very >> well. >> >> Example: >> - parent_memcg: limit 500, low_limit 500, usage 500 >> 1 privately charged non-reclaimable page (e.g. mlock, slab) >> - child_memcg: limit 500, low_limit 500, usage 499 > > I am not sure this is a good example. Your setup basically say that no > single page should be reclaimed. I can imagine this might be useful in > some cases and I would like to allow it but it sounds too extreme (e.g. > a load which would start trashing heavily once the reclaim starts and it > makes more sense to start it again rather than crowl - think about some > mathematical simulation which might diverge). Pages will still be reclaimed the usage_in_bytes is exceeds limit_in_bytes. I see the low_limit as a way to tell the kernel: don't reclaim my memory due to external pressure, but internal pressure is different. >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up >> page cache it will lead to an oom kill instead of reclaiming. > > Does it make any sense to protect all of such memory although it is > easily reclaimable? I think protection makes sense in this case. If I know my workload needs 500 to operate well, then I reserve 500 using low_limit. My app doesn't want to run with less than its reservation. >> One could argue that this is working as intended because child_memcg >> was promised 500 but can only get 499. So child_memcg is oom killed >> rather than being forced to operate below its promised low limit. >> >> This has led to various internal workarounds like: >> - don't charge any memory to interior tree nodes (e.g. parent_memcg); >> only charge memory to cgroup leafs. This gets tricky when dealing >> with reparented memory inherited to parent from child during cgroup >> deletion. > > Do those need any protection at all? Interior tree nodes don't need protection from their children. But children and interior nodes need protection from siblings and parents. >> - don't set low_limit on non leafs (e.g. do not set low limit on >> parent_memcg). This constrains the cgroup layout a bit. Some >> customers want to purchase $MEM and setup their workload with a few >> child cgroups. A system daemon hands out $MEM by setting low_limit >> for top-level containers (e.g. parent_memcg). Thereafter such >> customers are able to partition their workload with sub memcg below >> child_memcg. Example: >> parent_memcg >> \ >> child_memcg >> / \ >> server backup > > I think that the low_limit makes sense where you actually want to > protect something from reclaim. And backup sounds like a bad fit for > that. The backup job would presumably have a small low_limit, but it may still have a minimum working set required to make useful forward progress. Example: parent_memcg \ child_memcg limit 500, low_limit 500, usage 500 / \ | backup limit 10, low_limit 10, usage 10 | server limit 490, low_limit 490, usage 490 One could argue that problems appear when server.low_limit+backup.lower_limit=child_memcg.limit. So the safer configuration is leave some padding: server.low_limit + backup.low_limit + padding = child_memcg.limit but this just defers the problem. As memory is reparented into parent, then padding must grow. >> Thereafter customers often want some weak isolation between server and >> backup. 
To avoid undesired oom kills the server/backup isolation is >> provided with a softer memory guarantee (e.g. soft_limit). The soft >> limit acts like the low_limit until priority becomes desperate. > > Johannes was already suggesting that the low_limit should allow for a > weaker semantic as well. I am not very much inclined to that but I can > leave with a knob which would say oom_on_lowlimit (on by default but > allowed to be set to 0). We would fallback to the full reclaim if > no groups turn out to be reclaimable. I like the strong semantic of your low_limit at least at level:1 cgroups (direct children of root). But I have also encountered situations where a strict guarantee is too strict and a mere preference is desirable. Perhaps the best plan is to continue with the proposed strict low_limit and eventually provide an additional mechanism which provides weaker guarantees (e.g. soft_limit or something else if soft_limit cannot be altered). These two would offer good support for a va
Re: [RFC 0/4] memcg: Low-limit reclaim
On Mon, Feb 03 2014, Michal Hocko wrote: > On Thu 30-01-14 16:28:27, Greg Thelen wrote: >> On Thu, Jan 30 2014, Michal Hocko wrote: >> >> > On Wed 29-01-14 11:08:46, Greg Thelen wrote: >> > [...] >> >> The series looks useful. We (Google) have been using something similar. >> >> In practice such a low_limit (or memory guarantee), doesn't nest very >> >> well. >> >> >> >> Example: >> >> - parent_memcg: limit 500, low_limit 500, usage 500 >> >> 1 privately charged non-reclaimable page (e.g. mlock, slab) >> >> - child_memcg: limit 500, low_limit 500, usage 499 >> > >> > I am not sure this is a good example. Your setup basically say that no >> > single page should be reclaimed. I can imagine this might be useful in >> > some cases and I would like to allow it but it sounds too extreme (e.g. >> > a load which would start trashing heavily once the reclaim starts and it >> > makes more sense to start it again rather than crowl - think about some >> > mathematical simulation which might diverge). >> >> Pages will still be reclaimed the usage_in_bytes is exceeds >> limit_in_bytes. I see the low_limit as a way to tell the kernel: don't >> reclaim my memory due to external pressure, but internal pressure is >> different. > > That sounds strange and very confusing to me. What if the internal > pressure comes from children memcgs? Lowlimit is intended for protecting > a group from reclaim and it shouldn't matter whether the reclaim is a > result of the internal or external pressure. > >> >> If a streaming file cache workload (e.g. sha1sum) starts gobbling up >> >> page cache it will lead to an oom kill instead of reclaiming. >> > >> > Does it make any sense to protect all of such memory although it is >> > easily reclaimable? >> >> I think protection makes sense in this case. If I know my workload >> needs 500 to operate well, then I reserve 500 using low_limit. My app >> doesn't want to run with less than its reservation. >> >> >> One could argue that this is working as intended because child_memcg >> >> was promised 500 but can only get 499. So child_memcg is oom killed >> >> rather than being forced to operate below its promised low limit. >> >> >> >> This has led to various internal workarounds like: >> >> - don't charge any memory to interior tree nodes (e.g. parent_memcg); >> >> only charge memory to cgroup leafs. This gets tricky when dealing >> >> with reparented memory inherited to parent from child during cgroup >> >> deletion. >> > >> > Do those need any protection at all? >> >> Interior tree nodes don't need protection from their children. But >> children and interior nodes need protection from siblings and parents. > > Why? They contains only reparented pages in the above case. Those would > be #1 candidate for reclaim in most cases, no? I think we're on the same page. My example interior node has reclaimed pages and is a #1 candidate for reclaim induced from charges against parent_memcg, but not a candidate for reclaim due to global memory pressure induced by a sibling of parent_memcg. >> >> - don't set low_limit on non leafs (e.g. do not set low limit on >> >> parent_memcg). This constrains the cgroup layout a bit. Some >> >> customers want to purchase $MEM and setup their workload with a few >> >> child cgroups. A system daemon hands out $MEM by setting low_limit >> >> for top-level containers (e.g. parent_memcg). Thereafter such >> >> customers are able to partition their workload with sub memcg below >> >> child_memcg. 
Example: >> >> parent_memcg >> >> \ >> >> child_memcg >> >> / \ >> >> server backup >> > >> > I think that the low_limit makes sense where you actually want to >> > protect something from reclaim. And backup sounds like a bad fit for >> > that. >> >> The backup job would presumably have a small low_limit, but it may still >> have a minimum working set required to make useful forward progress. >> >> Example: >> parent_memcg >> \ >>child_memcg limit 500, low_limit 500, usage 500 >> / \ >> | backup limit 10, low_limit 10, usage 10 >> | >> server limit 490, low_limit 490,
Re: RIP: mem_cgroup_move_account+0xf4/0x290
>> [ 7691.500464] [] ? insert_kthread_work+0x40/0x40 >> [ 7691.507335] Code: 85 f6 48 8b 55 d0 44 8b 4d c8 4c 8b 45 c0 0f 85 b3 00 >> 00 00 41 8b 4c 24 18 >> 85 c9 0f 88 a6 00 00 00 48 8b b2 30 02 00 00 45 89 ca <4c> 39 56 18 0f 8c 36 >> 01 00 00 44 89 c9 >> f7 d9 89 cf 65 48 01 7e >> [ 7691.528638] RIP [] mem_cgroup_move_account+0xf4/0x290 >> >> Add the required __this_cpu_read(). > > Sorry for my mistake and thanks for the fix up, it looks good to me. > > Reviewed-by: Sha Zhengju > > > Thanks, > Sha >> >> Signed-off-by: Johannes Weiner >> --- >> mm/memcontrol.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 4097a78..a4864b6 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct >> mem_cgroup *from, >> { >> /* Update stat data for mem_cgroup */ >> preempt_disable(); >> -WARN_ON_ONCE(from->stat->count[idx] < nr_pages); >> +WARN_ON_ONCE(__this_cpu_read(from->stat->count[idx]) < nr_pages); >> __this_cpu_add(from->stat->count[idx], -nr_pages); >> __this_cpu_add(to->stat->count[idx], nr_pages); >> preempt_enable(); > > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org";> em...@kvack.org I was just polishing up this following patch which I think is better because it should avoid spurious warnings. ---8<--- >From c1f43ef0f4cc42fb2ecaeaca71bd247365e3521e Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Fri, 25 Oct 2013 21:59:57 -0700 Subject: [PATCH] memcg: remove incorrect underflow check When a memcg is deleted mem_cgroup_reparent_charges() moves charged memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" there's bad pointer read. The goal was to check for counter underflow. The counter is a per cpu counter and there are two problems with the code: (1) per cpu access function isn't used, instead a naked pointer is used which easily causes panic. (2) the check doesn't sum all cpus Test: $ cd /sys/fs/cgroup/memory $ mkdir x $ echo 3 > /proc/sys/vm/drop_caches $ (echo $BASHPID >> x/tasks && exec cat) & [1] 7154 $ grep ^mapped x/memory.stat mapped_file 53248 $ echo 7154 > tasks $ rmdir x The fix is to remove the check. It's currently dangerous and isn't worth fixing it to use something expensive, such as percpu_counter_sum(), for each reparented page. __this_cpu_read() isn't enough to fix this because there's no guarantees of the current cpus count. The only guarantees is that the sum of all per-cpu counter is >= nr_pages. Signed-off-by: Greg Thelen --- mm/memcontrol.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 34d3ca9..aa8185c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,6 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - WARN_ON_ONCE(from->stat->count[idx] < nr_pages); __this_cpu_add(from->stat->count[idx], -nr_pages); __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0x, rather than 0x. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0x Also apply the same cast to: __this_cpu_sub() this_cpu_sub_return() and __this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0x); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0x); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h| 8 lib/percpu_test.c | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) && \ - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val)this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val)__this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) diff --git a/lib/percpu_test.c b/lib/percpu_test.c index 1ebeb44..8ab4231 100644 --- a/lib/percpu_test.c +++ b/lib/percpu_test.c @@ -118,7 +118,7 @@ static int __init percpu_test_init(void) CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); - CHECK(ul, ulong_counter, 0); + CHECK(ul, ulong_counter, 1); preempt_enable(); -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More major
[PATCH 1/3] percpu counter: test module
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen --- lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 06344d9..cee589d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate "Per cpu counter test" + depends on m && DEBUG_KERNEL + help + Enable this option to build test modules with validates per-cpu + counter operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool "Perform an atomic64_t self-test at boot" help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb..bb016e1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index 000..1ebeb44 --- /dev/null +++ b/lib/percpu_test.c @@ -0,0 +1,138 @@ +#include + +/* validate @native and @pcp counter values match @expected */ +#define CHECK(native, pcp, expected)\ + do {\ + WARN((native) != (expected),\ +"raw %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)(native), (long)(native),\ +(long)(expected), (long)(expected)); \ + WARN(__this_cpu_read(pcp) != (expected),\ +"pcp %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \ +(long)(expected), (long)(expected)); \ + } while (0) + +static DEFINE_PER_CPU(long, long_counter); +static DEFINE_PER_CPU(unsigned long, ulong_counter); + +static int __init percpu_test_init(void) +{ + /* +* volatile prevents compiler from optimizing it uses, otherwise the +* +ul_one and -ul_one below would replace with inc/dec instructions. 
+*/ + volatile unsigned int ui_one = 1; + long l = 0; + unsigned long ul = 0; + + pr_info("percpu test start\n"); + + preempt_disable(); + + l += -1; + __this_cpu_add(long_counter, -1); + CHECK(l, long_counter, -1); + + l += 1; + __this_cpu_add(long_counter, 1); + CHECK(l, long_counter, 0); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += 1UL; + __this_cpu_add(ulong_counter, 1UL); + CHECK(ul, ulong_counter, 1); + + ul += -1UL; + __this_cpu_add(ulong_counter, -1UL); + CHECK(ul, ulong_counter, 0); + + ul += -(unsigned long)1; + __this_cpu_add(ulong_counter, -(unsigned long)1); + CHECK(ul, ulong_counter, -1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= 1; + __this_cpu_dec(ulong_counter); + CHECK(ul, ulong_counter, 0x); + CHECK(ul, ulong_counter, -1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 0x1); + + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + __this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + CHECK(l, long_counter, 0x); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += ui_one; + __this_cpu_add(ulong_counter, ui_one); + CHECK(ul, ulong_counter, 1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= ui_one; + __this_cpu_sub(ulong_counter, ui_one); + CHECK(ul, ulong_counter, -1); + CHECK(ul, ulong_counter, 0x); + + ul = 3; + __this_cpu_write(ulong_counter, 3)
[PATCH 0/3] fix unsigned pcp adjustments
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic values because the negated nr_pages is not sign extended (counter is long, nr_pages is unsigned int). The memcg fix is __this_cpu_sub(counter, nr_pages). But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was implemented as __this_cpu_add(counter, -nr_pages) which suffers the same problem. Example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to: long_counter = 0 + 0xffffffff Patch 1 creates a test module for percpu counter operations which demonstrates the __this_cpu_sub() problems. This patch is independent and can be discarded if there is no interest. Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments. Patch 3 uses __this_cpu_sub() in memcg. An alternative smaller solution is for memcg to use: __this_cpu_add(counter, -(int)nr_pages) admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments. But I felt like fixing the core services to prevent this in the future. Greg Thelen (3): percpu counter: test module percpu counter: cast this_cpu_sub() adjustment memcg: use __this_cpu_sub to decrement stats arch/x86/include/asm/percpu.h | 3 +- include/linux/percpu.h| 8 +-- lib/Kconfig.debug | 9 +++ lib/Makefile | 2 + lib/percpu_test.c | 138 ++ mm/memcontrol.c | 2 +- 6 files changed, 156 insertions(+), 6 deletions(-) create mode 100644 lib/percpu_test.c -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] memcg: use __this_cpu_sub to decrement stats
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,,, == 0x100,, pages 1048576 == 0x10, == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0x. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0x, */ count is now 0x1,, instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works with the "percpu counter: cast this_cpu_sub() adjustment" patch which fixes this_cpu_sub(). Signed-off-by: Greg Thelen --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aa8185c..b7ace0f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_sub(from->stat->count[idx], nr_pages); __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); } -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
On Sun, Oct 27 2013, Tejun Heo wrote: > On Sun, Oct 27, 2013 at 05:04:29AM -0700, Andrew Morton wrote: >> On Sun, 27 Oct 2013 07:22:55 -0400 Tejun Heo wrote: >> >> > We probably want to cc stable for this and the next one. How should >> > these be routed? I can take these through percpu tree or mm works >> > too. Either way, it'd be best to route them together. >> >> Yes, all three look like -stable material to me. I'll grab them later >> in the week if you haven't ;) > > Tried to apply to percpu but the third one is a fix for a patch which > was added to -mm during v3.12-rc1, so these are yours. :) I don't object to stable for the first two non-memcg patches, but it's probably unnecessary. I should have made it more clear, but an audit of v3.12-rc6 shows that only new memcg code is affected - the new mem_cgroup_move_account_page_stat() is the only place where an unsigned adjustment is used. All other callers (e.g. shrink_dcache_sb) already use a signed adjustment, so no problems before v3.12. Though I did not audit the stable kernel trees, so there could be something hiding in there. >> The names of the first two patches distress me. They rather clearly >> assert that the code affects percpu_counter.[ch], but that is not the case. >> Massaging is needed to fix that up. > > Yeah, something like the following would be better > > percpu: add test module for various percpu operations > percpu: fix this_cpu_sub() subtrahend casting for unsigneds > memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend > casting No objection to renaming. Let me know if you want these reposted with updated titles. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] percpu counter: cast this_cpu_sub() adjustment
On Sun, Oct 27 2013, Greg Thelen wrote: > this_cpu_sub() is implemented as negation and addition. > > This patch casts the adjustment to the counter type before negation to > sign extend the adjustment. This helps in cases where the counter > type is wider than an unsigned adjustment. An alternative to this > patch is to declare such operations unsupported, but it seemed useful > to avoid surprises. > > This patch specifically helps the following example: > unsigned int delta = 1 > preempt_disable() > this_cpu_write(long_counter, 0) > this_cpu_sub(long_counter, delta) > preempt_enable() > > Before this change long_counter on a 64 bit machine ends with value > 0x, rather than 0x. This is because > this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), > which is basically: > long_counter = 0 + 0x > > Also apply the same cast to: > __this_cpu_sub() > this_cpu_sub_return() > and __this_cpu_sub_return() > > All percpu_test.ko passes, especially the following cases which > previously failed: > > l -= ui_one; > __this_cpu_sub(long_counter, ui_one); > CHECK(l, long_counter, -1); > > l -= ui_one; > this_cpu_sub(long_counter, ui_one); > CHECK(l, long_counter, -1); > CHECK(l, long_counter, 0x); > > ul -= ui_one; > __this_cpu_sub(ulong_counter, ui_one); > CHECK(ul, ulong_counter, -1); > CHECK(ul, ulong_counter, 0x); > > ul = this_cpu_sub_return(ulong_counter, ui_one); > CHECK(ul, ulong_counter, 2); > > ul = __this_cpu_sub_return(ulong_counter, ui_one); > CHECK(ul, ulong_counter, 1); > > Signed-off-by: Greg Thelen > --- > arch/x86/include/asm/percpu.h | 3 ++- > include/linux/percpu.h| 8 > lib/percpu_test.c | 2 +- > 3 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h > index 0da5200..b3e18f8 100644 > --- a/arch/x86/include/asm/percpu.h > +++ b/arch/x86/include/asm/percpu.h > @@ -128,7 +128,8 @@ do { > \ > do { \ > typedef typeof(var) pao_T__;\ > const int pao_ID__ = (__builtin_constant_p(val) && \ > - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ > + ((val) == 1 || (val) == -1)) ?\ > + (int)(val) : 0; \ > if (0) {\ > pao_T__ pao_tmp__; \ > pao_tmp__ = (val); \ > diff --git a/include/linux/percpu.h b/include/linux/percpu.h > index cc88172..c74088a 100644 > --- a/include/linux/percpu.h > +++ b/include/linux/percpu.h > @@ -332,7 +332,7 @@ do { > \ > #endif > > #ifndef this_cpu_sub > -# define this_cpu_sub(pcp, val) this_cpu_add((pcp), -(val)) > +# define this_cpu_sub(pcp, val) this_cpu_add((pcp), > -(typeof(pcp))(val)) > #endif > > #ifndef this_cpu_inc > @@ -418,7 +418,7 @@ do { > \ > # define this_cpu_add_return(pcp, val) > __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) > #endif > > -#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, -(val)) > +#define this_cpu_sub_return(pcp, val)this_cpu_add_return(pcp, > -(typeof(pcp))(val)) > #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) > #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) > > @@ -586,7 +586,7 @@ do { > \ > #endif > > #ifndef __this_cpu_sub > -# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), -(val)) > +# define __this_cpu_sub(pcp, val)__this_cpu_add((pcp), > -(typeof(pcp))(val)) > #endif > > #ifndef __this_cpu_inc > @@ -668,7 +668,7 @@ do { > \ > __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) > #endif > > -#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, > -(val)) > +#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, > -(typeof(pcp))(val)) > #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) > #define __this_cpu_d
[PATCH v2 0/3] fix unsigned pcp adjustments
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg use of __this_cpu_add(counter, -nr_pages) leads to incorrect statistic values because the negated nr_pages is not sign extended (counter is long, nr_pages is unsigned int). The memcg fix is __this_cpu_sub(counter, nr_pages). But that doesn't simply work because __this_cpu_sub(counter, nr_pages) was implemented as __this_cpu_add(counter, -nr_pages) which suffers the same problem. Example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to: long_counter = 0 + 0xffffffff v3.12-rc6 shows that only new memcg code is affected by this problem - the new mem_cgroup_move_account_page_stat() is the only place where an unsigned adjustment is used. All other callers (e.g. shrink_dcache_sb) already use a signed adjustment, so no problems before v3.12. Though I did not audit the stable kernel trees, so there could be something hiding in there. Patch 1 creates a test module for percpu operations which demonstrates the __this_cpu_sub() problems. This patch is independent and can be discarded if there is no interest. Patch 2 fixes __this_cpu_sub() to work with unsigned adjustments. Patch 3 uses __this_cpu_sub() in memcg. An alternative smaller solution is for memcg to use: __this_cpu_add(counter, -(int)nr_pages) admitting that __this_cpu_add/sub() doesn't work with unsigned adjustments. But I felt like fixing the core services to prevent this in the future. Changes from V1: - more accurate patch titles, patch logs, and test module description now referring to per cpu operations rather than per cpu counters. - move small test code update from patch 2 to patch 1 (where the test is introduced). Greg Thelen (3): percpu: add test module for various percpu operations percpu: fix this_cpu_sub() subtrahend casting for unsigneds memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting arch/x86/include/asm/percpu.h | 3 +- include/linux/percpu.h | 8 +-- lib/Kconfig.debug | 9 +++ lib/Makefile | 2 + lib/percpu_test.c | 138 ++ mm/memcontrol.c | 2 +- 6 files changed, 156 insertions(+), 6 deletions(-) create mode 100644 lib/percpu_test.c -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
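The alternative workaround mentioned above relies on the cast to int being sign extended when the negated value is widened to long. A minimal userspace sketch of that behaviour, assuming a 64-bit machine and that nr_pages always fits in an int:

#include <stdio.h>

int main(void)
{
        long counter = 0;
        unsigned int nr_pages = 1;

        /* broken: negation happens at unsigned int width, no sign extension */
        counter += -nr_pages;
        printf("-nr_pages:      0x%lx\n", counter);     /* 0xffffffff */

        /* workaround: cast first, so -(int)1 == -1 is sign extended to long */
        counter = 0;
        counter += -(int)nr_pages;
        printf("-(int)nr_pages: %ld\n", counter);       /* -1, as intended */
        return 0;
}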
[PATCH v2 2/3] percpu: fix this_cpu_sub() subtrahend casting for unsigneds
this_cpu_sub() is implemented as negation and addition. This patch casts the adjustment to the counter type before negation to sign extend the adjustment. This helps in cases where the counter type is wider than an unsigned adjustment. An alternative to this patch is to declare such operations unsupported, but it seemed useful to avoid surprises. This patch specifically helps the following example: unsigned int delta = 1 preempt_disable() this_cpu_write(long_counter, 0) this_cpu_sub(long_counter, delta) preempt_enable() Before this change long_counter on a 64 bit machine ends with value 0xffffffff, rather than 0xffffffffffffffff. This is because this_cpu_sub(pcp, delta) boils down to this_cpu_add(pcp, -delta), which is basically: long_counter = 0 + 0xffffffff Also apply the same cast to: __this_cpu_sub() __this_cpu_sub_return() this_cpu_sub_return() All percpu_test.ko passes, especially the following cases which previously failed: l -= ui_one; __this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); l -= ui_one; this_cpu_sub(long_counter, ui_one); CHECK(l, long_counter, -1); CHECK(l, long_counter, 0xffffffffffffffff); ul -= ui_one; __this_cpu_sub(ulong_counter, ui_one); CHECK(ul, ulong_counter, -1); CHECK(ul, ulong_counter, 0xffffffffffffffff); ul = this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 2); ul = __this_cpu_sub_return(ulong_counter, ui_one); CHECK(ul, ulong_counter, 1); Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- arch/x86/include/asm/percpu.h | 3 ++- include/linux/percpu.h | 8 ++++---- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 0da5200..b3e18f8 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -128,7 +128,8 @@ do { \ do { \ typedef typeof(var) pao_T__;\ const int pao_ID__ = (__builtin_constant_p(val) && \ - ((val) == 1 || (val) == -1)) ? 
(val) : 0; \ + ((val) == 1 || (val) == -1)) ?\ + (int)(val) : 0; \ if (0) {\ pao_T__ pao_tmp__; \ pao_tmp__ = (val); \ diff --git a/include/linux/percpu.h b/include/linux/percpu.h index cc88172..c74088a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -332,7 +332,7 @@ do { \ #endif #ifndef this_cpu_sub -# define this_cpu_sub(pcp, val) this_cpu_add((pcp), -(val)) +# define this_cpu_sub(pcp, val) this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef this_cpu_inc @@ -418,7 +418,7 @@ do { \ # define this_cpu_add_return(pcp, val) __pcpu_size_call_return2(this_cpu_add_return_, pcp, val) #endif -#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(val)) +#define this_cpu_sub_return(pcp, val) this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define this_cpu_inc_return(pcp) this_cpu_add_return(pcp, 1) #define this_cpu_dec_return(pcp) this_cpu_add_return(pcp, -1) @@ -586,7 +586,7 @@ do { \ #endif #ifndef __this_cpu_sub -# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(val)) +# define __this_cpu_sub(pcp, val) __this_cpu_add((pcp), -(typeof(pcp))(val)) #endif #ifndef __this_cpu_inc @@ -668,7 +668,7 @@ do { \ __pcpu_size_call_return2(__this_cpu_add_return_, pcp, val) #endif -#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, -(val)) +#define __this_cpu_sub_return(pcp, val) __this_cpu_add_return(pcp, -(typeof(pcp))(val)) #define __this_cpu_inc_return(pcp) __this_cpu_add_return(pcp, 1) #define __this_cpu_dec_return(pcp) __this_cpu_add_return(pcp, -1) -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
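A userspace model of the casting change, assuming only that the pcp argument behaves like an ordinary lvalue (the real macros dispatch on operand size, which is irrelevant to the sign-extension question); typeof is the same GCC extension the kernel macros already rely on:

#include <stdio.h>

/* old behaviour: negate at the adjustment's own width */
#define model_sub_old(counter, val)     ((counter) += -(val))
/* new behaviour: cast to the counter's type first, then negate */
#define model_sub_new(counter, val)     ((counter) += -(typeof(counter))(val))

int main(void)
{
        long counter;
        unsigned int delta = 1;

        counter = 0;
        model_sub_old(counter, delta);
        printf("old: 0x%lx\n", counter);        /* 0xffffffff: wrong */

        counter = 0;
        model_sub_new(counter, delta);
        printf("new: %ld\n", counter);          /* -1: sign extended */
        return 0;
}

Built with a plain gcc invocation, the two printed values differ only because of where the negation happens.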
[PATCH v2 1/3] percpu: add test module for various percpu operations
Tests various percpu operations. Enable with CONFIG_PERCPU_TEST=m. Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- lib/Kconfig.debug | 9 lib/Makefile | 2 + lib/percpu_test.c | 138 ++ 3 files changed, 149 insertions(+) create mode 100644 lib/percpu_test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 06344d9..9fdb452 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1472,6 +1472,15 @@ config INTERVAL_TREE_TEST help A benchmark measuring the performance of the interval tree library +config PERCPU_TEST + tristate "Per cpu operations test" + depends on m && DEBUG_KERNEL + help + Enable this option to build test module which validates per-cpu + operations. + + If unsure, say N. + config ATOMIC64_SELFTEST bool "Perform an atomic64_t self-test at boot" help diff --git a/lib/Makefile b/lib/Makefile index f3bb2cb..bb016e1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -157,6 +157,8 @@ obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o interval_tree_test-objs := interval_tree_test_main.o interval_tree.o +obj-$(CONFIG_PERCPU_TEST) += percpu_test.o + obj-$(CONFIG_ASN1) += asn1_decoder.o obj-$(CONFIG_FONT_SUPPORT) += fonts/ diff --git a/lib/percpu_test.c b/lib/percpu_test.c new file mode 100644 index 000..fcca49e --- /dev/null +++ b/lib/percpu_test.c @@ -0,0 +1,138 @@ +#include + +/* validate @native and @pcp counter values match @expected */ +#define CHECK(native, pcp, expected)\ + do {\ + WARN((native) != (expected),\ +"raw %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)(native), (long)(native),\ +(long)(expected), (long)(expected)); \ + WARN(__this_cpu_read(pcp) != (expected),\ +"pcp %ld (0x%lx) != expected %ld (0x%lx)", \ +(long)__this_cpu_read(pcp), (long)__this_cpu_read(pcp), \ +(long)(expected), (long)(expected)); \ + } while (0) + +static DEFINE_PER_CPU(long, long_counter); +static DEFINE_PER_CPU(unsigned long, ulong_counter); + +static int __init percpu_test_init(void) +{ + /* +* volatile prevents compiler from optimizing it uses, otherwise the +* +ul_one and -ul_one below would replace with inc/dec instructions. 
+*/ + volatile unsigned int ui_one = 1; + long l = 0; + unsigned long ul = 0; + + pr_info("percpu test start\n"); + + preempt_disable(); + + l += -1; + __this_cpu_add(long_counter, -1); + CHECK(l, long_counter, -1); + + l += 1; + __this_cpu_add(long_counter, 1); + CHECK(l, long_counter, 0); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += 1UL; + __this_cpu_add(ulong_counter, 1UL); + CHECK(ul, ulong_counter, 1); + + ul += -1UL; + __this_cpu_add(ulong_counter, -1UL); + CHECK(ul, ulong_counter, 0); + + ul += -(unsigned long)1; + __this_cpu_add(ulong_counter, -(unsigned long)1); + CHECK(ul, ulong_counter, -1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= 1; + __this_cpu_dec(ulong_counter); + CHECK(ul, ulong_counter, 0x); + CHECK(ul, ulong_counter, -1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 0x1); + + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + __this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l += ui_one; + __this_cpu_add(long_counter, ui_one); + CHECK(l, long_counter, 1); + + l += -ui_one; + __this_cpu_add(long_counter, -ui_one); + CHECK(l, long_counter, 0x1); + + l = 0; + __this_cpu_write(long_counter, 0); + + l -= ui_one; + this_cpu_sub(long_counter, ui_one); + CHECK(l, long_counter, -1); + CHECK(l, long_counter, 0x); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul += ui_one; + __this_cpu_add(ulong_counter, ui_one); + CHECK(ul, ulong_counter, 1); + + ul = 0; + __this_cpu_write(ulong_counter, 0); + + ul -= ui_one; + __this_cpu_sub(ulong_counter, ui_one); + CHECK(ul, ulong_counter, -1); + CHECK(ul, ulong_counter, 0x); + + ul = 3; + __this_cpu_w
[PATCH v2 3/3] memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting
As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memory.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages 1048576 == 0x10,0000 == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters soon disappear with the entire memcg. But the memory.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_add(.., -nr_pages), which is subtly wrong because it negates the unsigned int nr_pages (1, or 512 for THP) rather than subtracting it from the signed long percpu counter, and the negation is not sign extended. When nr_pages=1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0xffff,ffff */ count is now 0x1,0000,0000 instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works once "percpu: fix this_cpu_sub() subtrahend casting for unsigneds" is applied to fix this_cpu_sub(). Signed-off-by: Greg Thelen Acked-by: Tejun Heo --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aa8185c..b7ace0f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3773,7 +3773,7 @@ void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, { /* Update stat data for mem_cgroup */ preempt_disable(); - __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_sub(from->stat->count[idx], nr_pages); __this_cpu_add(to->stat->count[idx], nr_pages); preempt_enable(); } -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ipc: introduce ipc_valid_object() helper to sort out IPC_RMID races
On Tue, Dec 17 2013, Rafael Aquini wrote: > After the locking semantics for the SysV IPC API got improved, a couple of > IPC_RMID race windows were opened because we ended up dropping the > 'kern_ipc_perm.deleted' check performed way down in ipc_lock(). > The spotted races got sorted out by re-introducing the old test within > the racy critical sections. > > This patch introduces ipc_valid_object() to consolidate the way we cope with > IPC_RMID races by using the same abstraction across the API implementation. > > Signed-off-by: Rafael Aquini Acked-by: Greg Thelen -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] ipc,shm: fix shm_file deletion races
When IPC_RMID races with other shm operations there's potential for use-after-free of the shm object's associated file (shm_file). Here's the race before this patch: TASK 1 TASK 2 -- -- shm_rmid() ipc_lock_object() shmctl() shp = shm_obtain_object_check() shm_destroy() shum_unlock() fput(shp->shm_file) ipc_lock_object() shmem_lock(shp->shm_file) The oops is caused because shm_destroy() calls fput() after dropping the ipc_lock. fput() clears the file's f_inode, f_path.dentry, and f_path.mnt, which causes various NULL pointer references in task 2. I reliably see the oops in task 2 if with shmlock, shmu This patch fixes the races by: 1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock(). 2) modify at risk operations to check shm_file while holding ipc_object_lock(). Example workloads, which each trigger oops... Workload 1: while true; do id=$(shmget 1 4096) shm_rmid $id & shmlock $id & wait done The oops stack shows accessing NULL f_inode due to racing fput: _raw_spin_lock shmem_lock SyS_shmctl Workload 2: while true; do id=$(shmget 1 4096) shmat $id 4096 & shm_rmid $id & wait done The oops stack is similar to workload 1 due to NULL f_inode: touch_atime shmem_mmap shm_mmap mmap_region do_mmap_pgoff do_shmat SyS_shmat Workload 3: while true; do id=$(shmget 1 4096) shmlock $id shm_rmid $id & shmunlock $id & wait done The oops stack shows second fput tripping on an NULL f_inode. The first fput() completed via from shm_destroy(), but a racing thread did a get_file() and queued this fput(): locks_remove_flock __fput fput task_work_run do_notify_resume int_signal Fixes: c2c737a0461e ("ipc,shm: shorten critical region for shmat") Fixes: 2caacaa82a51 ("ipc,shm: shorten critical region for shmctl") Signed-off-by: Greg Thelen Cc: # 3.10.17+ 3.11.6+ --- ipc/shm.c | 28 +++- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/ipc/shm.c b/ipc/shm.c index d69739610fd4..0bdf21c6814e 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -208,15 +208,18 @@ static void shm_open(struct vm_area_struct *vma) */ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp) { + struct file *shm_file; + + shm_file = shp->shm_file; + shp->shm_file = NULL; ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT; shm_rmid(ns, shp); shm_unlock(shp); - if (!is_file_hugepages(shp->shm_file)) - shmem_lock(shp->shm_file, 0, shp->mlock_user); + if (!is_file_hugepages(shm_file)) + shmem_lock(shm_file, 0, shp->mlock_user); else if (shp->mlock_user) - user_shm_unlock(file_inode(shp->shm_file)->i_size, - shp->mlock_user); - fput (shp->shm_file); + user_shm_unlock(file_inode(shm_file)->i_size, shp->mlock_user); + fput(shm_file); ipc_rcu_putref(shp, shm_rcu_free); } @@ -983,6 +986,13 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) } shm_file = shp->shm_file; + + /* check if shm_destroy() is tearing down shp */ + if (shm_file == NULL) { + err = -EIDRM; + goto out_unlock0; + } + if (is_file_hugepages(shm_file)) goto out_unlock0; @@ -1101,6 +,14 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr, goto out_unlock; ipc_lock_object(&shp->shm_perm); + + /* check if shm_destroy() is tearing down shp */ + if (shp->shm_file == NULL) { + ipc_unlock_object(&shp->shm_perm); + err = -EIDRM; + goto out_unlock; + } + path = shp->shm_file->f_path; path_get(&path); shp->shm_nattch++; -- 1.8.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
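A distilled, self-contained sketch of the idea behind the fix; every name here is an illustrative stand-in rather than the real IPC code. The destroy path detaches the file pointer while still holding the object lock and drops its reference only after unlocking, so racing paths can detect teardown by re-checking the pointer under the same lock:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_file { int refcount; };                     /* stand-in for struct file */

struct shm_obj {
        pthread_mutex_t lock;                           /* stand-in for ipc_lock_object() */
        struct fake_file *file;                         /* NULL once teardown has started */
};

static void fput_stub(struct fake_file *f)              /* stand-in for fput() */
{
        if (--f->refcount == 0)
                free(f);
}

static void obj_destroy(struct shm_obj *shp)            /* stand-in for shm_destroy() */
{
        struct fake_file *file;

        pthread_mutex_lock(&shp->lock);
        file = shp->file;
        shp->file = NULL;                               /* racing lockers now observe teardown */
        pthread_mutex_unlock(&shp->lock);

        fput_stub(file);                                /* final put happens outside the lock */
}

static int obj_use(struct shm_obj *shp)                 /* stand-in for the shmctl()/do_shmat() paths */
{
        int err = 0;

        pthread_mutex_lock(&shp->lock);
        if (!shp->file)
                err = -EIDRM;                           /* destroy is tearing us down, bail out */
        else
                shp->file->refcount++;                  /* still published, safe to take a reference */
        pthread_mutex_unlock(&shp->lock);
        return err;
}

int main(void)
{
        static struct shm_obj shp = { PTHREAD_MUTEX_INITIALIZER, NULL };

        shp.file = calloc(1, sizeof(*shp.file));
        shp.file->refcount = 1;

        printf("before destroy: %d\n", obj_use(&shp));  /* 0 */
        fput_stub(shp.file);                            /* drop the reference obj_use() took */
        obj_destroy(&shp);
        printf("after destroy:  %d\n", obj_use(&shp));  /* -EIDRM */
        return 0;
}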
Re: [patch 1/4] mm: memcontrol: reduce reclaim invocations for higher order requests
On Thu, Aug 07 2014, Johannes Weiner wrote: > On Thu, Aug 07, 2014 at 03:08:22PM +0200, Michal Hocko wrote: >> On Mon 04-08-14 17:14:54, Johannes Weiner wrote: >> > Instead of passing the request size to direct reclaim, memcg just >> > manually loops around reclaiming SWAP_CLUSTER_MAX pages until the >> > charge can succeed. That potentially wastes scan progress when huge >> > page allocations require multiple invocations, which always have to >> > restart from the default scan priority. >> > >> > Pass the request size as a reclaim target to direct reclaim and leave >> > it to that code to reach the goal. >> >> THP charge then will ask for 512 pages to be (direct) reclaimed. That >> is _a lot_ and I would expect long stalls to achieve this target. I >> would also expect quick priority drop down and potential over-reclaim >> for small and moderately sized memcgs (e.g. memcg with 1G worth of pages >> would need to drop down below DEF_PRIORITY-2 to have a chance to scan >> that many pages). All that done for a charge which can fallback to a >> single page charge. >> >> The current code is quite hostile to THP when we are close to the limit >> but solving this by introducing long stalls instead doesn't sound like a >> proper approach to me. > > THP latencies are actually the same when comparing high limit nr_pages > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim, although > system time is reduced with the high limit. > > High limit reclaim with SWAP_CLUSTER_MAX has better fault latency but > it doesn't actually contain the workload - with 1G high and a 4G load, > the consumption at the end of the run is 3.7G. > > So what I'm proposing works and is of equal quality from a THP POV. > This change is complicated enough when we stick to the facts, let's > not make up things based on gut feeling. I think that high order non THP page allocations also benefit from this. Such allocations don't have a small page fallback. This may be in flux, but linux-next shows me that: * mem_cgroup_reclaim() frees at least SWAP_CLUSTER_MAX (32) pages. * try_charge() calls mem_cgroup_reclaim() indefinitely for costly (3) or smaller orders assuming that something is reclaimed on each iteration. * try_charge() uses a loop of MEM_CGROUP_RECLAIM_RETRIES (5) for larger-than-costly orders. So for larger-than-costly allocations, try_charge() should be able to reclaim 160 (5*32) pages which satisfies an order:7 allocation. But for order:8+ allocations try_charge() and mem_cgroup_reclaim() are too eager to give up without something like this. So I think this patch is a step in the right direction. Coincidentally, we've been recently been experimenting with something like this. Though we didn't modify the interface between mem_cgroup_reclaim() and try_to_free_mem_cgroup_pages() - instead we looped within mem_cgroup_reclaim() until nr_pages of margin were found. But I have no objection the proposed plumbing of nr_pages all the way into try_to_free_mem_cgroup_pages(). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
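To make the arithmetic concrete, a trivial sketch using the two constants exactly as quoted in this message (not re-checked against any particular tree):

#include <stdio.h>

#define SWAP_CLUSTER_MAX                32      /* pages freed per mem_cgroup_reclaim() call, as quoted above */
#define MEM_CGROUP_RECLAIM_RETRIES      5       /* try_charge() retries for larger-than-costly orders, as quoted above */

int main(void)
{
        unsigned int budget = SWAP_CLUSTER_MAX * MEM_CGROUP_RECLAIM_RETRIES;   /* 160 pages */
        unsigned int order;

        for (order = 4; order <= 9; order++)
                printf("order-%u needs %3u pages: %s\n", order, 1U << order,
                       (1U << order) <= budget ? "covered by retries" : "likely premature give-up");
        return 0;
}

The crossover lands between order-7 (128 pages) and order-8 (256 pages), which is the boundary described above.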
Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
mho...@kernel.org wrote: > From: Michal Hocko > > Journal transaction might fail prematurely because the frozen_buffer > is allocated by GFP_NOFS request: > [ 72.440013] do_get_write_access: OOM for frozen_buffer > [ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: > Out of memory in __ext4_journal_get_write_access > [ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: > Out of memory > (...snipped) > [ 72.495559] do_get_write_access: OOM for frozen_buffer > [ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: > Out of memory in __ext4_journal_get_write_access > [ 72.496839] do_get_write_access: OOM for frozen_buffer > [ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: > Out of memory in __ext4_journal_get_write_access > [ 72.505766] Aborting journal on device sda1-8. > [ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only > > This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS > allocations upon OOM" because small GPF_NOFS allocations never failed. > This allocation seems essential for the journal and GFP_NOFS is too > restrictive to the memory allocator so let's use __GFP_NOFAIL here to > emulate the previous behavior. > > jbd code has the very same issue so let's do the same there as well. > > Signed-off-by: Michal Hocko > --- > fs/jbd/transaction.c | 11 +-- > fs/jbd2/transaction.c | 14 +++--- > 2 files changed, 4 insertions(+), 21 deletions(-) > > diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c > index 1695ba8334a2..bf7474deda2f 100644 > --- a/fs/jbd/transaction.c > +++ b/fs/jbd/transaction.c > @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct > journal_head *jh, > jbd_unlock_bh_state(bh); > frozen_buffer = > jbd_alloc(jh2bh(jh)->b_size, > - GFP_NOFS); > - if (!frozen_buffer) { > - printk(KERN_ERR > -"%s: OOM for frozen_buffer\n", > -__func__); > - JBUFFER_TRACE(jh, "oom!"); > - error = -ENOMEM; > - jbd_lock_bh_state(bh); > - goto done; > - } > + GFP_NOFS|__GFP_NOFAIL); > goto repeat; > } > jh->b_frozen_data = frozen_buffer; > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c > index ff2f2e6ad311..bff071e21553 100644 > --- a/fs/jbd2/transaction.c > +++ b/fs/jbd2/transaction.c > @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct > journal_head *jh, > jbd_unlock_bh_state(bh); > frozen_buffer = > jbd2_alloc(jh2bh(jh)->b_size, > - GFP_NOFS); > - if (!frozen_buffer) { > - printk(KERN_ERR > -"%s: OOM for frozen_buffer\n", > -__func__); > - JBUFFER_TRACE(jh, "oom!"); > - error = -ENOMEM; > - jbd_lock_bh_state(bh); > - goto done; > - } > + GFP_NOFS|__GFP_NOFAIL); > goto repeat; > } > jh->b_frozen_data = frozen_buffer; > @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, > struct buffer_head *bh) > > repeat: > if (!jh->b_committed_data) { > - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS); > + committed_data = jbd2_alloc(jh2bh(jh)->b_size, > + GFP_NOFS|__GFP_NOFAIL); > if (!committed_data) { > printk(KERN_ERR "%s: No memory for committed data\n", > __func__); Is this "if (!committed_data) {" check now dead code? I also see other similar suspected dead sites in the rest of the series. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
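A userspace sketch of the dead-code question, with a stand-in allocator; the only point is that an allocation carrying a nofail guarantee never returns NULL, so the retained NULL check cannot fire:

#include <stdio.h>
#include <stdlib.h>

/* stand-in for jbd2_alloc(size, GFP_NOFS | __GFP_NOFAIL): loops until it succeeds */
static void *alloc_nofail(size_t size)
{
        void *p;

        do {
                p = malloc(size);
        } while (!p);
        return p;
}

int main(void)
{
        void *committed_data = alloc_nofail(4096);

        if (!committed_data) {  /* dead branch: alloc_nofail() cannot return NULL */
                fprintf(stderr, "No memory for committed data\n");
                return 1;
        }
        free(committed_data);
        return 0;
}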
[PATCH] memcg: fix stale mem_cgroup_force_empty() comment
commit f61c42a7d911 ("memcg: remove tasks/children test from mem_cgroup_force_empty()") removed memory reparenting from the function. Fix the function's comment. Signed-off-by: Greg Thelen --- mm/memcontrol.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fe787f5c41bd..19fd76168a05 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2636,8 +2636,7 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg) } /* - * Reclaims as many pages from the given memcg as possible and moves - * the rest to the parent. + * Reclaims as many pages from the given memcg as possible. * * Caller is responsible for holding css reference for memcg. */ -- 2.8.0.rc3.226.g39d4020
Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd
On Fri, Oct 31 2014, Junjie Mao wrote: > When choosing a random address, the current implementation does not take into > account the reversed space for .bss and .brk sections. Thus the relocated > kernel > may overlap other components in memory. Here is an example of the overlap > from a > x86_64 kernel in qemu (the ranges of physical addresses are presented): > > Physical Address > > 0x0fe0 --++ <-- randomized base >/ | relocated kernel | >vmlinux.bin| (from vmlinux.bin) | > 0x1336d000(an ELF file) ++-- >\ || \ > 0x1376d870 --++ | > |relocs table| | > 0x13c1c2a8++ .bss and .brk > || | > 0x13ce6000++ | > || / > 0x13f77000| initrd |-- > || > 0x13fef374++ > > The initrd image will then be overwritten by the memset during early > initialization: > > [1.655204] Unpacking initramfs... > [1.662831] Initramfs unpacking failed: junk in compressed archive > > This patch prevents the above situation by requiring a larger space when > looking > for a random kernel base, so that existing logic can effectively avoids the > overlap. > > Fixes: 82fa9637a2 ("x86, kaslr: Select random position from e820 maps") > Reported-by: Fengguang Wu > Signed-off-by: Junjie Mao > [kees: switched to perl to avoid hex translation pain in mawk vs gawk] > [kees: calculated overlap without relocs table] > Signed-off-by: Kees Cook > Cc: sta...@vger.kernel.org > --- > This version updates the commit log only. > > Kees, please help review the documentation. Thanks! > > Best Regards > Junjie Mao [...] > diff --git a/arch/x86/tools/calc_run_size.pl b/arch/x86/tools/calc_run_size.pl > new file mode 100644 > index ..0b0b124d3ece > --- /dev/null > +++ b/arch/x86/tools/calc_run_size.pl > @@ -0,0 +1,30 @@ > +#!/usr/bin/perl > +# > +# Calculate the amount of space needed to run the kernel, including room for > +# the .bss and .brk sections. > +# > +# Usage: > +# objdump -h a.out | perl calc_run_size.pl > +use strict; > + > +my $mem_size = 0; > +my $file_offset = 0; > + > +my $sections=" *[0-9]+ \.(?:bss|brk) +"; > +while (<>) { > + if (/^$sections([0-9a-f]+) +(?:[0-9a-f]+ +){2}([0-9a-f]+)/) { > + my $size = hex($1); > + my $offset = hex($2); > + $mem_size += $size; > + if ($file_offset == 0) { > + $file_offset = $offset; > + } elsif ($file_offset != $offset) { > + die ".bss and .brk lack common file offset\n"; > + } > + } > +} > + > +if ($file_offset == 0) { > + die "Never found .bss or .brk file offset\n"; > +} > +printf("%d\n", $mem_size + $file_offset); Given that bss and brk are nobits (i.e. only ALLOC) sections, does file_offset make sense as a load address. This fails with gold: $ git checkout v3.18-rc5 $ make # with gold [...] ..bss and .brk lack common file offset ..bss and .brk lack common file offset ..bss and .brk lack common file offset ..bss and .brk lack common file offset MKPIGGY arch/x86/boot/compressed/piggy.S Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1 make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2 make: *** [bzImage] Error 2 In ld.bfd brk/bss file_offsets match, but they differ with ld.gold: $ objdump -h vmlinux.ld [...] 0 .text 00818bb3 8100 0100 0020 2**12 CONTENTS, ALLOC, LOAD, READONLY, CODE [...] 26 .bss 000e 81fe8000 01fe8000 013e8000 2**12 ALLOC 27 .brk 00026000 820c8000 020c8000 013e8000 2**0 ALLOC $ objdump -h vmlinux.ld | perl arch/x86/tools/calc_run_size.pl 21946368 # aka 0x14ee000 $ objdump -h vmlinux.gold [...] 0 .text 00818bb3 8100 0100 1000 2**12 CONTENTS, ALLOC, LOAD, READONLY, CODE [...] 
26 .bss 000e 81feb000 01feb000 00e9 2**12 ALLOC 27 .brk 00026000 820cb000 020cb000 00f7 2**0 ALLOC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3] x86, kaslr: Prevent .bss from overlaping initrd
On Mon, Nov 17 2014, Greg Thelen wrote: [...] > Given that bss and brk are nobits (i.e. only ALLOC) sections, does > file_offset make sense as a load address. This fails with gold: > > $ git checkout v3.18-rc5 > $ make # with gold > [...] > ..bss and .brk lack common file offset > ..bss and .brk lack common file offset > ..bss and .brk lack common file offset > ..bss and .brk lack common file offset > MKPIGGY arch/x86/boot/compressed/piggy.S > Usage: arch/x86/boot/compressed/mkpiggy compressed_file run_size > make[2]: *** [arch/x86/boot/compressed/piggy.S] Error 1 > make[1]: *** [arch/x86/boot/compressed/vmlinux] Error 2 > make: *** [bzImage] Error 2 [...] I just saw http://www.spinics.net/lists/kernel/msg1869328.html which fixes things for me. Sorry for the noise. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] mm, hugetlb: abort __get_user_pages if current has been oom killed
On Mon, Mar 09 2015, David Rientjes wrote: > If __get_user_pages() is faulting a significant number of hugetlb pages, > usually as the result of mmap(MAP_LOCKED), it can potentially allocate a > very large amount of memory. > > If the process has been oom killed, this will cause a lot of memory to > be overcharged to its memcg since it has access to memory reserves or > could potentially deplete all system memory reserves. s/memcg/hugetlb_cgroup/ but I don't think hugetlb has any fatal_signal_pending() based overcharging. I no objection to the patch, but this doesn't seems like a cgroup thing, so the commit log could stand a tweak. > In the same way that commit 4779280d1ea4 ("mm: make get_user_pages() > interruptible") aborted for pending SIGKILLs when faulting non-hugetlb > memory, based on the premise of commit 462e00cc7151 ("oom: stop > allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now > terminate when the process has been oom killed. > > Cc: Naoya Horiguchi > Cc: Davidlohr Bueso > Cc: "Kirill A. Shutemov" > Signed-off-by: David Rientjes > --- > v2: check signal inside follow_huegtlb_page() loop per Kirill > > mm/hugetlb.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3276,6 +3276,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct > vm_area_struct *vma, > struct page *page; > > /* > + * If we have a pending SIGKILL, don't keep faulting pages and > + * potentially allocating memory. > + */ > + if (unlikely(fatal_signal_pending(current))) { > + remainder = 0; > + break; > + } > + > + /* >* Some archs (sparc64, sh*) have multiple pte_ts to >* each hugepage. We have to make sure we get the >* first, for the page indexing below to work. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3] mm, hugetlb: abort __get_user_pages if current has been oom killed
On Mon, Mar 09 2015, David Rientjes wrote: > If __get_user_pages() is faulting a significant number of hugetlb pages, > usually as the result of mmap(MAP_LOCKED), it can potentially allocate a > very large amount of memory. > > If the process has been oom killed, this will cause a lot of memory to > potentially deplete memory reserves. > > In the same way that commit 4779280d1ea4 ("mm: make get_user_pages() > interruptible") aborted for pending SIGKILLs when faulting non-hugetlb > memory, based on the premise of commit 462e00cc7151 ("oom: stop > allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now > terminate when the process has been oom killed. > > Cc: Greg Thelen > Cc: Naoya Horiguchi > Cc: Davidlohr Bueso > Acked-by: "Kirill A. Shutemov" > Signed-off-by: David Rientjes Looks good. Acked-by: "Greg Thelen" -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Fri, Feb 6, 2015 at 6:17 AM, Tejun Heo wrote: > Hello, Greg. > > On Thu, Feb 05, 2015 at 04:03:34PM -0800, Greg Thelen wrote: >> So this is a system which charges all cgroups using a shared inode >> (recharge on read) for all resident pages of that shared inode. There's >> only one copy of the page in memory on just one LRU, but the page may be >> charged to multiple container's (shared_)usage. > > Yeap. > >> Perhaps I missed it, but what happens when a child's limit is >> insufficient to accept all pages shared by its siblings? Example >> starting with 2M cached of a shared file: >> >> A >> +-B(usage=2M lim=3M hosted_usage=2M) >> +-C (usage=0 lim=2M shared_usage=2M) >> +-D (usage=0 lim=2M shared_usage=2M) >> \-E (usage=0 lim=1M shared_usage=0) >> >> If E faults in a new 4K page within the shared file, then E is a sharing >> participant so it'd be charged the 2M+4K, which pushes E over it's >> limit. > > OOM? It shouldn't be participating in sharing of an inode if it can't > match others' protection on the inode, I think. What we're doing now > w/ page based charging is kinda unfair because in the situations like > above the one under pressure can end up siphoning off of the larger > cgroups' protection if they actually use overlapping areas; however, > for disjoint areas, per-page charging would behave correctly. > > So, this part comes down to the same question - whether multiple > cgroups accessing disjoint areas of a single inode is an important > enough use case. If we say yes to that, we better make writeback > support that too. If cgroups are about isolation then writing to shared files should be rare, so I'm willing to say that we don't need to handle shared writers well. Shared readers seem like a more valuable use cases (thin provisioning). I'm getting overwhelmed with the thought exercise of automatically moving inodes to common ancestors and back charging the sharers for shared_usage. I haven't wrapped my head around how these shared data pages will get protected. It seems like they'd no longer be protected by child min watermarks. So I know this thread opened with the claim "both memcg and blkcg must be looking at the same picture. Deviating them is highly likely to lead to long-term issues forcing us to look at this again anyway, only with far more baggage." But I'm still wondering if the following is simpler: (1) leave memcg as a per page controller. (2) maintain a per inode i_memcg which is set to the common dirtying ancestor. If not shared then it'll point to the memcg that the page was charged to. (3) when memcg dirtying page pressure is seen, walk up the cgroup tree writing dirty inodes, this will write shared inodes using blkcg priority of the respective levels. (4) background limit wb_check_background_flush() and time based wb_check_old_data_flush() can feel free to attack shared inodes to hopefully restore them to non-shared state. For non-shared inodes, this should behave the same. For shared inodes it should only affect those in the hierarchy which is sharing. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Wed, Feb 04 2015, Tejun Heo wrote: > Hello, > > On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote: >> If a machine has several top level memcg trying to get some form of >> isolation (using low, min, soft limit) then a shared libc will be >> moved to the root memcg where it's not protected from global memory >> pressure. At least with the current per page accounting such shared >> pages often land into some protected memcg. > > Yes, it becomes interesting with the low limit as the pressure > direction is reversed but at the same time overcommitting low limits > doesn't lead to a sane setup to begin with as it's asking for global > OOMs anyway, which means that things like libc would end up competing > at least fairly with other pages for global pressure and should stay > in memory under most circumstances, which may or may not be > sufficient. I agree. Clarification... I don't plan to overcommit low or min limits. On machines without overcommited min limits the existing system offers some protection for shared libs from global reclaim. Pushing them to root doesn't. > Hmm need to think more about it but this only becomes a problem > with the root cgroup because it doesn't have min setting which is > expected to be inclusive of all descendants, right? Maybe the right > thing to do here is treating the inodes which get pushed to the root > as a special case and we can implement a mechanism where the root is > effectively borrowing from the mins of its children which doesn't have > to be completely correct - e.g. just charge it against all children > repeatedly and if any has min protection, put it under min protection. > IOW, make it the baseload for all of them. I think the linux-next low (and the TBD min) limits also have the problem for more than just the root memcg. I'm thinking of a 2M file shared between C and D below. The file will be charged to common parent B. A +-B(usage=2M lim=3M min=2M) +-C (usage=0 lim=2M min=1M shared_usage=2M) +-D (usage=0 lim=2M min=1M shared_usage=2M) \-E (usage=0 lim=2M min=0) The problem arises if A/B/E allocates more than 1M of private reclaimable file data. This pushes A/B into reclaim which will reclaim both the shared file from A/B and private file from A/B/E. In contrast, the current per-page memcg would've protected the shared file in either C or D leaving A/B reclaim to only attack A/B/E. Pinning the shared file to either C or D, using TBD policy such as mount option, would solve this for tightly shared files. But for wide fanout file (libc) the admin would need to assign a global bucket and this would be a pain to size due to various job requirements. >> If two cgroups collude they can use more memory than their limit and >> oom the entire machine. Admittedly the current per-page system isn't >> perfect because deleting a memcg which contains mlocked memory >> (referenced by a remote memcg) moves the mlocked memory to root >> resulting in the same issue. But I'd argue this is more likely with > > Hmmm... why does it do that? Can you point me to where it's > happening? My mistake, I was thinking of older kernels which reparent memory. Though I can't say v3.19-rc7 handles this collusion any better. Instead of reparenting the mlocked memory, it's left in an invisible (offline) memcg. Unlike older kernels the memory doesn't appear in root/memory.stat[unevictable], instead it buried in root/memory.stat[total_unevictable] which includes mlocked memory in visible (online) and invisible (offline) children. 
>> the RFC because it doesn't involve the cgroup deletion/reparenting. A > > One approach could be expanding on the forementioned scheme and make > all sharing cgroups to get charged for the shared inodes they're > using, which should render such collusions entirely pointless. > e.g. let's say we start with the following. > > A (usage=48M) > +-B (usage=16M) > \-C (usage=32M) > > And let's say, C starts accessing an inode which is 8M and currently > associated with B. > > A (usage=48M, hosted= 8M) > +-B (usage= 8M, shared= 8M) > \-C (usage=32M, shared= 8M) > > The only extra charging that we'd be doing is charing C with extra > 8M. Let's say another cgroup D gets created and uses 4M. > > A (usage=56M, hosted= 8M) > +-B (usage= 8M, shared= 8M) > +-C (usage=32M, shared= 8M) > \-D (usage= 8M) > > and it also accesses the inode. > > A (usage=56M, hosted= 8M) > +-B (usage= 8M, shared= 8M) > +-C (usage=32M, shared= 8M) > \-D (usage= 8M, shared= 8M) > > We
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Thu, Feb 05 2015, Tejun Heo wrote: > Hello, Greg. > > On Wed, Feb 04, 2015 at 03:51:01PM -0800, Greg Thelen wrote: >> I think the linux-next low (and the TBD min) limits also have the >> problem for more than just the root memcg. I'm thinking of a 2M file >> shared between C and D below. The file will be charged to common parent >> B. >> >> A >> +-B(usage=2M lim=3M min=2M) >>+-C (usage=0 lim=2M min=1M shared_usage=2M) >>+-D (usage=0 lim=2M min=1M shared_usage=2M) >>\-E (usage=0 lim=2M min=0) >> >> The problem arises if A/B/E allocates more than 1M of private >> reclaimable file data. This pushes A/B into reclaim which will reclaim >> both the shared file from A/B and private file from A/B/E. In contrast, >> the current per-page memcg would've protected the shared file in either >> C or D leaving A/B reclaim to only attack A/B/E. >> >> Pinning the shared file to either C or D, using TBD policy such as mount >> option, would solve this for tightly shared files. But for wide fanout >> file (libc) the admin would need to assign a global bucket and this >> would be a pain to size due to various job requirements. > > Shouldn't we be able to handle it the same way as I proposed for > handling sharing? The above would look like > > A > +-B(usage=2M lim=3M min=2M hosted_usage=2M) > +-C (usage=0 lim=2M min=1M shared_usage=2M) > +-D (usage=0 lim=2M min=1M shared_usage=2M) > \-E (usage=0 lim=2M min=0) > > Now, we don't wanna use B's min verbatim on the hosted inodes shared > by children but we're unconditionally charging the shared amount to > all sharing children, which means that we're eating into the min > settings of all participating children, so, we should be able to use > sum of all sharing children's min-covered amount as the inode's min, > which of course is to be contained inside the min of the parent. > > Above, we're charging 2M to C and D, each of which has 1M min which is > being consumed by the shared charge (the shared part won't get > reclaimed from the internal pressure of children, so we're really > taking that part away from it). Summing them up, the shared inode > would have 2M protection which is honored as long as B as a whole is > under its 3M limit. This is similar to creating a dedicated child for > each shared resource for low limits. The downside is that we end up > guarding the shared inodes more than non-shared ones, but, after all, > we're charging it to everybody who's using it. > > Would something like this work? Maybe, but I want to understand more about how pressure works in the child. As C (or D) allocates non shared memory does it perform reclaim to ensure that its (C.usage + C.shared_usage < C.lim). Given C's shared_usage is linked into B.LRU it wouldn't be naturally reclaimable by C. Are you thinking that charge failures on cgroups with non zero shared_usage would, as needed, induce reclaim of parent's hosted_usage? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
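A toy model of the charging rule under discussion; nothing like this exists in the kernel and the names are made up. It only illustrates why a child whose limit must cover usage plus shared_usage needs some way to reclaim the parent-hosted shared pages it is charged for:

#include <stdio.h>

struct toy_memcg {
        long usage;             /* pages charged directly to this cgroup */
        long shared_usage;      /* parent-hosted shared pages charged to it */
        long limit;
};

/* returns 0 on success, -1 if the child would need to reclaim first */
static int toy_try_charge(struct toy_memcg *c, long nr_pages)
{
        if (c->usage + c->shared_usage + nr_pages > c->limit)
                return -1;      /* must reclaim private or parent-hosted pages */
        c->usage += nr_pages;
        return 0;
}

int main(void)
{
        /* C from the example tree: lim=2M and 2M of shared_usage, in 4K pages */
        struct toy_memcg c = { .usage = 0, .shared_usage = 512, .limit = 512 };

        printf("charge 1 page: %d\n", toy_try_charge(&c, 1));   /* -1: already at the limit */
        return 0;
}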
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Thu, Feb 05 2015, Tejun Heo wrote: > Hey, > > On Thu, Feb 05, 2015 at 02:05:19PM -0800, Greg Thelen wrote: >> >A >> >+-B(usage=2M lim=3M min=2M hosted_usage=2M) >> > +-C (usage=0 lim=2M min=1M shared_usage=2M) >> > +-D (usage=0 lim=2M min=1M shared_usage=2M) >> > \-E (usage=0 lim=2M min=0) > ... >> Maybe, but I want to understand more about how pressure works in the >> child. As C (or D) allocates non shared memory does it perform reclaim >> to ensure that its (C.usage + C.shared_usage < C.lim). Given C's > > Yes. > >> shared_usage is linked into B.LRU it wouldn't be naturally reclaimable >> by C. Are you thinking that charge failures on cgroups with non zero >> shared_usage would, as needed, induce reclaim of parent's hosted_usage? > > Hmmm I'm not really sure but why not? If we properly account for > the low protection when pushing inodes to the parent, I don't think > it'd break anything. IOW, allow the amount beyond the sum of low > limits to be reclaimed when one of the sharers is under pressure. > > Thanks. I'm not saying that it'd break anything. I think it's required that children perform reclaim on shared data hosted in the parent. The child is limited by shared_usage, so it needs ability to reclaim it. So I think we're in agreement. Child will reclaim parent's hosted_usage when the child is charged for shared_usage. Ideally the only parental memory reclaimed in this situation would be shared. But I think (though I can't claim to have followed the new memcg philosophy discussions) that internal nodes in the cgroup tree (i.e. parents) do not have any resources charged directly to them. All resources are charged to leaf cgroups which linger until resources are uncharged. Thus the LRUs of parent will only contain hosted (shared) memory. This thankfully focus parental reclaim easy on shared pages. Child pressure will, unfortunately, reclaim shared pages used by any container. But if shared pages were charged all sharing containers, then it will help relieve pressure in the caller. So this is a system which charges all cgroups using a shared inode (recharge on read) for all resident pages of that shared inode. There's only one copy of the page in memory on just one LRU, but the page may be charged to multiple container's (shared_)usage. Perhaps I missed it, but what happens when a child's limit is insufficient to accept all pages shared by its siblings? Example starting with 2M cached of a shared file: A +-B(usage=2M lim=3M hosted_usage=2M) +-C (usage=0 lim=2M shared_usage=2M) +-D (usage=0 lim=2M shared_usage=2M) \-E (usage=0 lim=1M shared_usage=0) If E faults in a new 4K page within the shared file, then E is a sharing participant so it'd be charged the 2M+4K, which pushes E over it's limit. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Mon, Feb 2, 2015 at 11:46 AM, Tejun Heo wrote: > Hey, > > On Mon, Feb 02, 2015 at 10:26:44PM +0300, Konstantin Khlebnikov wrote: > >> Keeping shared inodes in common ancestor is reasonable. >> We could schedule asynchronous moving when somebody opens or mmaps >> inode from outside of its current cgroup. But it's not clear when >> inode should be moved into opposite direction: when inode should >> become private and how detect if it's no longer shared. >> >> For example each inode could keep yet another pointer to memcg where >> it will track subtree of cgroups where it was accessed in past 5 >> minutes or so. And sometimes that informations goes into moving thread. >> >> Actually I don't see other options except that time-based estimation: >> tracking all cgroups for each inode is too expensive, moving pages >> from one lru to another is expensive too. So, moving inodes back and >> forth at each access from the outside world is not an option. >> That should be rare operation which runs in background or in reclaimer. > > Right, what strategy to use for migration is up for debate, even for > moving to the common ancestor. e.g. should we do that on the first > access? In the other direction, it get more interesting. Let's say > if we decide to move back an inode to a descendant, what if that > triggers OOM condition? Do we still go through it and cause OOM in > the target? Do we even want automatic moving in this direction? > > For explicit cases, userland can do FADV_DONTNEED, I suppose. > > Thanks. > > -- > tejun I don't have any killer objections, most of my worries are isolation concerns. If a machine has several top level memcg trying to get some form of isolation (using low, min, soft limit) then a shared libc will be moved to the root memcg where it's not protected from global memory pressure. At least with the current per page accounting such shared pages often land into some protected memcg. If two cgroups collude they can use more memory than their limit and oom the entire machine. Admittedly the current per-page system isn't perfect because deleting a memcg which contains mlocked memory (referenced by a remote memcg) moves the mlocked memory to root resulting in the same issue. But I'd argue this is more likely with the RFC because it doesn't involve the cgroup deletion/reparenting. A possible tweak to shore up the current system is to move such mlocked pages to the memcg of the surviving locker. When the machine is oom it's often nice to examine memcg state to determine which container is using the memory. Tracking down who's contributing to a shared container is non-trivial. I actually have a set of patches which add a memcg=M mount option to memory backed file systems. I was planning on proposing them, regardless of this RFC, and this discussion makes them even more appealing. If we go in this direction, then we'd need a similar notion for disk based filesystems. As Konstantin suggested, it'd be really nice to specify charge policy on a per file, or directory, or bind mount basis. This allows shared files to be deterministically charged to a known container. We'd need to flesh out the policies: e.g. if two bind mound each specify different charge targets for the same inode, I guess we just pick one. Though the nature of this catch-all shared container is strange. 
Presumably a machine manager would need to create it as an unlimited container (or at least as big as the sum of all shared files) so that any app which decided it wants to mlock all shared files has a way to without ooming the shared container. In the current per-page approach it's possible to lock shared libs. But the machine manager would need to decide how much system ram to set aside for this catch-all shared container. When there's large incidental sharing, then things get sticky. A periodic filesystem scanner (e.g. virus scanner, or grep foo -r /) in a small container would pull all pages to the root memcg where they are exposed to root pressure which breaks isolation. This is concerning. Perhaps the such accesses could be decorated with (O_NO_MOVEMEM). So this RFC change will introduce significant change to user space machine managers and perturb isolation. Is the resulting system better? It's not clear, it's the devil know vs devil unknown. Maybe it'd be easier if the memcg's I'm talking about were not allowed to share page cache (aka copy-on-read) even for files which are jointly visible. That would provide today's interface while avoiding the problematic sharing. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Tue, Feb 10, 2015 at 6:19 PM, Tejun Heo wrote: > Hello, again. > > On Sat, Feb 07, 2015 at 09:38:39AM -0500, Tejun Heo wrote: >> If we can argue that memcg and blkcg having different views is >> meaningful and characterize and justify the behaviors stemming from >> the deviation, sure, that'd be fine, but I don't think we have that as >> of now. > > If we assume that memcg and blkcg having different views is something > which represents an acceptable compromise considering the use cases > and implementation convenience - IOW, if we assume that read-sharing > is something which can happen regularly while write sharing is a > corner case and that while not completely correct the existing > self-corrective behavior from tracking ownership per-page at the point > of instantiation is good enough (as a memcg under pressure is likely > to give up shared pages to be re-instantiated by another sharer w/ > more budget), we need to do the impedance matching between memcg and > blkcg at the writeback layer. > > The main issue there is that the last chain of IO pressure propagation > is realized by making individual dirtying tasks to converge on a > common target dirty ratio point which naturally depending on those > tasks seeing the same picture in terms of the current write bandwidth > and available memory and how much of it is dirty. Tasks dirtying > pages belonging to the same memcg while some of them are mostly being > written out by a different blkcg would wreck the mechanism. It won't > be difficult for one subset to make the other to consider themselves > under severe IO pressure when there actually isn't one in that group > possibly stalling and starving those tasks unduly. At more basic > level, it's just wrong for one group to be writing out significant > amount for another. > > These issues can persist indefinitely if we follow the same > instantiator-owns rule for inode writebacks. Even if we reset the > ownership when an inode becomes clea, it wouldn't work as it can be > dirtied over and over again while under writeback, and when things > like this happen, the behavior may become extremely difficult to > understand or characterize. We don't have visibility into how > individual pages of an inode get distributed across multiple cgroups, > who's currently responsible for writing back a specific inode or how > dirty ratio mechanism is behaving in the face of the unexpected > combination of parameters. > > Even if we assume that write sharing is a fringe case, we need > something better than first-whatever rule when choosing which blkcg is > responsible for writing a shared inode out. There needs to be a > constant corrective pressure so that incidental and temporary sharings > don't end up screwing up the mechanism for an extended period of time. > > Greg mentioned chossing the closest ancestor of the sharers, which > basically pushes inode sharing policy implmentation down to writeback > from memcg. This could work but we end up with the same collusion > problem as when this is used for memcg and it's even more difficult to > solve this at writeback layer - we'd have to communicate the shared > state all the way down to block layer and then implement a mechanism > there to take corrective measures and even after that we're likely to > end up with prolonged state where dirty ratio propagation is > essentially broken as the dirtier and writer would be seeing different > pictures. > > So, based on the assumption that write sharings are mostly incidental > and temporary (ie. 
we're basically declaring that we don't support > persistent write sharing), how about something like the following? > > 1. memcg contiues per-page tracking. > > 2. Each inode is associated with a single blkcg at a given time and >written out by that blkcg. > > 3. While writing back, if the number of pages from foreign memcg's is >higher than certain ratio of total written pages, the inode is >marked as disowned and the writeback instance is optionally >terminated early. e.g. if the ratio of foreign pages is over 50% >after writing out the number of pages matching 5s worth of write >bandwidth for the bdi, mark the inode as disowned. > > 4. On the following dirtying of the inode, the inode is associated >with the matching blkcg of the dirtied page. Note that this could >be the next cycle as the inode could already have been marked dirty >by the time the above condition triggered. In that case, the >following writeback would be terminated early too. > > This should provide sufficient corrective pressure so that incidental > and temporary sharing of an inode doesn't become a persistent issue > while keeping the complexity necessary for implementing such pressure > fairly minimal and self-contained. Also, the changes necessary for > individual filesystems would be minimal. > > I think this should work well enough as long as the forementioned > assumptions are true - IOW, if we maintain tha
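[For readers skimming the proposal above, a schematic sketch of the disown heuristic in steps 3-4. This is plain C, not kernel code; the 50% ratio and the "5s worth of write bandwidth" batch come from the proposal, all names and everything else are assumptions.]

#include <stdbool.h>

struct wb_inode_stats {
	unsigned long own_pages;	/* pages charged to the owning blkcg's memcg */
	unsigned long foreign_pages;	/* pages charged to other memcgs */
	bool disowned;			/* reassociate on the next dirtying */
};

/* called for each page written back during one writeback pass */
static void wb_account_page(struct wb_inode_stats *s, bool foreign)
{
	if (foreign)
		s->foreign_pages++;
	else
		s->own_pages++;
}

/*
 * Called after writing out roughly 5s worth of the bdi's write bandwidth.
 * Returns true if the writeback instance may be terminated early; a
 * disowned inode switches to the dirtier's blkcg the next time it is
 * dirtied (possibly only in the following writeback cycle).
 */
static bool wb_check_disown(struct wb_inode_stats *s)
{
	unsigned long total = s->own_pages + s->foreign_pages;

	if (total && s->foreign_pages * 2 > total) {	/* >50% foreign */
		s->disowned = true;
		return true;
	}
	return false;
}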
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Wed, Feb 11, 2015 at 12:33 PM, Tejun Heo wrote: [...] >> page count to throttle based on blkcg's bandwidth. Note: memcg >> doesn't yet have dirty page counts, but several of us have made >> attempts at adding the counters. And it shouldn't be hard to get them >> merged. > > Can you please post those? Will do. Rebasing and testing needed, so it won't be today.
Re: [RFC] Making memcg track ownership per address_space or anon_vma
On Thu, Jan 29 2015, Tejun Heo wrote: > Hello, > > Since the cgroup writeback patchset[1] have been posted, several > people brought up concerns about the complexity of allowing an inode > to be dirtied against multiple cgroups is necessary for the purpose of > writeback and it is true that a significant amount of complexity (note > that bdi still needs to be split, so it's still not trivial) can be > removed if we assume that an inode always belongs to one cgroup for > the purpose of writeback. > > However, as mentioned before, this issue is directly linked to whether > memcg needs to track the memory ownership per-page. If there are > valid use cases where the pages of an inode must be tracked to be > owned by different cgroups, cgroup writeback must be able to handle > that situation properly. If there aren't no such cases, the cgroup > writeback support can be simplified but again we should put memcg on > the same cadence and enforce per-inode (or per-anon_vma) ownership > from the beginning. The conclusion can be either way - per-page or > per-inode - but both memcg and blkcg must be looking at the same > picture. Deviating them is highly likely to lead to long-term issues > forcing us to look at this again anyway, only with far more baggage. > > One thing to note is that the per-page tracking which is currently > employed by memcg seems to have been born more out of conveninence > rather than requirements for any actual use cases. Per-page ownership > makes sense iff pages of an inode have to be associated with different > cgroups - IOW, when an inode is accessed by multiple cgroups; however, > currently, memcg assigns a page to its instantiating memcg and leaves > it at that till the page is released. This means that if a page is > instantiated by one cgroup and then subsequently accessed only by a > different cgroup, whether the page's charge gets moved to the cgroup > which is actively using it is purely incidental. If the page gets > reclaimed and released at some point, it'll be moved. If not, it > won't. > > AFAICS, the only case where the current per-page accounting works > properly is when disjoint sections of an inode are used by different > cgroups and the whole thing hinges on whether this use case justifies > all the added overhead including page->mem_cgroup pointer and the > extra complexity in the writeback layer. FWIW, I'm doubtful. > Johannes, Michal, Greg, what do you guys think? > > If the above use case - a huge file being actively accssed disjointly > by multiple cgroups - isn't significant enough and there aren't other > use cases that I missed which can benefit from the per-page tracking > that's currently implemented, it'd be logical to switch to per-inode > (or per-anon_vma or per-slab) ownership tracking. For the short term, > even just adding an extra ownership information to those containing > objects and inherting those to page->mem_cgroup could work although > it'd definitely be beneficial to eventually get rid of > page->mem_cgroup. > > As with per-page, when the ownership terminates is debatable w/ > per-inode tracking. Also, supporting some form of shared accounting > across different cgroups may be useful (e.g. 
shared library's memory > being equally split among anyone who accesses it); however, these > aren't likely to be major and trying to do something smart may affect > other use cases adversely, so it'd probably be best to just keep it > dumb and clear the ownership when the inode loses all pages (a cgroup > can disown such inode through FADV_DONTNEED if necessary). > > What do you guys think? If making memcg track ownership at per-inode > level, even for just the unified hierarchy, is the direction we can > take, I'll go ahead and simplify the cgroup writeback patchset. > > Thanks. I find simplification appealing. But I'm not sure it will fly, if for no other reason than the shared accountings. I'm ignoring intentional sharing, used by carefully crafted apps, and just thinking about incidental sharing (e.g. libc). Example: $ mkdir small $ echo 1M > small/memory.limit_in_bytes $ (echo $BASHPID > small/cgroup.procs && exec sleep 1h) & $ mkdir big $ echo 10G > big/memory.limit_in_bytes $ (echo $BASHPID > big/cgroup.procs && exec mlockall_database 1h) & Assuming big/mlockall_database mlocks all of libc, then it will oom kill the small memcg because libc is owned by small due to it having touched it first. It'd be hard to figure out what small did wrong to deserve the oom kill. FWIW we've been using memcg writeback where inodes have a memcg writeback owner. Once multiple memcgs write to an inode then the inode becomes writeback shared, which makes it more likely to be written. Once cleaned the inode is then again able to be privately owned: https://lkml.org/lkml/2011/8/17/200
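[A rough sketch of the ownership rules described in that last paragraph, assuming field and helper names; this is not the code from the linked 2011 patchset. One memcg owns the inode's writeback until a second memcg dirties it, at which point the inode becomes "shared" and is written back more aggressively; a cleaned inode drops back to unowned so it can be privately owned again.]

struct inode_wb_owner {
	int owner_memcg_id;	/* 0: unowned */
	int shared;		/* more than one dirtying memcg seen */
};

static void inode_dirtied_by(struct inode_wb_owner *o, int memcg_id)
{
	if (!o->owner_memcg_id)
		o->owner_memcg_id = memcg_id;	/* first dirtier becomes owner */
	else if (o->owner_memcg_id != memcg_id)
		o->shared = 1;			/* candidate for early writeback */
}

static void inode_cleaned(struct inode_wb_owner *o)
{
	/* all dirty pages written back: allow a new private owner */
	o->owner_memcg_id = 0;
	o->shared = 0;
}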
Re: [GIT PULL] ext4 bug fixes for 4.6
Theodore Ts'o wrote: > The following changes since commit 243d50678583100855862bc084b8b307eea67f68: > > Merge branch 'overlayfs-linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs (2016-03-22 > 13:11:15 -0700) > n> are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git > tags/ext4_for_linus_stable > > for you to fetch changes up to c325a67c72903e1cc30e990a15ce745bda0dbfde: > > ext4: ignore quota mount options if the quota feature is enabled > (2016-04-03 17:03:37 -0400) > > > These changes contains a fix for overlayfs interacting with some > (badly behaved) dentry code in various file systems. These have been > reviewed by Al and the respective file system mtinainers and are going > through the ext4 tree for convenience. > > This also has a few ext4 encryption bug fixes that were discovered in > Android testing (yes, we will need to get these sync'ed up with the > fs/crypto code; I'll take care of that). It also has some bug fixes > and a change to ignore the legacy quota options to allow for xfstests > regression testing of ext4's internal quota feature and to be more > consistent with how xfs handles this case. > > > Dan Carpenter (1): > ext4 crypto: fix some error handling > > Filipe Manana (1): > btrfs: fix crash/invalid memory access on fsync when using overlayfs > > Jan Kara (1): > ext4: retry block allocation for failed DIO and DAX writes > > Miklos Szeredi (4): > fs: add file_dentry() > nfs: use file_dentry() > ext4: use dget_parent() in ext4_file_open() > ext4: use file_dentry() > > Theodore Ts'o (7): > ext4: check if in-inode xattr is corrupted in > ext4_expand_extra_isize_ea() > ext4 crypto: don't let data integrity writebacks fail with ENOMEM > ext4 crypto: use dget_parent() in ext4_d_revalidate() > ext4: allow readdir()'s of large empty directories to be interrupted Ted, I've been testing 5b5b7fd185e9 (linus/master) and seeing that interrupted readdir() now returns duplicate dirents. Reverting commit 1028b55bafb7 ("ext4: allow readdir()'s of large empty directories to be interrupted") avoids duplicates. On 5b5b7fd185e9 a SIGPROF to the test program below occasionally causes "already seen name" error. On older kernels, it runs indefinitely without error. 
/*
  mkdir /tmp/foo
  cd /tmp/foo
  for i in $(seq 0 599); do touch $i; done
  /tmp/scanner &
  kill -PROF %1
*/
#include <dirent.h>
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void handler(int unused) { }

int main() {
	enum { count = 600 };
	char seen[count];
	struct dirent *ent;
	DIR *dir;
	struct sigaction sa = {0};

	sa.sa_handler = handler;
	if (sigemptyset(&sa.sa_mask))
		err(1, "sigemptyset");
	sa.sa_flags = SA_RESTART;
	if (sigaction(SIGPROF, &sa, NULL))
		err(1, "sigaction");
	for (;;) {
		memset(seen, 0, sizeof(seen));
		dir = opendir(".");
		if (dir == NULL)
			err(1, "opendir(.)");
		while ((ent = readdir(dir)) != NULL) {
			int idx;

			if ((strcmp(ent->d_name, ".") == 0) ||
			    (strcmp(ent->d_name, "..") == 0)) {
				continue;
			}
			idx = atoi(ent->d_name);
			if (idx >= count) {
				errx(1, "bogus name %s index %d", ent->d_name, idx);
			}
			if (seen[idx]) {
				errx(1, "already seen name %s index %d", ent->d_name, idx);
			}
			seen[idx] = 1;
		}
		if (closedir(dir))
			err(1, "closedir(.)");
	}
}

> ext4: add lockdep annotations for i_data_sem
> ext4: avoid calling dquot_get_next_id() if quota is not enabled
> ext4: ignore quota mount options if the quota feature is enabled
>
> fs/btrfs/file.c | 2 +-
> fs/dcache.c | 5 -
> fs/ext4/crypto.c | 49 +
> fs/ext4/dir.c | 5 +
> fs/ext4/ext4.h | 29 +++--
> fs/ext4/file.c | 12
> fs/ext4/inode.c | 58 --
> fs/ext4/move_extent.c | 11 +--
> fs/ext4/namei.c | 5 +
> fs/ext4/page-io.c | 14 +-
> fs/ext4/readpage.c | 2 +-
> fs/ext4/super.c | 61 +++--
> fs/ext4/xattr.c | 32
> fs/nfs/dir.c | 6
Re: [PATCH 1/5] mm: uncharge kmem pages from generic free_page path
Vladimir Davydov wrote: > Currently, to charge a page to kmemcg one should use alloc_kmem_pages > helper. When the page is not needed anymore it must be freed with > free_kmem_pages helper, which will uncharge the page before freeing it. > Such a design is acceptable for thread info pages and kmalloc large > allocations, which are currently the only users of alloc_kmem_pages, but > it gets extremely inconvenient if one wants to make use of batched free > (e.g. to charge page tables - see release_pages) or page reference > counter (pipe buffers - see anon_pipe_buf_release). > > To overcome this limitation, this patch moves kmemcg uncharge code to > the generic free path and zaps free_kmem_pages helper. To distinguish > kmem pages from other page types, it makes alloc_kmem_pages initialize > page->_mapcount to a special value and introduces a new PageKmem helper, > which returns true if it sees this value. > > Signed-off-by: Vladimir Davydov > --- > include/linux/gfp.h| 3 --- > include/linux/page-flags.h | 22 ++ > kernel/fork.c | 2 +- > mm/page_alloc.c| 26 -- > mm/slub.c | 2 +- > mm/swap.c | 3 ++- > 6 files changed, 34 insertions(+), 24 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index f92cbd2f4450..b46147c45966 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -384,9 +384,6 @@ extern void *__alloc_page_frag(struct page_frag_cache *nc, > unsigned int fragsz, gfp_t gfp_mask); > extern void __free_page_frag(void *addr); > > -extern void __free_kmem_pages(struct page *page, unsigned int order); > -extern void free_kmem_pages(unsigned long addr, unsigned int order); > - > #define __free_page(page) __free_pages((page), 0) > #define free_page(addr) free_pages((addr), 0) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 416509e26d6d..a190719c2f46 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -594,6 +594,28 @@ static inline void __ClearPageBalloon(struct page *page) > } > > /* > + * PageKmem() returns true if the page was allocated with alloc_kmem_pages(). > + */ > +#define PAGE_KMEM_MAPCOUNT_VALUE (-512) > + > +static inline int PageKmem(struct page *page) > +{ > + return atomic_read(&page->_mapcount) == PAGE_KMEM_MAPCOUNT_VALUE; > +} > + > +static inline void __SetPageKmem(struct page *page) > +{ > + VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); > + atomic_set(&page->_mapcount, PAGE_KMEM_MAPCOUNT_VALUE); > +} What do you think about several special mapcount values for various types of kmem? It's helps user and administrators break down memory usage. A nice equation is: memory.usage_in_bytes = memory.stat[file + anon + unevictable + kmem] Next, it's helpful to be able to breakdown kmem into: kmem = stack + pgtable + slab + ... On one hand (and the kernel I use internally) we can use separate per memcg counters for each kmem type. Then reconstitute memory.kmem as needed by adding them together. But using keeping a single kernel kmem counter is workable if there is a way to breakdown the memory charge to a container (e.g. by walking /proc/kpageflags-ish or per memcg memory.kpageflags-ish file). > +static inline void __ClearPageKmem(struct page *page) > +{ > + VM_BUG_ON_PAGE(!PageKmem(page), page); > + atomic_set(&page->_mapcount, -1); > +} > + > +/* > * If network-based swap is enabled, sl*b must keep track of whether pages > * were allocated from pfmemalloc reserves. 
> */ > diff --git a/kernel/fork.c b/kernel/fork.c > index 2845623fb582..c23f8a17e99e 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -169,7 +169,7 @@ static struct thread_info *alloc_thread_info_node(struct > task_struct *tsk, > > static inline void free_thread_info(struct thread_info *ti) > { > - free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER); > + free_pages((unsigned long)ti, THREAD_SIZE_ORDER); > } > # else > static struct kmem_cache *thread_info_cache; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 48aaf7b9f253..88d85367c81e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -942,6 +942,10 @@ static bool free_pages_prepare(struct page *page, > unsigned int order) > > if (PageAnon(page)) > page->mapping = NULL; > + if (PageKmem(page)) { > + memcg_kmem_uncharge_pages(page, order); > + __ClearPageKmem(page); > + } > bad += free_pages_check(page); > for (i = 1; i < (1 << order); i++) { > if (compound) > @@ -3434,6 +3438,8 @@ struct page *alloc_kmem_pages(gfp_t gfp_mask, unsigned > int order) > return NULL; > page = alloc_pages(gfp_mask, order); > memcg_kmem_commit_charge(page, memcg, order); > + if (page) > + __SetPageKmem(page); > return page; > } >
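[To make the per-type suggestion above concrete, a kernel-style fragment building directly on the PageKmem code quoted from the patch (it reuses struct page, atomic_read and VM_BUG_ON_PAGE from there). The extra sentinel values and helper names are assumptions, not part of Vladimir's series: reserving one _mapcount sentinel per kmem type allows kmem to be broken down into stack, pgtable, slab, etc. without any extra per-page storage.]

#define PAGE_KMEM_STACK_MAPCOUNT_VALUE		(-512)
#define PAGE_KMEM_PGTABLE_MAPCOUNT_VALUE	(-513)
#define PAGE_KMEM_SLAB_MAPCOUNT_VALUE		(-514)	/* last reserved value */

static inline int PageKmemStack(struct page *page)
{
	return atomic_read(&page->_mapcount) == PAGE_KMEM_STACK_MAPCOUNT_VALUE;
}

/* true for any of the reserved kmem sentinels */
static inline int PageKmem(struct page *page)
{
	int mapcount = atomic_read(&page->_mapcount);

	return mapcount <= PAGE_KMEM_STACK_MAPCOUNT_VALUE &&
	       mapcount >= PAGE_KMEM_SLAB_MAPCOUNT_VALUE;
}

static inline void __SetPageKmemStack(struct page *page)
{
	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
	atomic_set(&page->_mapcount, PAGE_KMEM_STACK_MAPCOUNT_VALUE);
}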
[PATCH v2] writeback: use exact memcg dirty counts
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as: 1) per-memcg per-cpu values in range of [-32..32] 2) per-memcg atomic counter When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu. Assuming 100 cpus: 4k x86 page_size: 13 MiB error per memcg 64k ppc page_size: 200 MiB error per memcg Considering that dirty+writeback are used together for some decisions the errors double. This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu*32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill. It could be argued that tiny containers are not supported, but it's more subtle. It's the amount the space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine. The following test reliably ooms without this patch. This patch avoids oom kills. $ cat test mount -t cgroup2 none /dev/cgroup cd /dev/cgroup echo +io +memory > cgroup.subtree_control mkdir test cd test echo 10M > memory.max (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100) $ cat memcg-writeback-stress.c /* * Dirty pages from all but one cpu. * Clean pages from the non dirtying cpu. * This is to stress per cpu counter imbalance. * On a 100 cpu machine: * - per memcg per cpu dirty count is 32 pages for each of 99 cpus * - per memcg atomic is -99*32 pages * - thus the complete dirty limit: sum of all counters 0 * - balance_dirty_pages() only sees atomic count -99*32 pages, which * it max()s to 0. * - So a workload can dirty -99*32 pages before balance_dirty_pages() * cares. */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include static char *buf; static int bufSize; static void set_affinity(int cpu) { cpu_set_t affinity; CPU_ZERO(&affinity); CPU_SET(cpu, &affinity); if (sched_setaffinity(0, sizeof(affinity), &affinity)) err(1, "sched_setaffinity"); } static void dirty_on(int output_fd, int cpu) { int i, wrote; set_affinity(cpu); for (i = 0; i < 32; i++) { for (wrote = 0; wrote < bufSize; ) { int ret = write(output_fd, buf+wrote, bufSize-wrote); if (ret == -1) err(1, "write"); wrote += ret; } } } int main(int argc, char **argv) { int cpu, flush_cpu = 1, output_fd; const char *output; if (argc != 2) errx(1, "usage: output_file"); output = argv[1]; bufSize = getpagesize(); buf = malloc(getpagesize()); if (buf == NULL) errx(1, "malloc failed"); output_fd = open(output, O_CREAT|O_RDWR); if (output_fd == -1) err(1, "open(%s)", output); for (cpu = 0; cpu < get_nprocs(); cpu++) { if (cpu != flush_cpu) dirty_on(output_fd, cpu); } set_affinity(flush_cpu); if (fsync(output_fd)) err(1, "fsync(%s)", output); if (close(output_fd)) err(1, "close(%s)", output); free(buf); } Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters. This avoids the aforementioned oom kills. 
This does not affect the overhead of memory.stat, which still reads the single atomic counter. Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required. It probably also makes sense to use exact dirty and writeback counters in memcg oom reports. But that is saved for later. Cc: sta...@vger.kernel.org # v4.16+ Signed-off-by: Greg Thelen --- Changelog since v1: - Move memcg_exact_page_state() into memcontrol.c. - Unconditionally gather exact (per cpu) counters in mem_cgroup_wb_stats(), it's not called in performance sensitive paths. - Unconditionally che
Re: [PATCH] writeback: sum memcg dirty counters as needed
Andrew Morton wrote: > On Thu, 7 Mar 2019 08:56:32 -0800 Greg Thelen wrote: > >> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in >> memory.stat reporting") memcg dirty and writeback counters are managed >> as: >> 1) per-memcg per-cpu values in range of [-32..32] >> 2) per-memcg atomic counter >> When a per-cpu counter cannot fit in [-32..32] it's flushed to the >> atomic. Stat readers only check the atomic. >> Thus readers such as balance_dirty_pages() may see a nontrivial error >> margin: 32 pages per cpu. >> Assuming 100 cpus: >>4k x86 page_size: 13 MiB error per memcg >> 64k ppc page_size: 200 MiB error per memcg >> Considering that dirty+writeback are used together for some decisions >> the errors double. >> >> This inaccuracy can lead to undeserved oom kills. One nasty case is >> when all per-cpu counters hold positive values offsetting an atomic >> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). >> balance_dirty_pages() only consults the atomic and does not consider >> throttling the next n_cpu*32 dirty pages. If the file_lru is in the >> 13..200 MiB range then there's absolutely no dirty throttling, which >> burdens vmscan with only dirty+writeback pages thus resorting to oom >> kill. >> >> It could be argued that tiny containers are not supported, but it's more >> subtle. It's the amount the space available for file lru that matters. >> If a container has memory.max-200MiB of non reclaimable memory, then it >> will also suffer such oom kills on a 100 cpu machine. >> >> ... >> >> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to >> collect exact per memcg counters when a memcg is close to the >> throttling/writeback threshold. This avoids the aforementioned oom >> kills. >> >> This does not affect the overhead of memory.stat, which still reads the >> single atomic counter. >> >> Why not use percpu_counter? memcg already handles cpus going offline, >> so no need for that overhead from percpu_counter. And the >> percpu_counter spinlocks are more heavyweight than is required. >> >> It probably also makes sense to include exact dirty and writeback >> counters in memcg oom reports. But that is saved for later. > > Nice changelog, thanks. > >> Signed-off-by: Greg Thelen > > Did you consider cc:stable for this? We may as well - the stablebots > backport everything which might look slightly like a fix anyway :( Good idea. Done in -v2 of the patch. >> --- a/include/linux/memcontrol.h >> +++ b/include/linux/memcontrol.h >> @@ -573,6 +573,22 @@ static inline unsigned long memcg_page_state(struct >> mem_cgroup *memcg, >> return x; >> } >> >> +/* idx can be of type enum memcg_stat_item or node_stat_item */ >> +static inline unsigned long >> +memcg_exact_page_state(struct mem_cgroup *memcg, int idx) >> +{ >> +long x = atomic_long_read(&memcg->stat[idx]); >> +#ifdef CONFIG_SMP >> +int cpu; >> + >> +for_each_online_cpu(cpu) >> +x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx]; >> +if (x < 0) >> +x = 0; >> +#endif >> +return x; >> +} > > This looks awfully heavyweight for an inline function. Why not make it > a regular function and avoid the bloat and i-cache consumption? Done in -v2. > Also, did you instead consider making this spill the percpu counters > into memcg->stat[idx]? That might be more useful for potential future > callers. It would become a little more expensive though. I looked at that approach, but couldn't convince myself it was safe. I kept staring at "Remote [...] 
Write accesses can cause unique problems due to the relaxed synchronization requirements for this_cpu operations." from this_cpu_ops.txt. So I'd like to delay this possible optimization for a later patch.
Re: [PATCH] writeback: sum memcg dirty counters as needed
Johannes Weiner wrote: > On Thu, Mar 07, 2019 at 08:56:32AM -0800, Greg Thelen wrote: >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -3880,6 +3880,7 @@ struct wb_domain *mem_cgroup_wb_domain(struct >> bdi_writeback *wb) >> * @pheadroom: out parameter for number of allocatable pages according to >> memcg >> * @pdirty: out parameter for number of dirty pages >> * @pwriteback: out parameter for number of pages under writeback >> + * @exact: determines exact counters are required, indicates more work. >> * >> * Determine the numbers of file, headroom, dirty, and writeback pages in >> * @wb's memcg. File, dirty and writeback are self-explanatory. Headroom >> @@ -3890,18 +3891,29 @@ struct wb_domain *mem_cgroup_wb_domain(struct >> bdi_writeback *wb) >> * ancestors. Note that this doesn't consider the actual amount of >> * available memory in the system. The caller should further cap >> * *@pheadroom accordingly. >> + * >> + * Return value is the error precision associated with *@pdirty >> + * and *@pwriteback. When @exact is set this a minimal value. >> */ >> -void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long >> *pfilepages, >> - unsigned long *pheadroom, unsigned long *pdirty, >> - unsigned long *pwriteback) >> +unsigned long >> +mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, >> +unsigned long *pheadroom, unsigned long *pdirty, >> +unsigned long *pwriteback, bool exact) >> { >> struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); >> struct mem_cgroup *parent; >> +unsigned long precision; >> >> -*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); >> - >> +if (exact) { >> +precision = 0; >> +*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY); >> +*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK); >> +} else { >> +precision = MEMCG_CHARGE_BATCH * num_online_cpus(); >> +*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); >> +*pwriteback = memcg_page_state(memcg, NR_WRITEBACK); >> +} >> /* this should eventually include NR_UNSTABLE_NFS */ >> -*pwriteback = memcg_page_state(memcg, NR_WRITEBACK); >> *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) | >> (1 << LRU_ACTIVE_FILE)); >> *pheadroom = PAGE_COUNTER_MAX; >> @@ -3913,6 +3925,8 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, >> unsigned long *pfilepages, >> *pheadroom = min(*pheadroom, ceiling - min(ceiling, used)); >> memcg = parent; >> } >> + >> +return precision; > > Have you considered unconditionally using the exact version here? > > It does for_each_online_cpu(), but until very, very recently we did > this per default for all stats, for years. It only became a problem in > conjunction with the for_each_memcg loops when frequently reading > memory stats at the top of a very large hierarchy. > > balance_dirty_pages() is called against memcgs that actually own the > inodes/memory and doesn't do the additional recursive tree collection. > > It's also not *that* hot of a function, and in the io path... > > It would simplify this patch immensely. Good idea. Done in -v2 of the patch: https://lore.kernel.org/lkml/20190329174609.164344-1-gthe...@google.com/
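[For readers following the thread, this is presumably roughly what the simplification collapses to in v2: mem_cgroup_wb_stats() always pays for the per-cpu sum and no precision value is returned. A sketch reusing the memcg_exact_page_state() shown earlier in the thread, not a copy of the linked v2 patch.]

static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
{
	long x = atomic_long_read(&memcg->stat[idx]);
#ifdef CONFIG_SMP
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
	if (x < 0)
		x = 0;
#endif
	return x;
}

void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
			 unsigned long *pheadroom, unsigned long *pdirty,
			 unsigned long *pwriteback)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);

	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
	/* filepages/headroom computation unchanged from the diff quoted above */
}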
[PATCH] kbuild: clean compressed initramfs image
Since commit 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig") "make clean" leaves behind compressed initramfs images. Example: $ make defconfig $ sed -i 's|CONFIG_INITRAMFS_SOURCE=""|CONFIG_INITRAMFS_SOURCE="/tmp/ir.cpio"|' .config $ make olddefconfig $ make -s $ make -s clean $ git clean -ndxf | grep initramfs Would remove usr/initramfs_data.cpio.gz clean rules do not have CONFIG_* context so they do not know which compression format was used. Thus they don't know which files to delete. Tell clean to delete all possible compression formats. Once patched usr/initramfs_data.cpio.gz and friends are deleted by "make clean". Fixes: 9e3596b0c653 ("kbuild: initramfs cleanup, set target from Kconfig") Signed-off-by: Greg Thelen --- usr/Makefile | 3 +++ 1 file changed, 3 insertions(+) diff --git a/usr/Makefile b/usr/Makefile index 6a89eb019275..e6f7cb2f81db 100644 --- a/usr/Makefile +++ b/usr/Makefile @@ -11,6 +11,9 @@ datafile_y = initramfs_data.cpio$(suffix_y) datafile_d_y = .$(datafile_y).d AFLAGS_initramfs_data.o += -DINITRAMFS_IMAGE="usr/$(datafile_y)" +# clean rules do not have CONFIG_INITRAMFS_COMPRESSION. So clean up after all +# possible compression formats. +clean-files += initramfs_data.cpio* # Generate builtin.o based on initramfs_data.o obj-$(CONFIG_BLK_DEV_INITRD) := initramfs_data.o -- 2.22.0.657.g960e92d24f-goog
[PATCH] writeback: fix inode cgroup switching comment
Commit 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") refers to inode_switch_wb_work_fn() which never got merged. Switch the comments to inode_switch_wbs_work_fn(). Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: Greg Thelen --- include/linux/backing-dev.h | 2 +- include/linux/fs.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index c28a47cbe355..f9b029180241 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -365,7 +365,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, struct wb_lock_cookie *cookie) rcu_read_lock(); /* -* Paired with store_release in inode_switch_wb_work_fn() and +* Paired with store_release in inode_switch_wbs_work_fn() and * ensures that we see the new wb if we see cleared I_WB_SWITCH. */ cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH; diff --git a/include/linux/fs.h b/include/linux/fs.h index fd423fec8d83..08f26046233e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2091,7 +2091,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to * synchronize competing switching instances and to tell * wb stat updates to grab the i_pages lock. See - * inode_switch_wb_work_fn() for details. + * inode_switch_wbs_work_fn() for details. * * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper * and work dirs among overlayfs mounts. -- 2.21.0.352.gf09ad66450-goog
[PATCH] writeback: sum memcg dirty counters as needed
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as: 1) per-memcg per-cpu values in range of [-32..32] 2) per-memcg atomic counter When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu. Assuming 100 cpus: 4k x86 page_size: 13 MiB error per memcg 64k ppc page_size: 200 MiB error per memcg Considering that dirty+writeback are used together for some decisions the errors double. This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu*32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill. It could be argued that tiny containers are not supported, but it's more subtle. It's the amount the space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine. The following test reliably ooms without this patch. This patch avoids oom kills. $ cat test mount -t cgroup2 none /dev/cgroup cd /dev/cgroup echo +io +memory > cgroup.subtree_control mkdir test cd test echo 10M > memory.max (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100) $ cat memcg-writeback-stress.c /* * Dirty pages from all but one cpu. * Clean pages from the non dirtying cpu. * This is to stress per cpu counter imbalance. * On a 100 cpu machine: * - per memcg per cpu dirty count is 32 pages for each of 99 cpus * - per memcg atomic is -99*32 pages * - thus the complete dirty limit: sum of all counters 0 * - balance_dirty_pages() only sees atomic count -99*32 pages, which * it max()s to 0. * - So a workload can dirty -99*32 pages before balance_dirty_pages() * cares. 
*/ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include static char *buf; static int bufSize; static void set_affinity(int cpu) { cpu_set_t affinity; CPU_ZERO(&affinity); CPU_SET(cpu, &affinity); if (sched_setaffinity(0, sizeof(affinity), &affinity)) err(1, "sched_setaffinity"); } static void dirty_on(int output_fd, int cpu) { int i, wrote; set_affinity(cpu); for (i = 0; i < 32; i++) { for (wrote = 0; wrote < bufSize; ) { int ret = write(output_fd, buf+wrote, bufSize-wrote); if (ret == -1) err(1, "write"); wrote += ret; } } } int main(int argc, char **argv) { int cpu, flush_cpu = 1, output_fd; const char *output; if (argc != 2) errx(1, "usage: output_file"); output = argv[1]; bufSize = getpagesize(); buf = malloc(getpagesize()); if (buf == NULL) errx(1, "malloc failed"); output_fd = open(output, O_CREAT|O_RDWR); if (output_fd == -1) err(1, "open(%s)", output); for (cpu = 0; cpu < get_nprocs(); cpu++) { if (cpu != flush_cpu) dirty_on(output_fd, cpu); } set_affinity(flush_cpu); if (fsync(output_fd)) err(1, "fsync(%s)", output); if (close(output_fd)) err(1, "close(%s)", output); free(buf); } Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters when a memcg is close to the throttling/writeback threshold. This avoids the aforementioned oom kills. This does not affect the overhead of memory.stat, which still reads the single atomic counter. Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required. It probably also makes sense to include exact dirty and writeback counters in memcg oom reports. But that is saved for later. Signed-off-by: Greg Thelen --- include/linux/memcontrol.h | 33 + mm/memcontrol.c| 26 -- mm/page-writeback.c| 27 +-- 3 fi
[PATCH] trace: fix SKIP_STACK_VALIDATION=1 build
Non gcc-5 builds with CONFIG_STACK_VALIDATION=y and SKIP_STACK_VALIDATION=1 fail. Example output: /bin/sh: init/.tmp_main.o: Permission denied commit 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace"), added a mismatched endif. This causes cmd_objtool to get mistakenly set. Relocate endif to balance the newly added -record-mcount check. Fixes: 96f60dfa5819 ("trace: Use -mcount-record for dynamic ftrace") Signed-off-by: Greg Thelen --- scripts/Makefile.build | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 34d9e9ce97c2..e7889f486ca1 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -239,6 +239,7 @@ cmd_record_mcount = \ "$(CC_FLAGS_FTRACE)" ]; then \ $(sub_cmd_record_mcount)\ fi; +endif # -record-mcount endif # CONFIG_FTRACE_MCOUNT_RECORD ifdef CONFIG_STACK_VALIDATION @@ -263,7 +264,6 @@ ifneq ($(RETPOLINE_CFLAGS),) objtool_args += --retpoline endif endif -endif ifdef CONFIG_MODVERSIONS -- 2.18.0.rc1.242.g61856ae69a-goog
[PATCH] RDMA/ucma: reject AF_IB ip multicast requests
syzbot discovered that ucma_join_ip_multicast() mishandles AF_IB request addresses. If an RDMA_USER_CM_CMD_JOIN_IP_MCAST request has cmd.addr.sa_family=AF_IB then ucma_join_ip_multicast() reads beyond the end of its cmd.addr. Reject non IP RDMA_USER_CM_CMD_JOIN_IP_MCAST requests. RDMA_USER_CM_CMD_JOIN_MCAST is interface for AF_IB multicast. And add a buffer length safety check. Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when joining multicast") Signed-off-by: Greg Thelen --- drivers/infiniband/core/ucma.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index e5a1e7d81326..e410e03940ff 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -1423,11 +1423,19 @@ static ssize_t ucma_join_ip_multicast(struct ucma_file *file, if (copy_from_user(&cmd, inbuf, sizeof(cmd))) return -EFAULT; + switch (cmd.addr.sin6_family) { + case AF_INET: + case AF_INET6: + break; + default: + return -EINVAL; + } + join_cmd.response = cmd.response; join_cmd.uid = cmd.uid; join_cmd.id = cmd.id; join_cmd.addr_size = rdma_addr_size((struct sockaddr *) &cmd.addr); - if (!join_cmd.addr_size) + if (!join_cmd.addr_size || join_cmd.addr_size > sizeof(cmd.addr)) return -EINVAL; join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER; -- 2.17.0.rc1.321.gba9d0f2565-goog
Re: [PATCH] RDMA/ucma: reject AF_IB ip multicast requests
On Thu, Mar 29, 2018 at 9:24 PM, Greg Thelen wrote: > syzbot discovered that ucma_join_ip_multicast() mishandles AF_IB request > addresses. If an RDMA_USER_CM_CMD_JOIN_IP_MCAST request has > cmd.addr.sa_family=AF_IB then ucma_join_ip_multicast() reads beyond the > end of its cmd.addr. > > Reject non IP RDMA_USER_CM_CMD_JOIN_IP_MCAST requests. > RDMA_USER_CM_CMD_JOIN_MCAST is interface for AF_IB multicast. > > And add a buffer length safety check. > > Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when > joining multicast") > Signed-off-by: Greg Thelen > --- > drivers/infiniband/core/ucma.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) My patch is no longer needed. linus/master recently picked up 84652aefb347 ("RDMA/ucma: Introduce safer rdma_addr_size() variants") which fixes the same issue. > diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c > index e5a1e7d81326..e410e03940ff 100644 > --- a/drivers/infiniband/core/ucma.c > +++ b/drivers/infiniband/core/ucma.c > @@ -1423,11 +1423,19 @@ static ssize_t ucma_join_ip_multicast(struct > ucma_file *file, > if (copy_from_user(&cmd, inbuf, sizeof(cmd))) > return -EFAULT; > > + switch (cmd.addr.sin6_family) { > + case AF_INET: > + case AF_INET6: > + break; > + default: > + return -EINVAL; > + } > + > join_cmd.response = cmd.response; > join_cmd.uid = cmd.uid; > join_cmd.id = cmd.id; > join_cmd.addr_size = rdma_addr_size((struct sockaddr *) &cmd.addr); > - if (!join_cmd.addr_size) > + if (!join_cmd.addr_size || join_cmd.addr_size > sizeof(cmd.addr)) > return -EINVAL; > > join_cmd.join_flags = RDMA_MC_JOIN_FLAG_FULLMEMBER; > -- > 2.17.0.rc1.321.gba9d0f2565-goog >
[RFC PATCH 0/2] memory.low,min reclaim
=> 0k # F: 4203964k => 4203908k # F/g : 2102016k => 2101992k # F/H : 2101948k => 2101916k # i: 4204408k => 4203988k # i/j : 2101984k => 2101936k # i/K : 2102424k => 2102052k # L: 4189960k => 3886296k # L/m : 2101968k => 2838704k # L/N : 2087992k => 1047592k set -o errexit set -o nounset set -o pipefail LIM="$1"; shift ANTAGONIST="$1"; shift CGPATH=/tmp/cgroup vmtouch2() { rm -f "$2" (echo $BASHPID > "${CGPATH}/$1/cgroup.procs" && exec /tmp/mmap --loop 1 --file "$2" "$3") } vmtouch() { # twice to ensure slab caches are warmed up and all objs are charged to cgroup. vmtouch2 "$1" "$2" "$3" vmtouch2 "$1" "$2" "$3" } dump() { for i in A A/b A/C A/D A/e F F/g F/H i i/j i/K L L/m L/N; do printf "%-5s: %sk\n" $i $(($(cat "${CGPATH}/${i}/memory.current")/1024)) done } rm -f /file_? if [[ -e "${CGPATH}/A" ]]; then rmdir ${CGPATH}/?/? ${CGPATH}/? fi echo "+memory" > "${CGPATH}/cgroup.subtree_control" mkdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L" echo "+memory" > "${CGPATH}/A/cgroup.subtree_control" echo "+memory" > "${CGPATH}/F/cgroup.subtree_control" echo "+memory" > "${CGPATH}/i/cgroup.subtree_control" echo "+memory" > "${CGPATH}/L/cgroup.subtree_control" mkdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e" mkdir "${CGPATH}/F/g" "${CGPATH}/F/H" mkdir "${CGPATH}/i/j" "${CGPATH}/i/K" mkdir "${CGPATH}/L/m" "${CGPATH}/L/N" echo 2G > "${CGPATH}/A/memory.${LIM}" echo 3G > "${CGPATH}/A/b/memory.${LIM}" echo 1G > "${CGPATH}/A/C/memory.${LIM}" echo 0 > "${CGPATH}/A/D/memory.${LIM}" echo 10G > "${CGPATH}/A/e/memory.${LIM}" echo 2G > "${CGPATH}/F/memory.${LIM}" echo 3G > "${CGPATH}/F/g/memory.${LIM}" echo 1G > "${CGPATH}/F/H/memory.${LIM}" echo 5G > "${CGPATH}/i/memory.${LIM}" echo 3G > "${CGPATH}/i/j/memory.${LIM}" echo 1G > "${CGPATH}/i/K/memory.${LIM}" echo 2G > "${CGPATH}/L/memory.${LIM}" echo 4G > "${CGPATH}/L/memory.max" echo 3G > "${CGPATH}/L/m/memory.${LIM}" echo 1G > "${CGPATH}/L/N/memory.${LIM}" vmtouch A/b /file_b 2G vmtouch A/C /file_C 2G vmtouch A/D /file_D 2G vmtouch F/g /file_g 2G vmtouch F/H /file_H 2G vmtouch i/j /file_j 2G vmtouch i/K /file_K 2G vmtouch L/m /file_m 2G vmtouch L/N /file_N 2G vmtouch2 "$ANTAGONIST" /file_ant 150G echo echo "after $ANTAGONIST antagonist" dump rmdir "${CGPATH}/A/b" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/e" rmdir "${CGPATH}/F/g" "${CGPATH}/F/H" rmdir "${CGPATH}/i/j" "${CGPATH}/i/K" rmdir "${CGPATH}/L/m" "${CGPATH}/L/N" rmdir "${CGPATH}/A" "${CGPATH}/F" "${CGPATH}/i" "${CGPATH}/L" rm /file_ant /file_b /file_C /file_D /file_g /file_H /file_j /file_K Greg Thelen (2): memcg: fix memory.low memcg: add memory.min include/linux/memcontrol.h | 8 + mm/memcontrol.c| 70 ++ mm/vmscan.c| 3 ++ 3 files changed, 75 insertions(+), 6 deletions(-) -- 2.17.0.484.g0c8726318c-goog
[RFC PATCH 1/2] memcg: fix memory.low
When targeting reclaim to a memcg, protect that memcg from reclaim if memory consumption of any level is below its respective memory.low. Signed-off-by: Greg Thelen --- mm/memcontrol.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 670e99b68aa6..9668f620203a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5341,8 +5341,8 @@ struct cgroup_subsys memory_cgrp_subsys = { * @root: the top ancestor of the sub-tree being checked * @memcg: the memory cgroup to check * - * Returns %true if memory consumption of @memcg, and that of all - * ancestors up to (but not including) @root, is below the normal range. + * Returns %true if memory consumption of @memcg, or any of its ancestors + * up to (but not including) @root, is below the normal range. * * @root is exclusive; it is never low when looked at directly and isn't * checked when traversing the hierarchy. @@ -5379,12 +5379,12 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg) if (memcg == root) return false; + /* If any level is under, then protect @memcg from reclaim */ for (; memcg != root; memcg = parent_mem_cgroup(memcg)) { - if (page_counter_read(&memcg->memory) >= memcg->low) - return false; + if (page_counter_read(&memcg->memory) <= memcg->low) + return true; /* protect from reclaim */ } - - return true; + return false; /* not protected from reclaim */ } /** -- 2.17.0.484.g0c8726318c-goog
[RFC PATCH 2/2] memcg: add memory.min
The new memory.min limit is similar to memory.low, just no bypassing it when reclaim is desparate. Prefer oom kills before reclaim memory below memory.min. Sharing more code with memory_cgroup_low() is possible, but the idea is posted here for simplicity. Signed-off-by: Greg Thelen --- include/linux/memcontrol.h | 8 ++ mm/memcontrol.c| 58 ++ mm/vmscan.c| 3 ++ 3 files changed, 69 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c46016bb25eb..22bb4a88653a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -178,6 +178,7 @@ struct mem_cgroup { struct page_counter tcpmem; /* Normal memory consumption range */ + unsigned long min; unsigned long low; unsigned long high; @@ -281,6 +282,7 @@ static inline bool mem_cgroup_disabled(void) return !cgroup_subsys_enabled(memory_cgrp_subsys); } +bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg); bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg); int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm, @@ -726,6 +728,12 @@ static inline void mem_cgroup_event(struct mem_cgroup *memcg, { } +static inline bool mem_cgroup_min(struct mem_cgroup *root, + struct mem_cgroup *memcg) +{ + return false; +} + static inline bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9668f620203a..b2aaed4003b4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5058,6 +5058,36 @@ static u64 memory_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE; } +static int memory_min_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); + unsigned long min = READ_ONCE(memcg->min); + + if (min == PAGE_COUNTER_MAX) + seq_puts(m, "max\n"); + else + seq_printf(m, "%llu\n", (u64)min * PAGE_SIZE); + + return 0; +} + +static ssize_t memory_min_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + unsigned long min; + int err; + + buf = strstrip(buf); + err = page_counter_memparse(buf, "max", &min); + if (err) + return err; + + memcg->min = min; + + return nbytes; +} + static int memory_low_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m)); @@ -5288,6 +5318,12 @@ static struct cftype memory_files[] = { .flags = CFTYPE_NOT_ON_ROOT, .read_u64 = memory_current_read, }, + { + .name = "min", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = memory_min_show, + .write = memory_min_write, + }, { .name = "low", .flags = CFTYPE_NOT_ON_ROOT, @@ -5336,6 +5372,28 @@ struct cgroup_subsys memory_cgrp_subsys = { .early_init = 0, }; +/** + * mem_cgroup_min returns true for a memcg below its min limit. Such memcg are + * excempt from reclaim. 
+ */ +bool mem_cgroup_min(struct mem_cgroup *root, struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return false; + + if (!root) + root = root_mem_cgroup; + + if (memcg == root) + return false; + + for (; memcg != root; memcg = parent_mem_cgroup(memcg)) { + if (page_counter_read(&memcg->memory) <= memcg->min) + return true; /* protect */ + } + return false; /* !protect */ +} + /** * mem_cgroup_low - check if memory consumption is below the normal range * @root: the top ancestor of the sub-tree being checked diff --git a/mm/vmscan.c b/mm/vmscan.c index cd5dc3faaa57..15ae19a38ad5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2539,6 +2539,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) unsigned long reclaimed; unsigned long scanned; + if (mem_cgroup_min(root, memcg)) + continue; + if (mem_cgroup_low(root, memcg)) { if (!sc->memcg_low_reclaim) { sc->memcg_low_skipped = 1; -- 2.17.0.484.g0c8726318c-goog
Re: [RFC PATCH 0/2] memory.low,min reclaim
On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin wrote: > Hi, Greg! > On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote: > > Roman's previously posted memory.low,min patches add per memcg effective > > low limit to detect overcommitment of parental limits. But if we flip > > low,min reclaim to bail if usage<{low,min} at any level, then we don't need > > an effective low limit, which makes the code simpler. When parent limits > > are overcommited memory.min will oom kill, which is more drastic but makes > > the memory.low a simpler concept. If memcg a/b wants oom kill before > > reclaim, then give it to them. It seems a bit strange for a/b/memory.low's > > behaviour to depend on a/c/memory.low (i.e. a/b.low is strong unless > > a/b.low+a/c.low exceed a.low). > It's actually not strange: a/b and a/c are sharing a common resource: > a/memory.low. > Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max. > If there are sibling cgroups which are consuming memory, a cgroup can't > exceed parent's memory.max, even if its memory.max is grater. > > > > I think there might be a simpler way (ableit it doesn't yet include > > Documentation): > > - memcg: fix memory.low > > - memcg: add memory.min > > 3 files changed, 75 insertions(+), 6 deletions(-) > > > > The idea of this alternate approach is for memory.low,min to avoid reclaim > > if any portion of under-consideration memcg ancestry is under respective > > limit. > This approach has a significant downside: it breaks hierarchical constraints > for memory.low/min. There are two important outcomes: > 1) Any leaf's memory.low/min value is respected, even if parent's value > is lower or even 0. It's not possible anymore to limit the amount of > protected memory for a sub-tree. > This is especially bad in case of delegation. As someone who has been using something like memory.min for a while, I have cases where it needs to be a strong protection. Such jobs prefer oom kill to reclaim. These jobs know they need X MB of memory. But I guess it's on me to avoid configuring machines which overcommit memory.min at such cgroup levels all the way to the root. > 2) If a cgroup has an ancestor with the usage under its memory.low/min, > it becomes protection, even if its memory.low/min is 0. So it becomes > impossible to have unprotected cgroups in protected sub-tree. Fair point. One use case is where a non trivial job which has several memory accounting subcontainers. Is there a way to only set memory.low at the top and have the offer protection to the job? The case I'm thinking of is: % cd /cgroup % echo +memory > cgroup.subtree_control % mkdir top % echo +memory > top/cgroup.subtree_control % mkdir top/part1 top/part2 % echo 1GB > top/memory.min % (echo $BASHPID > top/part1/cgroup.procs && part1) % (echo $BASHPID > top/part2/cgroup.procs && part2) Empirically it's been measured that the entire workload (/top) needs 1GB to perform well. But we don't care how the memory is distributed between part1,part2. Is the strategy for that to set /top, /top/part1.min, and /top/part2.min to 1GB? What do you think about exposing emin and elow to user space? I think that would reduce admin/user confusion in situations where memory.min is internally discounted. (tangent) Delegation in v2 isn't something I've been able to fully internalize yet. The "no interior processes" rule challenges my notion of subdelegation. My current model is where a system controller creates a container C with C.min and then starts client manager process M in C. 
Then M can choose to further divide C's resources (e.g. C/S). This doesn't seem possible because v2 doesn't allow for interior processes. So the system manager would need to create C, set C.low, create C/sub_manager, create C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, then start M in C/sub_manager. Then sub_manager can create and manage C/sub_resources/S. PS: Thanks for the memory.low and memory.min work. Regardless of how we proceed it's better than the upstream memory.soft_limit_in_bytes!
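[Purely to illustrate the layout just described, an untested sketch of what the system manager would do before handing control to M; the cgroup2 mount point, the 1G value, and the /usr/bin/M manager binary are all assumptions.]

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		err(1, "%s", path);
	close(fd);
}

int main(void)
{
	/* create C and its two leaves; processes may only live in leaves */
	if (mkdir("/sys/fs/cgroup/C", 0755) ||
	    mkdir("/sys/fs/cgroup/C/sub_manager", 0755) ||
	    mkdir("/sys/fs/cgroup/C/sub_resources", 0755))
		err(1, "mkdir");
	write_file("/sys/fs/cgroup/cgroup.subtree_control", "+memory");
	write_file("/sys/fs/cgroup/C/cgroup.subtree_control", "+memory");
	write_file("/sys/fs/cgroup/C/memory.low", "1G");
	write_file("/sys/fs/cgroup/C/sub_manager/memory.low", "1G");
	write_file("/sys/fs/cgroup/C/sub_resources/memory.low", "1G");
	/* join sub_manager ("0" moves the writer), then exec the delegated manager */
	write_file("/sys/fs/cgroup/C/sub_manager/cgroup.procs", "0");
	execl("/usr/bin/M", "M", (char *)NULL);	/* hypothetical manager binary */
	err(1, "execl");
}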
[PATCH] memcg: mark memcg1_events static const
Mark memcg1_events static: it's only used by memcontrol.c. And mark it const: it's not modified. Signed-off-by: Greg Thelen --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2bd3df3d101a..c9c7e5ea0e2f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3083,7 +3083,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) #endif /* CONFIG_NUMA */ /* Universal VM events cgroup1 shows, original sort order */ -unsigned int memcg1_events[] = { +static const unsigned int memcg1_events[] = { PGPGIN, PGPGOUT, PGFAULT, -- 2.17.0.441.gb46fe60e1d-goog
[PATCH] IB: remove redundant INFINIBAND kconfig dependencies
INFINIBAND_ADDR_TRANS depends on INFINIBAND. So there's no need for options which depend INFINIBAND_ADDR_TRANS to also depend on INFINIBAND. Remove the unnecessary INFINIBAND depends. Signed-off-by: Greg Thelen --- drivers/infiniband/ulp/srpt/Kconfig | 2 +- drivers/nvme/host/Kconfig | 2 +- drivers/nvme/target/Kconfig | 2 +- drivers/staging/lustre/lnet/Kconfig | 2 +- fs/cifs/Kconfig | 2 +- net/9p/Kconfig | 2 +- net/rds/Kconfig | 2 +- net/sunrpc/Kconfig | 2 +- 8 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/srpt/Kconfig b/drivers/infiniband/ulp/srpt/Kconfig index fb8b7182f05e..25bf6955b6d0 100644 --- a/drivers/infiniband/ulp/srpt/Kconfig +++ b/drivers/infiniband/ulp/srpt/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_SRPT tristate "InfiniBand SCSI RDMA Protocol target support" - depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE + depends on INFINIBAND_ADDR_TRANS && TARGET_CORE ---help--- Support for the SCSI RDMA Protocol (SRP) Target driver. The diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index 88a8b5916624..dbb7464c018c 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -27,7 +27,7 @@ config NVME_FABRICS config NVME_RDMA tristate "NVM Express over Fabrics RDMA host driver" - depends on INFINIBAND && INFINIBAND_ADDR_TRANS && BLOCK + depends on INFINIBAND_ADDR_TRANS && BLOCK select NVME_CORE select NVME_FABRICS select SG_POOL diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 3c7b61ddb0d1..7595664ee753 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -27,7 +27,7 @@ config NVME_TARGET_LOOP config NVME_TARGET_RDMA tristate "NVMe over Fabrics RDMA target support" - depends on INFINIBAND && INFINIBAND_ADDR_TRANS + depends on INFINIBAND_ADDR_TRANS depends on NVME_TARGET select SGL_ALLOC help diff --git a/drivers/staging/lustre/lnet/Kconfig b/drivers/staging/lustre/lnet/Kconfig index ad049e6f24e4..f3b1ad4bd3dc 100644 --- a/drivers/staging/lustre/lnet/Kconfig +++ b/drivers/staging/lustre/lnet/Kconfig @@ -34,7 +34,7 @@ config LNET_SELFTEST config LNET_XPRT_IB tristate "LNET infiniband support" - depends on LNET && PCI && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on LNET && PCI && INFINIBAND_ADDR_TRANS default LNET && INFINIBAND help This option allows the LNET users to use infiniband as an diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig index 5f132d59dfc2..d61e2de8d0eb 100644 --- a/fs/cifs/Kconfig +++ b/fs/cifs/Kconfig @@ -197,7 +197,7 @@ config CIFS_SMB311 config CIFS_SMB_DIRECT bool "SMB Direct support (Experimental)" - depends on CIFS=m && INFINIBAND && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND=y && INFINIBAND_ADDR_TRANS=y + depends on CIFS=m && INFINIBAND_ADDR_TRANS || CIFS=y && INFINIBAND_ADDR_TRANS=y help Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1. SMB Direct allows transferring SMB packets over RDMA. If unsure, diff --git a/net/9p/Kconfig b/net/9p/Kconfig index e6014e0e51f7..46c39f7da444 100644 --- a/net/9p/Kconfig +++ b/net/9p/Kconfig @@ -32,7 +32,7 @@ config NET_9P_XEN config NET_9P_RDMA - depends on INET && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on INET && INFINIBAND_ADDR_TRANS tristate "9P RDMA Transport (Experimental)" help This builds support for an RDMA transport. 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig index bffde4b46c5d..1a31502ee7db 100644 --- a/net/rds/Kconfig +++ b/net/rds/Kconfig @@ -8,7 +8,7 @@ config RDS config RDS_RDMA tristate "RDS over Infiniband" - depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on RDS && INFINIBAND_ADDR_TRANS ---help--- Allow RDS to use Infiniband as a transport. This transport supports RDMA operations. diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig index ac09ca803296..6358e5271070 100644 --- a/net/sunrpc/Kconfig +++ b/net/sunrpc/Kconfig @@ -50,7 +50,7 @@ config SUNRPC_DEBUG config SUNRPC_XPRT_RDMA tristate "RPC-over-RDMA transport" - depends on SUNRPC && INFINIBAND && INFINIBAND_ADDR_TRANS + depends on SUNRPC && INFINIBAND_ADDR_TRANS default SUNRPC && INFINIBAND select SG_POOL help -- 2.17.0.441.gb46fe60e1d-goog
Re: [PATCH 3/5] ib_srpt: depend on INFINIBAND_ADDR_TRANS
On Tue, May 1, 2018 at 1:48 PM Jason Gunthorpe wrote: > On Tue, May 01, 2018 at 03:08:57AM +0000, Greg Thelen wrote: > > On Mon, Apr 30, 2018 at 4:35 PM Jason Gunthorpe wrote: > > > > > On Wed, Apr 25, 2018 at 03:33:39PM -0700, Greg Thelen wrote: > > > > INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols. > > > > So declare the kconfig dependency. This is necessary to allow for > > > > enabling INFINIBAND without INFINIBAND_ADDR_TRANS. > > > > > > > > Signed-off-by: Greg Thelen > > > > Cc: Tarick Bedeir > > > > drivers/infiniband/ulp/srpt/Kconfig | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/infiniband/ulp/srpt/Kconfig > > b/drivers/infiniband/ulp/srpt/Kconfig > > > > index 31ee83d528d9..fb8b7182f05e 100644 > > > > +++ b/drivers/infiniband/ulp/srpt/Kconfig > > > > @@ -1,6 +1,6 @@ > > > > config INFINIBAND_SRPT > > > > tristate "InfiniBand SCSI RDMA Protocol target support" > > > > - depends on INFINIBAND && TARGET_CORE > > > > + depends on INFINIBAND && INFINIBAND_ADDR_TRANS && TARGET_CORE > > > > > Isn't INFINIBAND && INFINIBAND_ADDR_TRANS a bit redundant? Can't have > > > INFINIBAND_ADDR_TRANS without INFINIBAND. > > > > By kconfig INFINIBAND_ADDR_TRANS depends on INFINIBAND. So yes, it seems > > redundant. I don't know if anyone has designs to break this dependency and > > allow for ADDR_TRANS without INFINIBAND. Assuming not, I'd be willing to > > amend my series removing redundant INFINIBAND and a followup series to > > remove it from similar depends. Though I'm not familiar with rdma dev tree > > lifecycle. Is rdma/for-rc a throw away branch (akin to linux-next), or > > will it be merged into linus/master? If throwaway, then we can amend > > its patches, otherwise followups will be needed. > > > > Let me know what you'd prefer. Thanks. > I think a single update patch to fix all of these, and the > pre-existing ones would be great I just posted the cleanup: https://lkml.org/lkml/2018/5/3/1045
[PATCH v2] IB: make INFINIBAND_ADDR_TRANS configurable
Allow INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen Cc: Tarick Bedeir Change-Id: I6fbbf8a432e467710fa65e4904b7d61880b914e5 --- drivers/infiniband/Kconfig | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index ee270e065ba9..2a972ed6851b 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING pages on demand instead. config INFINIBAND_ADDR_TRANS - bool + bool "RDMA/CM" depends on INFINIBAND default y + ---help--- + Support for RDMA communication manager (CM). + This allows for a generic connection abstraction over RDMA. config INFINIBAND_ADDR_TRANS_CONFIGFS bool -- 2.17.0.484.g0c8726318c-goog
Re: [PATCH v2] IB: make INFINIBAND_ADDR_TRANS configurable
On Sat, Apr 14, 2018 at 8:13 AM Dennis Dalessandro < dennis.dalessan...@intel.com> wrote: > On 4/13/2018 1:27 PM, Greg Thelen wrote: > > Allow INFINIBAND without INFINIBAND_ADDR_TRANS. > > > > Signed-off-by: Greg Thelen > > Cc: Tarick Bedeir > > Change-Id: I6fbbf8a432e467710fa65e4904b7d61880b914e5 > Forgot to remove the Gerrit thing. > -Denny Ack. My bad. Will repost. Unfortunately checkpatch didn't notice.
[PATCH v3] IB: make INFINIBAND_ADDR_TRANS configurable
Allow INFINIBAND without INFINIBAND_ADDR_TRANS. Signed-off-by: Greg Thelen Cc: Tarick Bedeir --- drivers/infiniband/Kconfig | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index ee270e065ba9..2a972ed6851b 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -61,9 +61,12 @@ config INFINIBAND_ON_DEMAND_PAGING pages on demand instead. config INFINIBAND_ADDR_TRANS - bool + bool "RDMA/CM" depends on INFINIBAND default y + ---help--- + Support for RDMA communication manager (CM). + This allows for a generic connection abstraction over RDMA. config INFINIBAND_ADDR_TRANS_CONFIGFS bool -- 2.17.0.484.g0c8726318c-goog
Re: [PATCH] IB: make INFINIBAND_ADDR_TRANS configurable
On Sun, Apr 15, 2018 at 5:06 AM Christoph Hellwig wrote: > On Fri, Apr 13, 2018 at 12:06:44AM -0700, Greg Thelen wrote: > > Allow INFINIBAND without INFINIBAND_ADDR_TRANS. > Why? We are pushing everyone heavily to use RDMA/CM, so making it > optional seems rather counter-intuitive. Bugs. We don't currently use CM. But fuzz testing has been finding a fair number of CM bugs, so we were thinking of disabling it until we need it. > You'll also have to fix tons of ULPs to explicitly depend on > INFINIBAND_ADDR_TRANS, or make code conditional on it. I think I've identified the set of options which use INFINIBAND_ADDR_TRANS without a kconfig depends: * CIFS_SMB_DIRECT * INFINIBAND_SRPT * NVME_RDMA * NVME_TARGET_RDMA I have patches for the above, but need to finish the commit logs. Let me know if they'll be nacked and I'll just patch my kernel and forget upstreaming.
[PATCH 1/2] kasan: drain quarantine of memcg slab objects
Per memcg slab accounting and kasan have a problem with kmem_cache destruction. - kmem_cache_create() allocates a kmem_cache, which is used for allocations from processes running in root (top) memcg. - Processes running in non root memcg and allocating with either __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg kmem_cache. - Kasan catches use-after-free by having kfree() and kmem_cache_free() defer freeing of objects. Objects are placed in a quarantine. - kmem_cache_destroy() destroys root and non root kmem_caches. It takes care to drain the quarantine of objects from the root memcg's kmem_cache, but ignores objects associated with non root memcg. This causes leaks because quarantined per memcg objects refer to per memcg kmem cache being destroyed. To see the problem: 1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,) 2) from non root memcg, allocate and free a few objects from cache 3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy() will trigger a "Slab cache still has objects" warning indicating that the per memcg kmem_cache structure was leaked. Fix the leak by draining kasan quarantined objects allocated from non root memcg. Racing memcg deletion is tricky, but handled. kmem_cache_destroy() => shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache() flushes per memcg quarantined objects, even if that memcg has been rmdir'd and gone through memcg_deactivate_kmem_caches(). This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is enabled. So I don't think it's worth patching stable kernels. Signed-off-by: Greg Thelen --- include/linux/kasan.h | 4 ++-- mm/kasan/kasan.c | 2 +- mm/kasan/quarantine.c | 1 + mm/slab_common.c | 4 +++- 4 files changed, 7 insertions(+), 4 deletions(-) diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 820c0ad54a01..c908b25bf5a5 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -52,7 +52,7 @@ void kasan_free_pages(struct page *page, unsigned int order); void kasan_cache_create(struct kmem_cache *cache, size_t *size, unsigned long *flags); void kasan_cache_shrink(struct kmem_cache *cache); -void kasan_cache_destroy(struct kmem_cache *cache); +void kasan_cache_shutdown(struct kmem_cache *cache); void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); @@ -98,7 +98,7 @@ static inline void kasan_cache_create(struct kmem_cache *cache, size_t *size, unsigned long *flags) {} static inline void kasan_cache_shrink(struct kmem_cache *cache) {} -static inline void kasan_cache_destroy(struct kmem_cache *cache) {} +static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} static inline void kasan_poison_slab(struct page *page) {} static inline void kasan_unpoison_object_data(struct kmem_cache *cache, diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c index 0e9505f66ec1..8d020ad5b74a 100644 --- a/mm/kasan/kasan.c +++ b/mm/kasan/kasan.c @@ -428,7 +428,7 @@ void kasan_cache_shrink(struct kmem_cache *cache) quarantine_remove_cache(cache); } -void kasan_cache_destroy(struct kmem_cache *cache) +void kasan_cache_shutdown(struct kmem_cache *cache) { quarantine_remove_cache(cache); } diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index baabaad4a4aa..fb362cb19157 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -273,6 +273,7 @@ static void per_cpu_remove_cache(void *arg) qlist_free_all(&to_free, cache); } +/* Free all quarantined objects belonging to cache. 
*/ void quarantine_remove_cache(struct kmem_cache *cache) { unsigned long flags; diff --git a/mm/slab_common.c b/mm/slab_common.c index 329b03843863..d3c8602dea5d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -455,6 +455,9 @@ EXPORT_SYMBOL(kmem_cache_create); static int shutdown_cache(struct kmem_cache *s, struct list_head *release, bool *need_rcu_barrier) { + /* free asan quarantined objects */ + kasan_cache_shutdown(s); + if (__kmem_cache_shutdown(s) != 0) return -EBUSY; @@ -715,7 +718,6 @@ void kmem_cache_destroy(struct kmem_cache *s) get_online_cpus(); get_online_mems(); - kasan_cache_destroy(s); mutex_lock(&slab_mutex); s->refcount--; -- 2.8.0.rc3.226.g39d4020
[PATCH 2/2] kasan: add memcg kmem_cache test
Make a kasan test which uses a SLAB_ACCOUNT slab cache. If the test is run within a non default memcg, then it uncovers the bug fixed by "kasan: drain quarantine of memcg slab objects"[1]. If run without fix [1] it shows "Slab cache still has objects", and the kmem_cache structure is leaked. Here's an unpatched kernel test: $ dmesg -c > /dev/null $ mkdir /sys/fs/cgroup/memory/test $ echo $$ > /sys/fs/cgroup/memory/test/tasks $ modprobe test_kasan 2> /dev/null $ dmesg | grep -B1 still [ 123.456789] kasan test: memcg_accounted_kmem_cache allocate memcg accounted object [ 124.456789] kmem_cache_destroy test_cache: Slab cache still has objects Kernels with fix [1] don't have the "Slab cache still has objects" warning or the underlying leak. The new test runs and passes in the default (root) memcg, though in the root memcg it won't uncover the problem fixed by [1]. Signed-off-by: Greg Thelen --- lib/test_kasan.c | 34 ++ 1 file changed, 34 insertions(+) diff --git a/lib/test_kasan.c b/lib/test_kasan.c index fbdf87920093..0b1d3140fbb8 100644 --- a/lib/test_kasan.c +++ b/lib/test_kasan.c @@ -11,6 +11,7 @@ #define pr_fmt(fmt) "kasan test: %s " fmt, __func__ +#include #include #include #include @@ -331,6 +332,38 @@ static noinline void __init kmem_cache_oob(void) kmem_cache_destroy(cache); } +static noinline void __init memcg_accounted_kmem_cache(void) +{ + int i; + char *p; + size_t size = 200; + struct kmem_cache *cache; + + cache = kmem_cache_create("test_cache", size, 0, SLAB_ACCOUNT, NULL); + if (!cache) { + pr_err("Cache allocation failed\n"); + return; + } + + pr_info("allocate memcg accounted object\n"); + /* +* Several allocations with a delay to allow for lazy per memcg kmem +* cache creation. +*/ + for (i = 0; i < 5; i++) { + p = kmem_cache_alloc(cache, GFP_KERNEL); + if (!p) { + pr_err("Allocation failed\n"); + goto free_cache; + } + kmem_cache_free(cache, p); + msleep(100); + } + +free_cache: + kmem_cache_destroy(cache); +} + static char global_array[10]; static noinline void __init kasan_global_oob(void) @@ -460,6 +493,7 @@ static int __init kmalloc_tests_init(void) kmalloc_uaf_memset(); kmalloc_uaf2(); kmem_cache_oob(); + memcg_accounted_kmem_cache(); kasan_stack_oob(); kasan_global_oob(); ksize_unpoisons_memory(); -- 2.8.0.rc3.226.g39d4020
Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned
Michal Hocko wrote: > On Tue 22-09-15 15:16:32, Greg Thelen wrote: >> mem_cgroup_read_stat() returns a page count by summing per cpu page >> counters. The summing is racy wrt. updates, so a transient negative sum >> is possible. Callers don't want negative values: >> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback. > > OK, this can confuse dirty throttling AFAIU > >> - oom reports and memory.stat shouldn't show confusing negative usage. > > I guess this is not earth shattering. > >> - tree_usage() already avoids negatives. >> >> Avoid returning negative page counts from mem_cgroup_read_stat() and >> convert it to unsigned. >> >> Signed-off-by: Greg Thelen > > I guess we want that for stable 4.2 because of the dirty throttling > part. Longterm we should use generic per-cpu counter. > > Acked-by: Michal Hocko > > Thanks! Correct, this is not an earth shattering patch. The patch only filters out negative memcg stat values from mem_cgroup_read_stat() callers. Negative values should only be temporary due to stat update races. So I'm not sure it's worth sending it to stable. I've heard no reports of it troubling anyone. The worst case without this patch is that memcg temporarily burps up a negative dirty and/or writeback count which causes balance_dirty_pages() to sleep for a (at most) 200ms nap (MAX_PAUSE). Ccing Tejun in case there are more serious consequences to balance_dirty_pages() occasionally seeing a massive (underflowed) dirty or writeback count. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
Dave Hansen wrote: > I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu > 14.04.3 system. The system will run fine for a few hours, but suddenly > start becoming horribly I/O bound. A compile of perf for instance takes > 20-30 minutes and the compile seems entirely I/O bound. But, the SSD is > only seeing tens or hundreds of KB/s of writes. > > Looking at some writeback tracepoints shows it hitting > balance_dirty_pages() pretty hard with a pretty large number of dirty > pages. :) > >> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi >> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951 >> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0 >> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0 >> cgroup=/user/1000.user/c2.session > > So something is underflowing dirty. > > I added the attached patch and got a warning pretty quickly, so this > looks pretty reproducible for me. > > I'm not 100% sure this is from the 4.3 merge window. I was running the > 4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be > automatically creating some cgroups, so they're definitely in play here. > > Any ideas what is going on? > >> [ 12.415472] [ cut here ] >> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 >> account_page_cleaned+0x101/0x110() >> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus >> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 >> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack >> nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp >> bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables >> dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi >> iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic >> bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core >> intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp >> ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event >> snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid >> lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd >> kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid >> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted >> 4.3.0-rc1-dirty #25 >> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) >> 04/09/2015 >> [ 12.415542] 81aa8172 8800c926ba30 8132e3e2 >> 8800c926ba78 >> [ 12.415544] 8800c926ba68 8105e386 ea000fca4700 >> 880409b96420 >> [ 12.415547] 8803ef979490 880403ff4800 >> 8800c926bac8 >> [ 12.415550] Call Trace: >> [ 12.41] [] dump_stack+0x4b/0x69 >> [ 12.415560] [] warn_slowpath_common+0x86/0xc0 >> [ 12.415563] [] warn_slowpath_fmt+0x4c/0x50 >> [ 12.415566] [] account_page_cleaned+0x101/0x110 >> [ 12.415568] [] cancel_dirty_page+0xbd/0xf0 >> [ 12.415571] [] try_to_free_buffers+0x94/0xb0 >> [ 12.415575] [] >> jbd2_journal_try_to_free_buffers+0x100/0x130 >> [ 12.415578] [] ext4_releasepage+0x52/0xa0 >> [ 12.415582] [] try_to_release_page+0x35/0x50 >> [ 12.415585] [] block_invalidatepage+0x113/0x130 >> [ 12.415587] [] ext4_invalidatepage+0x5e/0xb0 >> [ 12.415590] [] ext4_da_invalidatepage+0x40/0x310 >> [ 12.415593] [] truncate_inode_page+0x83/0x90 >> [ 12.415595] [] truncate_inode_pages_range+0x199/0x730 >> [ 12.415598] [] ? __wake_up+0x44/0x50 >> [ 12.415600] [] ? jbd2_journal_stop+0x1ba/0x3b0 >> [ 12.415603] [] ? 
ext4_unlink+0x2f4/0x330 >> [ 12.415607] [] ? __inode_wait_for_writeback+0x6d/0xc0 >> [ 12.415609] [] truncate_inode_pages_final+0x4c/0x60 >> [ 12.415612] [] ext4_evict_inode+0x116/0x4c0 >> [ 12.415615] [] evict+0xbc/0x190 >> [ 12.415617] [] iput+0x17d/0x1e0 >> [ 12.415620] [] do_unlinkat+0x1ab/0x2b0 >> [ 12.415622] [] SyS_unlink+0x16/0x20 >> [ 12.415626] [] entry_SYSCALL_64_fastpath+0x12/0x6a >> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]--- >> [ 12.418211] [ cut here ] I'm not denying the issue, bug the WARNING splat isn't necessarily catching a problem. The corresponding code comes from your debug patch: + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus"); This only checks a single cpu's counter, which can be negative. The sum of all counters is what matters. Imagine: cpu1) dirty page: inc cpu2) clean page: dec The sum is properly zero, but cpu2 is -1, which will trigger the WARN. I'll look at the code and also see if I can reproduce the failure using mem_cgroup_read_stat() for all of the new WARNs. Did you notice if the global /proc/meminfo:Dirty count also underflowed? -- To unsubscribe from this list: send th
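To make the cpu1/cpu2 example above concrete, here is a stand-alone user-space sketch (an analogy with made-up names, not kernel code) showing how an individual per-cpu slot can go negative while the summed value stays correct, and how clamping the sum at zero hides the transient race:

#include <stdio.h>

#define NR_CPUS 2

static long per_cpu_dirty[NR_CPUS];	/* models one memcg's per-cpu dirty counters */

static unsigned long total_dirty(void)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += per_cpu_dirty[cpu];
	/* clamp transient negative sums, as the mem_cgroup_read_stat() patch below does */
	return sum < 0 ? 0 : sum;
}

int main(void)
{
	per_cpu_dirty[0]++;	/* cpu1 dirties a page */
	per_cpu_dirty[1]--;	/* cpu2 cleans the same page */

	printf("cpu1 slot=%ld cpu2 slot=%ld total=%lu\n",
	       per_cpu_dirty[0], per_cpu_dirty[1], total_dirty());
	/* cpu2's slot is -1, so a single-cpu check would warn, but the total is correctly 0 */
	return 0;
}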
Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
Greg Thelen wrote: > Dave Hansen wrote: > >> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu >> 14.04.3 system. The system will run fine for a few hours, but suddenly >> start becoming horribly I/O bound. A compile of perf for instance takes >> 20-30 minutes and the compile seems entirely I/O bound. But, the SSD is >> only seeing tens or hundreds of KB/s of writes. >> >> Looking at some writeback tracepoints shows it hitting >> balance_dirty_pages() pretty hard with a pretty large number of dirty >> pages. :) >> >>> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi >>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951 >>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0 >>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0 >>> cgroup=/user/1000.user/c2.session >> >> So something is underflowing dirty. >> >> I added the attached patch and got a warning pretty quickly, so this >> looks pretty reproducible for me. >> >> I'm not 100% sure this is from the 4.3 merge window. I was running the >> 4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be >> automatically creating some cgroups, so they're definitely in play here. >> >> Any ideas what is going on? >> >>> [ 12.415472] [ cut here ] >>> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 >>> account_page_cleaned+0x101/0x110() >>> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus >>> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 >>> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 >>> xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM >>> iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables >>> ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm >>> mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel >>> snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel >>> snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi >>> hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp >>> ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event >>> snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore >>> mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper >>> cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid >>> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted >>> 4.3.0-rc1-dirty #25 >>> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) >>> 04/09/2015 >>> [ 12.415542] 81aa8172 8800c926ba30 8132e3e2 >>> 8800c926ba78 >>> [ 12.415544] 8800c926ba68 8105e386 ea000fca4700 >>> 880409b96420 >>> [ 12.415547] 8803ef979490 880403ff4800 >>> 8800c926bac8 >>> [ 12.415550] Call Trace: >>> [ 12.41] [] dump_stack+0x4b/0x69 >>> [ 12.415560] [] warn_slowpath_common+0x86/0xc0 >>> [ 12.415563] [] warn_slowpath_fmt+0x4c/0x50 >>> [ 12.415566] [] account_page_cleaned+0x101/0x110 >>> [ 12.415568] [] cancel_dirty_page+0xbd/0xf0 >>> [ 12.415571] [] try_to_free_buffers+0x94/0xb0 >>> [ 12.415575] [] >>> jbd2_journal_try_to_free_buffers+0x100/0x130 >>> [ 12.415578] [] ext4_releasepage+0x52/0xa0 >>> [ 12.415582] [] try_to_release_page+0x35/0x50 >>> [ 12.415585] [] block_invalidatepage+0x113/0x130 >>> [ 12.415587] [] ext4_invalidatepage+0x5e/0xb0 >>> [ 12.415590] [] ext4_da_invalidatepage+0x40/0x310 >>> [ 12.415593] [] truncate_inode_page+0x83/0x90 >>> [ 12.415595] [] truncate_inode_pages_range+0x199/0x730 >>> [ 12.415598] [] ? 
__wake_up+0x44/0x50 >>> [ 12.415600] [] ? jbd2_journal_stop+0x1ba/0x3b0 >>> [ 12.415603] [] ? ext4_unlink+0x2f4/0x330 >>> [ 12.415607] [] ? __inode_wait_for_writeback+0x6d/0xc0 >>> [ 12.415609] [] truncate_inode_pages_final+0x4c/0x60 >>> [ 12.415612] [] ext4_evict_inode+0x116/0x4c0 >>> [ 12.415615] [] evict+0xbc/0x190 >>> [ 12.415617] [] iput+0x17d/0x1e0 >>> [ 12.415620] [] do_unlinkat+0x1ab/0x2b0 >>> [ 12.415622] [] SyS_unlink+0x16/0x20 >>> [ 12.415626] [] entry_SYSCALL_64_fastpath+0x12/0x6a >>> [ 12.415628] ---[ end trace 6fba1ddd3d240e1
Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned
Andrew Morton wrote: > On Tue, 22 Sep 2015 15:16:32 -0700 Greg Thelen wrote: > >> mem_cgroup_read_stat() returns a page count by summing per cpu page >> counters. The summing is racy wrt. updates, so a transient negative sum >> is possible. Callers don't want negative values: >> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback. >> - oom reports and memory.stat shouldn't show confusing negative usage. >> - tree_usage() already avoids negatives. >> >> Avoid returning negative page counts from mem_cgroup_read_stat() and >> convert it to unsigned. > > Someone please remind me why this code doesn't use the existing > percpu_counter library which solved this problem years ago. > >> for_each_possible_cpu(cpu) > > and which doesn't iterate across offlined CPUs. I found [1] and [2] discussing memory layout differences between: a) existing memcg hand rolled per cpu arrays of counters vs b) array of generic percpu_counter The current approach was claimed to have lower memory overhead and better cache behavior. I assume it's pretty straightforward to create generic percpu_counter_array routines which memcg could use. Possibly something like this could be made general enough could be created to satisfy vmstat, but less clear. [1] http://www.spinics.net/lists/cgroups/msg06216.html [2] https://lkml.org/lkml/2014/9/11/1057 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
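For reference, a minimal sketch of the generic percpu_counter API being discussed, written as a hypothetical module against the ~v4.2 interfaces (the demo_* names are made up). percpu_counter_sum_positive() already clamps the racy sum at zero, and the batch threshold (the third argument of __percpu_counter_add()) controls how often the shared count is touched:

#include <linux/module.h>
#include <linux/percpu_counter.h>

static struct percpu_counter demo_dirty;

static int __init demo_init(void)
{
	int err = percpu_counter_init(&demo_dirty, 0, GFP_KERNEL);

	if (err)
		return err;

	percpu_counter_add(&demo_dirty, 1);	/* e.g. a page was dirtied */
	percpu_counter_add(&demo_dirty, -1);	/* e.g. the page was cleaned */

	/* sums all cpus and never returns a negative value */
	pr_info("dirty: %lld\n", percpu_counter_sum_positive(&demo_dirty));
	return 0;
}

static void __exit demo_exit(void)
{
	percpu_counter_destroy(&demo_dirty);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");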
[PATCH] memcg: remove pcp_counter_lock
Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online") removed the last use of the per memcg pcp_counter_lock but forgot to remove the variable. Kill the vestigial variable. Signed-off-by: Greg Thelen --- include/linux/memcontrol.h | 1 - mm/memcontrol.c| 1 - 2 files changed, 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ad800e62cb7a..6452ff4c463f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -242,7 +242,6 @@ struct mem_cgroup { * percpu counter. */ struct mem_cgroup_stat_cpu __percpu *stat; - spinlock_t pcp_counter_lock; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) struct cg_proto tcp_mem; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6ddaeba34e09..da21143550c0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4179,7 +4179,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void) if (memcg_wb_domain_init(memcg, GFP_KERNEL)) goto out_free_stat; - spin_lock_init(&memcg->pcp_counter_lock); return memcg; out_free_stat: -- 2.6.0.rc0.131.gf624c3d -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memcg: make mem_cgroup_read_stat() unsigned
Andrew Morton wrote: > On Tue, 22 Sep 2015 17:42:13 -0700 Greg Thelen wrote: > >> Andrew Morton wrote: >> >> > On Tue, 22 Sep 2015 15:16:32 -0700 Greg Thelen wrote: >> > >> >> mem_cgroup_read_stat() returns a page count by summing per cpu page >> >> counters. The summing is racy wrt. updates, so a transient negative sum >> >> is possible. Callers don't want negative values: >> >> - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback. >> >> - oom reports and memory.stat shouldn't show confusing negative usage. >> >> - tree_usage() already avoids negatives. >> >> >> >> Avoid returning negative page counts from mem_cgroup_read_stat() and >> >> convert it to unsigned. >> > >> > Someone please remind me why this code doesn't use the existing >> > percpu_counter library which solved this problem years ago. >> > >> >> for_each_possible_cpu(cpu) >> > >> > and which doesn't iterate across offlined CPUs. >> >> I found [1] and [2] discussing memory layout differences between: >> a) existing memcg hand rolled per cpu arrays of counters >> vs >> b) array of generic percpu_counter >> The current approach was claimed to have lower memory overhead and >> better cache behavior. >> >> I assume it's pretty straightforward to create generic >> percpu_counter_array routines which memcg could use. Possibly something >> like this could be made general enough could be created to satisfy >> vmstat, but less clear. >> >> [1] http://www.spinics.net/lists/cgroups/msg06216.html >> [2] https://lkml.org/lkml/2014/9/11/1057 > > That all sounds rather bogus to me. __percpu_counter_add() doesn't > modify struct percpu_counter at all except for when the cpu-local > counter overflows the configured batch size. And for the memcg > application I suspect we can set the batch size to INT_MAX... Nod. The memory usage will be a bit larger, but the code reuse is attractive. I dusted off Vladimir's https://lkml.org/lkml/2014/9/11/710. Next step is to benchmark it before posting. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] fs, seqfile: always allow oom killer
Since commit 5cec38ac866b ("fs, seq_file: fallback to vmalloc instead of oom kill processes") seq_buf_alloc() avoids calling the oom killer for PAGE_SIZE or smaller allocations; but larger allocations can use the oom killer via vmalloc(). Thus reads of small files can return ENOMEM, but larger files use the oom killer to avoid ENOMEM. Memory overcommit requires use of the oom killer to select a victim regardless of file size. Enable oom killer for small seq_buf_alloc() allocations. Signed-off-by: David Rientjes Signed-off-by: Greg Thelen --- fs/seq_file.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 225586e141ca..a8e288755f24 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -25,12 +25,17 @@ static void seq_set_overflow(struct seq_file *m) static void *seq_buf_alloc(unsigned long size) { void *buf; + gfp_t gfp = GFP_KERNEL; /* -* __GFP_NORETRY to avoid oom-killings with high-order allocations - -* it's better to fall back to vmalloc() than to kill things. +* For high order allocations, use __GFP_NORETRY to avoid oom-killing - +* it's better to fall back to vmalloc() than to kill things. For small +* allocations, just use GFP_KERNEL which will oom kill, thus no need +* for vmalloc fallback. */ - buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); + if (size > PAGE_SIZE) + gfp |= __GFP_NORETRY | __GFP_NOWARN; + buf = kmalloc(size, gfp); if (!buf && size > PAGE_SIZE) buf = vmalloc(size); return buf; -- 2.6.0.rc2.230.g3dd15c0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
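The kmalloc-with-vmalloc-fallback pattern touched by this patch was later centralized as kvmalloc()/kvfree() (v4.12+); on such a kernel a seq_buf_alloc()-style helper could shrink to the hypothetical sketch below (not what this patch does):

#include <linux/mm.h>	/* kvmalloc() */

static void *seq_buf_alloc(unsigned long size)
{
	/*
	 * kvmalloc() tries kmalloc() first and falls back to vmalloc() on
	 * failure; small allocations may still invoke the OOM killer.
	 * Callers then release the buffer with kvfree().
	 */
	return kvmalloc(size, GFP_KERNEL);
}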
Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
Dave Hansen wrote: > On 09/17/2015 11:09 PM, Greg Thelen wrote: >> I'm not denying the issue, bug the WARNING splat isn't necessarily >> catching a problem. The corresponding code comes from your debug patch: >> + >> WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > >> (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus"); >> >> This only checks a single cpu's counter, which can be negative. The sum >> of all counters is what matters. >> Imagine: >> cpu1) dirty page: inc >> cpu2) clean page: dec >> The sum is properly zero, but cpu2 is -1, which will trigger the WARN. >> >> I'll look at the code and also see if I can reproduce the failure using >> mem_cgroup_read_stat() for all of the new WARNs. > > D'oh. I'll replace those with the proper mem_cgroup_read_stat() and > test with your patch to see if anything still triggers. Thanks Dave. Here's what I think we should use to fix the issue. I tagged this for v4.2 stable given the way that unpatched performance falls apart without warning or workaround (besides deleting and recreating affected memcg). Feedback welcome. >From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Sat, 19 Sep 2015 16:21:18 -0700 Subject: [PATCH] memcg: fix dirty page migration The problem starts with a file backed dirty page which is charged to a memcg. Then page migration is used to move oldpage to newpage. Migration: - copies the oldpage's data to newpage - clears oldpage.PG_dirty - sets newpage.PG_dirty - uncharges oldpage from memcg - charges newpage to memcg Clearing oldpage.PG_dirty decrements the charged memcg's dirty page count. However, because newpage is not yet charged, setting newpage.PG_dirty does not increment the memcg's dirty page count. After migration completes newpage.PG_dirty is eventually cleared, often in account_page_cleaned(). At this time newpage is charged to a memcg so the memcg's dirty page count is decremented which causes underflow because the count was not previously incremented by migration. This underflow causes balance_dirty_pages() to see a very large unsigned number of dirty memcg pages which leads to aggressive throttling of buffered writes by processes in non root memcg. This issue: - can harm performance of non root memcg buffered writes. - can report too small (even negative) values in memory.stat[(total_)dirty] counters of all memcg, including the root. To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce page_memcg() and set_page_memcg() helpers. Test: 0) setup and enter limited memcg mkdir /sys/fs/cgroup/test echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes echo $$ > /sys/fs/cgroup/test/cgroup.procs 1) buffered writes baseline dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat 2) buffered writes with compaction antagonist to induce migration yes 1 > /proc/sys/vm/compact_memory & rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k kill % sync grep ^dirty /sys/fs/cgroup/test/memory.stat 3) buffered writes without antagonist, should match baseline rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat (speed, dirty residue) unpatched patched 1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages 2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages 3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages Notice that unpatched baseline performance (1) fell after migration (3): 841 -> 114 MB/s. 
In the patched kernel, post migration performance matches baseline. Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting") Cc: # 4.2+ Reported-by: Dave Hansen Signed-off-by: Greg Thelen --- include/linux/mm.h | 21 + mm/migrate.c | 12 +++- 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 91c08f6f0dc9..80001de019ba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum zone_type zone, #endif } +#ifdef CONFIG_MEMCG +static inline struct mem_cgroup *page_memcg(struct page *page) +{ + return page->mem_cgroup; +} + +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg) +{ + page->mem_cgroup = memcg; +} +#else +static inline struct mem_cgroup *page_memcg(struct page *page) +{ + return NULL; +} + +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg) +{ +} +#endif + /* * Some inline
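The accounting imbalance described in the commit log reduces to simple arithmetic; the following stand-alone toy model (user-space, hypothetical, not kernel code) shows the missed increment during migration and the resulting underflow:

#include <stdio.h>

static long memcg_dirty;	/* models the memcg's dirty page counter */

int main(void)
{
	int oldpage_charged = 1, newpage_charged = 0;

	memcg_dirty++;			/* oldpage dirtied while charged */

	/* migration: dirty state moves from oldpage to newpage */
	if (oldpage_charged)
		memcg_dirty--;		/* clearing oldpage.PG_dirty: decrement */
	if (newpage_charged)
		memcg_dirty++;		/* skipped: newpage not yet charged */

	newpage_charged = 1;		/* newpage charged after migration */

	if (newpage_charged)
		memcg_dirty--;		/* newpage later cleaned: decrement again */

	printf("memcg dirty count: %ld\n", memcg_dirty);	/* prints -1: underflow */
	return 0;
}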
[PATCH] memcg: make mem_cgroup_read_stat() unsigned
mem_cgroup_read_stat() returns a page count by summing per cpu page counters. The summing is racy wrt. updates, so a transient negative sum is possible. Callers don't want negative values: - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback. - oom reports and memory.stat shouldn't show confusing negative usage. - tree_usage() already avoids negatives. Avoid returning negative page counts from mem_cgroup_read_stat() and convert it to unsigned. Signed-off-by: Greg Thelen --- mm/memcontrol.c | 30 ++ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6ddaeba34e09..2633e9be4a99 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -644,12 +644,14 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) } /* + * Return page count for single (non recursive) @memcg. + * * Implementation Note: reading percpu statistics for memcg. * * Both of vmstat[] and percpu_counter has threshold and do periodic * synchronization to implement "quick" read. There are trade-off between * reading cost and precision of value. Then, we may have a chance to implement - * a periodic synchronizion of counter in memcg's counter. + * a periodic synchronization of counter in memcg's counter. * * But this _read() function is used for user interface now. The user accounts * memory usage by memory cgroup and he _always_ requires exact value because @@ -659,17 +661,24 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz) * * If there are kernel internal actions which can make use of some not-exact * value, and reading all cpu value can be performance bottleneck in some - * common workload, threashold and synchonization as vmstat[] should be + * common workload, threashold and synchronization as vmstat[] should be * implemented. */ -static long mem_cgroup_read_stat(struct mem_cgroup *memcg, -enum mem_cgroup_stat_index idx) +static unsigned long +mem_cgroup_read_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx) { long val = 0; int cpu; + /* Per-cpu values can be negative, use a signed accumulator */ for_each_possible_cpu(cpu) val += per_cpu(memcg->stat->count[idx], cpu); + /* +* Summing races with updates, so val may be negative. Avoid exposing +* transient negative values. +*/ + if (val < 0) + val = 0; return val; } @@ -1254,7 +1263,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) continue; - pr_cont(" %s:%ldKB", mem_cgroup_stat_names[i], + pr_cont(" %s:%luKB", mem_cgroup_stat_names[i], K(mem_cgroup_read_stat(iter, i))); } @@ -2819,14 +2828,11 @@ static unsigned long tree_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx) { struct mem_cgroup *iter; - long val = 0; + unsigned long val = 0; - /* Per-cpu values can be negative, use a signed accumulator */ for_each_mem_cgroup_tree(iter, memcg) val += mem_cgroup_read_stat(iter, idx); - if (val < 0) /* race ? 
*/ - val = 0; return val; } @@ -3169,7 +3175,7 @@ static int memcg_stat_show(struct seq_file *m, void *v) for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) continue; - seq_printf(m, "%s %ld\n", mem_cgroup_stat_names[i], + seq_printf(m, "%s %lu\n", mem_cgroup_stat_names[i], mem_cgroup_read_stat(memcg, i) * PAGE_SIZE); } @@ -3194,13 +3200,13 @@ static int memcg_stat_show(struct seq_file *m, void *v) (u64)memsw * PAGE_SIZE); for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { - long long val = 0; + unsigned long long val = 0; if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) continue; for_each_mem_cgroup_tree(mi, memcg) val += mem_cgroup_read_stat(mi, i) * PAGE_SIZE; - seq_printf(m, "total_%s %lld\n", mem_cgroup_stat_names[i], val); + seq_printf(m, "total_%s %llu\n", mem_cgroup_stat_names[i], val); } for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { -- 2.6.0.rc0.131.gf624c3d -- To unsubscribe from this list: send the line "unsubscribe linux-kernel&qu
[PATCH] powerpc/powernv/pci: add ifdef to avoid dead code
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration") removed a couple pnv_ioda_setup_bus_dma() calls. The only remaining calls are behind CONFIG_IOMMU_API. Thus builds without CONFIG_IOMMU_API see: arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 'pnv_ioda_setup_bus_dma' defined but not used Add CONFIG_IOMMU_API ifdef guard to avoid dead code. Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration") Signed-off-by: Greg Thelen --- arch/powerpc/platforms/powernv/pci-ioda.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 73a63efcf855..f7762052b7c4 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1885,6 +1885,7 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; } +#ifdef CONFIG_IOMMU_API static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) { struct pci_dev *dev; @@ -1897,6 +1898,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) pnv_ioda_setup_bus_dma(pe, dev->subordinate); } } +#endif static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb, bool real_mode) -- 2.27.0.290.gba653c62da-goog
[PATCH] e1000e: add ifdef to avoid dead code
Commit e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems") added e1000e_check_me() but it's only called from CONFIG_PM_SLEEP protected code. Thus builds without CONFIG_PM_SLEEP see: drivers/net/ethernet/intel/e1000e/netdev.c:137:13: warning: 'e1000e_check_me' defined but not used [-Wunused-function] Add CONFIG_PM_SLEEP ifdef guard to avoid dead code. Fixes: e086ba2fccda ("e1000e: disable s0ix entry and exit flows for ME systems") Signed-off-by: Greg Thelen --- drivers/net/ethernet/intel/e1000e/netdev.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index a279f4fa9962..165f0aea22c9 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -107,6 +107,7 @@ static const struct e1000_reg_info e1000_reg_info_tbl[] = { {0, NULL} }; +#ifdef CONFIG_PM_SLEEP struct e1000e_me_supported { u16 device_id; /* supported device ID */ }; @@ -145,6 +146,7 @@ static bool e1000e_check_me(u16 device_id) return false; } +#endif /* CONFIG_PM_SLEEP */ /** * __ew32_prepare - prepare to write to MAC CSR register on certain parts -- 2.27.0.290.gba653c62da-goog
[PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code
Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration") removed a couple pnv_ioda_setup_bus_dma() calls. The only remaining calls are behind CONFIG_IOMMU_API. Thus builds without CONFIG_IOMMU_API see: arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 'pnv_ioda_setup_bus_dma' defined but not used Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code. Fixes: dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration") Signed-off-by: Greg Thelen --- arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++ 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 73a63efcf855..743d840712da 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev, return false; } -static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) -{ - struct pci_dev *dev; - - list_for_each_entry(dev, &bus->devices, bus_list) { - set_iommu_table_base(&dev->dev, pe->table_group.tables[0]); - dev->dev.archdata.dma_offset = pe->tce_bypass_base; - - if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) - pnv_ioda_setup_bus_dma(pe, dev->subordinate); - } -} - static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb, bool real_mode) { @@ -2501,6 +2488,19 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group, #endif #ifdef CONFIG_IOMMU_API +static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + set_iommu_table_base(&dev->dev, pe->table_group.tables[0]); + dev->dev.archdata.dma_offset = pe->tce_bypass_base; + + if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) + pnv_ioda_setup_bus_dma(pe, dev->subordinate); + } +} + unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, __u64 window_size, __u32 levels) { -- 2.27.0.290.gba653c62da-goog
Re: [PATCH] shmem, memcg: enable memcg aware shrinker
Yang Shi wrote: > On Sun, May 31, 2020 at 8:22 PM Greg Thelen wrote: >> >> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged >> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only >> called when the per-memcg per-node shrinker_map indicates that the >> shrinker may have objects to release to the memcg and node. >> >> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs >> shrinker which advertises per memcg and numa awareness. The shmem >> shrinker releases memory by splitting hugepages that extend beyond >> i_size. >> >> Shmem does not currently set bits in shrinker_map. So, starting with >> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under >> pressure. This leads to undeserved memcg OOM kills. >> Example that reliably sees memcg OOM kill in unpatched kernel: >> FS=/tmp/fs >> CONTAINER=/cgroup/memory/tmpfs_shrinker >> mkdir -p $FS >> mount -t tmpfs -o huge=always nodev $FS >> # Create 1000 MB container, which shouldn't suffer OOM. >> mkdir $CONTAINER >> echo 1000M > $CONTAINER/memory.limit_in_bytes >> echo $BASHPID >> $CONTAINER/cgroup.procs >> # Create 4000 files. Ideally each file uses 4k data page + a little >> # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily >> # fit within container's 1000 MB. But if data pages use 2MB >> # hugepages (due to aggressive huge=always) then files consume 8GB, >> # which hits memcg 1000 MB limit. >> for i in {1..4000}; do >> echo . > $FS/$i >> done > > It looks all the inodes which have tail THP beyond i_size are on one > single list, then the shrinker actually just splits the first > nr_to_scan inodes. But since the list is not memcg aware, so it seems > it may split the THPs which are not charged to the victim memcg and > the victim memcg still may suffer from pre-mature oom, right? Correct. shmem_unused_huge_shrink() is not memcg aware. In response to memcg pressure it will split the post-i_size tails of nr_to_scan tmpfs inodes regardless of if they're charged to the under-pressure memcg. do_shrink_slab() looks like it'll repeatedly call shmem_unused_huge_shrink(). So it will split tails of many inodes. So I think it'll avoid the oom by over shrinking. This is not ideal. But it seems better than undeserved oom kill. I think the solution (as Kirill Tkhai suggested) a memcg-aware index would solve both: 1) avoid premature oom by registering shrinker to responding to memcg pressure 2) avoid shrinking/splitting inodes unrelated to the under-pressure memcg I can certainly look into that (thanks Kirill for the pointers). In the short term I'm still interested in avoiding premature OOMs with the original thread (i.e. restore pre-4.19 behavior to shmem shrinker for memcg pressure). I plan to test and repost v2.
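Kirill's memcg-aware index suggestion could, for instance, be built on list_lru, which already keeps one list per memcg per node and sets the per-memcg shrinker bit when an object is added. A rough sketch of such a registration against ~v5.8 APIs follows (hypothetical demo_* names, not the existing shmem code):

#include <linux/list_lru.h>
#include <linux/shrinker.h>
#include <linux/spinlock.h>

static struct list_lru demo_lru;

static enum lru_status demo_isolate(struct list_head *item,
				    struct list_lru_one *list,
				    spinlock_t *lock, void *arg)
{
	struct list_head *dispose = arg;

	/* move the object to a private list; it is split/freed after the walk */
	list_lru_isolate_move(list, item, dispose);
	return LRU_REMOVED;
}

static unsigned long demo_count(struct shrinker *shrink,
				struct shrink_control *sc)
{
	/* only counts objects charged to sc->memcg on sc->nid */
	return list_lru_shrink_count(&demo_lru, sc);
}

static unsigned long demo_scan(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	LIST_HEAD(dispose);
	unsigned long freed;

	freed = list_lru_shrink_walk(&demo_lru, sc, demo_isolate, &dispose);
	/* objects on @dispose would be split/freed here, outside the lru lock */
	return freed;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count,
	.scan_objects	= demo_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
};

static int demo_shrinker_setup(void)
{
	int err;

	/* allocate the shrinker id first so the list_lru can record it */
	err = prealloc_shrinker(&demo_shrinker);
	if (err)
		return err;
	err = list_lru_init_memcg(&demo_lru, &demo_shrinker);
	if (err) {
		free_prealloced_shrinker(&demo_shrinker);
		return err;
	}
	register_shrinker_prepared(&demo_shrinker);
	return 0;
}

Objects would then be published to the shrinker with list_lru_add(), which charges the item to its owning memcg and flags that memcg's shrinker bit, so do_shrink_slab() only walks inodes relevant to the memcg under pressure.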
Re: [PATCH v18 11/14] Documentation: Add documents for DAMON
SeongJae Park wrote: > From: SeongJae Park > > This commit adds documents for DAMON under > `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`. > > Signed-off-by: SeongJae Park > --- > Documentation/admin-guide/mm/damon/guide.rst | 157 ++ > Documentation/admin-guide/mm/damon/index.rst | 15 + > Documentation/admin-guide/mm/damon/plans.rst | 29 ++ > Documentation/admin-guide/mm/damon/start.rst | 98 ++ > Documentation/admin-guide/mm/damon/usage.rst | 298 +++ > Documentation/admin-guide/mm/index.rst | 1 + > Documentation/vm/damon/api.rst | 20 ++ > Documentation/vm/damon/eval.rst | 222 ++ > Documentation/vm/damon/faq.rst | 59 > Documentation/vm/damon/index.rst | 32 ++ > Documentation/vm/damon/mechanisms.rst| 165 ++ > Documentation/vm/index.rst | 1 + > 12 files changed, 1097 insertions(+) > create mode 100644 Documentation/admin-guide/mm/damon/guide.rst > create mode 100644 Documentation/admin-guide/mm/damon/index.rst > create mode 100644 Documentation/admin-guide/mm/damon/plans.rst > create mode 100644 Documentation/admin-guide/mm/damon/start.rst > create mode 100644 Documentation/admin-guide/mm/damon/usage.rst > create mode 100644 Documentation/vm/damon/api.rst > create mode 100644 Documentation/vm/damon/eval.rst > create mode 100644 Documentation/vm/damon/faq.rst > create mode 100644 Documentation/vm/damon/index.rst > create mode 100644 Documentation/vm/damon/mechanisms.rst > > diff --git a/Documentation/admin-guide/mm/damon/guide.rst > b/Documentation/admin-guide/mm/damon/guide.rst > new file mode 100644 > index ..c51fb843efaa > --- /dev/null > +++ b/Documentation/admin-guide/mm/damon/guide.rst > @@ -0,0 +1,157 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +== > +Optimization Guide > +== > + > +This document helps you estimating the amount of benefit that you could get > +from DAMON-based optimizations, and describes how you could achieve it. You > +are assumed to already read :doc:`start`. > + > + > +Check The Signs > +=== > + > +No optimization can provide same extent of benefit to every case. Therefore > +you should first guess how much improvements you could get using DAMON. If > +some of below conditions match your situation, you could consider using > DAMON. > + > +- *Low IPC and High Cache Miss Ratios.* Low IPC means most of the CPU time > is > + spent waiting for the completion of time-consuming operations such as > memory > + access, while high cache miss ratios mean the caches don't help it well. > + DAMON is not for cache level optimization, but DRAM level. However, > + improving DRAM management will also help this case by reducing the memory > + operation latency. > +- *Memory Over-commitment and Unknown Users.* If you are doing memory > + overcommitment and you cannot control every user of your system, a memory > + bank run could happen at any time. You can estimate when it will happen > + based on DAMON's monitoring results and act earlier to avoid or deal better > + with the crisis. > +- *Frequent Memory Pressure.* Frequent memory pressure means your system has > + wrong configurations or memory hogs. DAMON will help you find the right > + configuration and/or the criminals. > +- *Heterogeneous Memory System.* If your system is utilizing memory devices > + that placed between DRAM and traditional hard disks, such as non-volatile > + memory or fast SSDs, DAMON could help you utilizing the devices more > + efficiently. > + > + > +Profile > +=== > + > +If you found some positive signals, you could start by profiling your > workloads > +using DAMON. 
Find major workloads on your systems and analyze their data > +access pattern to find something wrong or can be improved. The DAMON user > +space tool (``damo``) will be useful for this. > + > +We recommend you to start from working set size distribution check using > ``damo > +report wss``. If the distribution is ununiform or quite different from what > +you estimated, you could consider `Memory Configuration`_ optimization. > + > +Then, review the overall access pattern in heatmap form using ``damo report > +heats``. If it shows a simple pattern consists of a small number of memory > +regions having high contrast of access temperature, you could consider manual > +`Program Modification`_. > + > +If you still want to absorb more benefits, you should develop `Personalized > +DAMON Application`_ for your special case. > + > +You don't need to take only one approach among the above plans, but you could > +use multiple of the above approaches to maximize the benefit. > + > + > +Optimize > + > + > +If the profiling result also says it's worth trying some optimization, you > +could consider below approaches. Note that some of the below approaches > assume > +that y
Re: [PATCH v18 06/14] mm/damon: Implement callbacks for the virtual memory address spaces
SeongJae Park wrote: > From: SeongJae Park > > This commit introduces a reference implementation of the address space > specific low level primitives for the virtual address space, so that > users of DAMON can easily monitor the data accesses on virtual address > spaces of specific processes by simply configuring the implementation to > be used by DAMON. > > The low level primitives for the fundamental access monitoring are > defined in two parts: > 1. Identification of the monitoring target address range for the address > space. > 2. Access check of specific address range in the target space. > > The reference implementation for the virtual address space provided by > this commit is designed as below. > > PTE Accessed-bit Based Access Check > --- > > The implementation uses PTE Accessed-bit for basic access checks. That > is, it clears the bit for next sampling target page and checks whether > it set again after one sampling period. To avoid disturbing other > Accessed bit users such as the reclamation logic, the implementation > adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, as same to the > 'Idle Page Tracking'. > > VMA-based Target Address Range Construction > --- > > Only small parts in the super-huge virtual address space of the > processes are mapped to physical memory and accessed. Thus, tracking > the unmapped address regions is just wasteful. However, because DAMON > can deal with some level of noise using the adaptive regions adjustment > mechanism, tracking every mapping is not strictly required but could > even incur a high overhead in some cases. That said, too huge unmapped > areas inside the monitoring target should be removed to not take the > time for the adaptive mechanism. > > For the reason, this implementation converts the complex mappings to > three distinct regions that cover every mapped area of the address > space. Also, the two gaps between the three regions are the two biggest > unmapped areas in the given address space. The two biggest unmapped > areas would be the gap between the heap and the uppermost mmap()-ed > region, and the gap between the lowermost mmap()-ed region and the stack > in most of the cases. Because these gaps are exceptionally huge in > usual address spacees, excluding these will be sufficient to make a > reasonable trade-off. Below shows this in detail:: > > > > > (small mmap()-ed regions and munmap()-ed regions) > > > > > Signed-off-by: SeongJae Park > Reviewed-by: Leonard Foerster > --- > include/linux/damon.h | 6 + > mm/damon.c| 474 ++ > 2 files changed, 480 insertions(+) > > diff --git a/include/linux/damon.h b/include/linux/damon.h > index 3c0b92a679e8..310d36d123b3 100644 > --- a/include/linux/damon.h > +++ b/include/linux/damon.h > @@ -144,6 +144,12 @@ struct damon_ctx { > void (*aggregate_cb)(struct damon_ctx *context); > }; > > +/* Reference callback implementations for virtual memory */ > +void kdamond_init_vm_regions(struct damon_ctx *ctx); > +void kdamond_update_vm_regions(struct damon_ctx *ctx); > +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx); > +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx); > + > int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids); > int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, > unsigned long aggr_int, unsigned long regions_update_int, > diff --git a/mm/damon.c b/mm/damon.c > index b844924b9fdb..386780739007 100644 > --- a/mm/damon.c > +++ b/mm/damon.c > @@ -9,6 +9,9 @@ > * This file is constructed in below parts. 
> * > * - Functions and macros for DAMON data structures > + * - Functions for the initial monitoring target regions construction > + * - Functions for the dynamic monitoring target regions update > + * - Functions for the access checking of the regions > * - Functions for DAMON core logics and features > * - Functions for the DAMON programming interface > * - Functions for the module loading/unloading > @@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct > damon_ctx *ctx) > return sz; > } > > +/* > + * Get the mm_struct of the given task > + * > + * Caller _must_ put the mm_struct after use, unless it is NULL. > + * > + * Returns the mm_struct of the task on success, NULL on failure > + */ > +static struct mm_struct *damon_get_mm(struct damon_task *t) > +{ > + struct task_struct *task; > + struct mm_struct *mm; > + > + task = damon_get_task_struct(t); > + if (!task) > + return NULL; > + > + mm = get_task_mm(task); > + put_task_struct(task); > + return mm; > +} > + > +/* > + * Functions for the initial monitoring target regions construction > + */ > + > +/* > + * Size-evenly split a region into 'nr_pieces' small regions > + * > + * Returns 0 on success, o
Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code
Oliver O'Halloran wrote: > On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen wrote: >> >> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE >> configuration") removed a couple pnv_ioda_setup_bus_dma() calls. The >> only remaining calls are behind CONFIG_IOMMU_API. Thus builds without >> CONFIG_IOMMU_API see: >> arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: >> 'pnv_ioda_setup_bus_dma' defined but not used >> >> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code. > > Doh! Thanks for the fix. > > Reviewed-by: Oliver O'Halloran Is there anything else needed from me on this patch? Given that it fixes a 5.8 commit I figured it'd be 5.8 material.
[PATCH] selftests: more general make nesting support
selftests can be built from the toplevel kernel makefile (e.g. make
kselftest-all) or directly (make -C tools/testing/selftests all).

The toplevel kernel makefile explicitly disables implicit rules with
"MAKEFLAGS += -rR", which is passed to tools/testing/selftests.  Some
selftest makefiles require implicit make rules, which is why commit
67d8712dcc70 ("selftests: Fix build failures when invoked from kselftest
target") reenables implicit rules by clearing MAKEFLAGS if MAKELEVEL=1.

So far so good.  However, if the toplevel makefile is called from an outer
makefile then MAKELEVEL will be elevated, which breaks the MAKELEVEL
equality test.  Example wrapped makefile error:

  $ cat ~/Makefile
  all:
  	$(MAKE) defconfig
  	$(MAKE) kselftest-all
  $ make -sf ~/Makefile
  futex_wait_timeout.c /src/tools/testing/selftests/kselftest_harness.h /src/tools/testing/selftests/kselftest.h ../include/futextest.h ../include/atomic.h ../include/logging.h -lpthread -lrt -o /src/tools/testing/selftests/futex/functional/futex_wait_timeout
  make[4]: futex_wait_timeout.c: Command not found

Rather than checking $(MAKELEVEL), check for $(LINK.c), which is a more
direct side effect of "make -R".  This enables arbitrary makefile nesting.

Signed-off-by: Greg Thelen
---
 tools/testing/selftests/Makefile | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..289a2e4b3f6f 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -84,10 +84,10 @@ endif
 # of the targets gets built.
 FORCE_TARGETS ?=
 
-# Clear LDFLAGS and MAKEFLAGS if called from main
-# Makefile to avoid test build failures when test
-# Makefile doesn't have explicit build rules.
-ifeq (1,$(MAKELEVEL))
+# Clear LDFLAGS and MAKEFLAGS when implicit rules are missing.  This provides
+# implicit rules to sub-test Makefiles which avoids build failures in test
+# Makefiles that don't have explicit build rules.
+ifeq (,$(LINK.c))
 override LDFLAGS =
 override MAKEFLAGS =
 endif
-- 
2.28.0.rc0.142.g3c755180ce-goog
Re: [RFC][PATCH 2/8] mm/migrate: Defer allocating new page until needed
Dave Hansen wrote:
> From: Keith Busch
>
> Migrating pages had been allocating the new page before it was actually
> needed. Subsequent operations may still fail, which would have to handle
> cleaning up the newly allocated page when it was never used.
>
> Defer allocating the page until we are actually ready to make use of
> it, after locking the original page. This simplifies error handling,
> but should not have any functional change in behavior. This is just
> refactoring page migration so the main part can more easily be reused
> by other code.

Is there any concern that the src page is now held PG_locked over the dst
page allocation, which might wander into reclaim/cond_resched/oom_kill?
I don't have a deadlock in mind.  I'm just wondering about the additional
latency imposed on unrelated threads that want to access the src page.

> #Signed-off-by: Keith Busch

Is the commented-out Signed-off-by intentional?  The same applies to later
patches.

> Signed-off-by: Dave Hansen
> Cc: Keith Busch
> Cc: Yang Shi
> Cc: David Rientjes
> Cc: Huang Ying
> Cc: Dan Williams
> ---
>
>  b/mm/migrate.c | 148 -
>  1 file changed, 75 insertions(+), 73 deletions(-)
>
> diff -puN mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed mm/migrate.c
> --- a/mm/migrate.c~0007-mm-migrate-Defer-allocating-new-page-until-needed	2020-06-29 16:34:37.896312607 -0700
> +++ b/mm/migrate.c	2020-06-29 16:34:37.900312607 -0700
> @@ -1014,56 +1014,17 @@ out:
>  	return rc;
>  }
> 
> -static int __unmap_and_move(struct page *page, struct page *newpage,
> -				int force, enum migrate_mode mode)
> +static int __unmap_and_move(new_page_t get_new_page,
> +			    free_page_t put_new_page,
> +			    unsigned long private, struct page *page,
> +			    enum migrate_mode mode,
> +			    enum migrate_reason reason)
>  {
>  	int rc = -EAGAIN;
>  	int page_was_mapped = 0;
>  	struct anon_vma *anon_vma = NULL;
>  	bool is_lru = !__PageMovable(page);
> -
> -	if (!trylock_page(page)) {
> -		if (!force || mode == MIGRATE_ASYNC)
> -			goto out;
> -
> -		/*
> -		 * It's not safe for direct compaction to call lock_page.
> -		 * For example, during page readahead pages are added locked
> -		 * to the LRU. Later, when the IO completes the pages are
> -		 * marked uptodate and unlocked. However, the queueing
> -		 * could be merging multiple pages for one bio (e.g.
> -		 * mpage_readpages). If an allocation happens for the
> -		 * second or third page, the process can end up locking
> -		 * the same page twice and deadlocking. Rather than
> -		 * trying to be clever about what pages can be locked,
> -		 * avoid the use of lock_page for direct compaction
> -		 * altogether.
> -		 */
> -		if (current->flags & PF_MEMALLOC)
> -			goto out;
> -
> -		lock_page(page);
> -	}
> -
> -	if (PageWriteback(page)) {
> -		/*
> -		 * Only in the case of a full synchronous migration is it
> -		 * necessary to wait for PageWriteback. In the async case,
> -		 * the retry loop is too short and in the sync-light case,
> -		 * the overhead of stalling is too much
> -		 */
> -		switch (mode) {
> -		case MIGRATE_SYNC:
> -		case MIGRATE_SYNC_NO_COPY:
> -			break;
> -		default:
> -			rc = -EBUSY;
> -			goto out_unlock;
> -		}
> -		if (!force)
> -			goto out_unlock;
> -		wait_on_page_writeback(page);
> -	}
> +	struct page *newpage;
> 
>  	/*
>  	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
> @@ -1082,6 +1043,12 @@ static int __unmap_and_move(struct page
>  	if (PageAnon(page) && !PageKsm(page))
>  		anon_vma = page_get_anon_vma(page);
> 
> +	newpage = get_new_page(page, private);
> +	if (!newpage) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
>  	/*
>  	 * Block others from accessing the new page when we get around to
>  	 * establishing additional references. We are usually the only one
> @@ -1091,11 +1058,11 @@ static int __unmap_and_move(struct page
>  	 * This is much like races on refcount of oldpage: just don't BUG().
>  	 */
>  	if (unlikely(!trylock_page(newpage)))
> -		goto out_unlock;
> +		goto out_put;
> 
>  	if (unlikely(!is_lru)) {
>  		rc = move_to_new_page(newpage, page, mode);
> -		goto out_unlock_both;
> +		goto out_unlock;
>  	}
> 
>  	/*
> @@ -1114,7 +1081,7 @
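A toy illustration of the ordering under discussion, not mm/migrate.c: the
destination is allocated only after the source lock is taken, so a failed
trylock never wastes an allocation, at the cost of holding the source lock
across the allocation (the latency question raised above).  All types and
helpers here (struct buf, move_buf) are invented for the sketch.

#include <pthread.h>
#include <stdlib.h>

struct buf {
	pthread_mutex_t lock;
	void *data;
};

static int move_buf(struct buf *src, size_t size)
{
	void *dst;
	int rc = -1;

	if (pthread_mutex_trylock(&src->lock))
		return -1;		/* busy: no allocation was wasted */

	dst = malloc(size);		/* allocated while src->lock is held */
	if (!dst)
		goto out_unlock;

	/* ... copy src->data into dst, then retire the old buffer ... */
	free(src->data);
	src->data = dst;
	rc = 0;

out_unlock:
	pthread_mutex_unlock(&src->lock);
	return rc;
}

In the pre-patch structure the malloc() would sit before the trylock, so an
allocation could be thrown away whenever the lock was contended, but no lock
would ever be held across the allocator.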