Re: general protection fault in oom_unkillable_task

2019-06-17 Thread Shakeel Butt
On Mon, Jun 17, 2019 at 6:45 PM Andrew Morton  wrote:
>
> On Mon, 17 Jun 2019 06:23:07 -0700 Shakeel Butt  wrote:
>
> > > Here is a patch to use CSS_TASK_ITER_PROCS.
> > >
> > > From 415e52cf55bc4ad931e4f005421b827f0b02693d Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa 
> > > Date: Mon, 17 Jun 2019 00:09:38 +0900
> > > Subject: [PATCH] mm: memcontrol: Use CSS_TASK_ITER_PROCS at 
> > > mem_cgroup_scan_tasks().
> > >
> > > Since commit c03cd7738a83b137 ("cgroup: Include dying leaders with live
> > > threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
> > > mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
> > > only one thread from each thread group.
> > >
> > > Signed-off-by: Tetsuo Handa 
> >
> > Reviewed-by: Shakeel Butt 
> >
> > Why not add the reproducer in the commit message?
>
> That would be nice.
>
> More nice would be, as always, a description of the user-visible impact
> of the patch.
>

This is just a cleanup and optimization: instead of traversing all the
threads in a memcg, we traverse only one thread from each thread group
in the memcg. There is no user-visible impact.
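For reference, a rough sketch of what mem_cgroup_scan_tasks() looks like
with CSS_TASK_ITER_PROCS (paraphrased from mm/memcontrol.c after the
patch; details may differ):

int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
                          int (*fn)(struct task_struct *, void *), void *arg)
{
        struct mem_cgroup *iter;
        int ret = 0;

        for_each_mem_cgroup_tree(iter, memcg) {
                struct css_task_iter it;
                struct task_struct *task;

                /* PROCS: visit one task per thread group, not every thread. */
                css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
                while (!ret && (task = css_task_iter_next(&it)))
                        ret = fn(task, arg);
                css_task_iter_end(&it);
                if (ret) {
                        mem_cgroup_iter_break(memcg, iter);
                        break;
                }
        }
        return ret;
}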

> As I understand it, it's just a bit of a cleanup against current
> mainline but without this patch in place, Shakeel's "mm, oom: refactor
> dump_tasks for memcg OOMs" will cause kernel crashes.  Correct?

No, the patch "mm, oom: refactor dump_tasks for memcg OOMs" makes
dump_tasks() not depend on the memcg check within
oom_unkillable_task().

"mm, oom: fix oom_unkillable_task for memcg OOMs" is the actual fix,
which makes oom_unkillable_task() correctly handle the memcg OOM
code paths.

Shakeel


[PATCH] mm, vmscan: prevent useless kswapd loops

2019-06-27 Thread Shakeel Butt
In production we have noticed hard lockups on large machines running
large jobs, due to kswapd hoarding the LRU lock within isolate_lru_pages()
when sc->reclaim_idx is 0, which is a small zone. The LRU was a couple
hundred GiB and the condition (page_zonenum(page) > sc->reclaim_idx) in
isolate_lru_pages() was basically skipping GiBs of pages while holding the
LRU spinlock with interrupts disabled.

On further inspection, it seems like there are two issues:

1) If kswapd, on the return from balance_pgdat(), could not sleep
(maybe all zones are still unbalanced), the classzone_idx is unintentionally
set to 0, and the whole reclaim cycle of kswapd will try to reclaim
only the lowest and smallest zone while traversing the whole memory.

2) Fundamentally isolate_lru_pages() is really bad when the allocation
has woken kswapd for a smaller zone on a very large machine running very
large jobs. It can hoard the LRU spinlock while skipping over 100s of
GiBs of pages.

This patch only fixes (1). Issue (2) needs a more fundamental solution.
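For context, a condensed sketch of the skip path in isolate_lru_pages()
that does the hoarding (paraphrased from mm/vmscan.c; variable names come
from the surrounding function and the exact code differs):

        /* Inside isolate_lru_pages(), with the LRU spinlock held and IRQs off. */
        while (scan < nr_to_scan && !list_empty(src)) {
                struct page *page = lru_to_page(src);

                /*
                 * Pages from zones above sc->reclaim_idx are only moved to a
                 * local skip list; with reclaim_idx == 0 and a huge LRU this
                 * can walk 100s of GiBs without ever dropping the lock.
                 */
                if (page_zonenum(page) > sc->reclaim_idx) {
                        list_move(&page->lru, &pages_skipped);
                        continue;
                }
                /* ... otherwise try to isolate the page ... */
        }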

Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely
due to mismatched classzone_idx")
Signed-off-by: Shakeel Butt 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9e3292ee5c7c..786dacfdfe29 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3908,7 +3908,7 @@ static int kswapd(void *p)
 
/* Read the new order and classzone_idx */
alloc_order = reclaim_order = pgdat->kswapd_order;
-   classzone_idx = kswapd_classzone_idx(pgdat, 0);
+   classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 
-- 
2.22.0.410.gd8fdbe21b5-goog



Re: [PATCH v3 2/3] mm, oom: remove redundant task_in_mem_cgroup() check

2019-06-27 Thread Shakeel Butt
On Tue, Jun 25, 2019 at 11:38 PM Michal Hocko  wrote:
>
> On Mon 24-06-19 14:26:30, Shakeel Butt wrote:
> > oom_unkillable_task() can be called from three different contexts i.e.
> > global OOM, memcg OOM and oom_score procfs interface. At the moment
> > oom_unkillable_task() does a task_in_mem_cgroup() check on the given
> > process. Since there is no reason to perform task_in_mem_cgroup()
> > check for global OOM and oom_score procfs interface, those contexts
> > provide NULL memcg and skips the task_in_mem_cgroup() check. However for
> > memcg OOM context, the oom_unkillable_task() is always called from
> > mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
> > redundant. So, just remove the task_in_mem_cgroup() check altogether.
>
> Just a nit. Not only it is redundant but it is effectively a dead code
> after your previous patch.
>

I will update the commit message.

> > Signed-off-by: Shakeel Butt 
> > Signed-off-by: Tetsuo Handa 
>
> Acked-by: Michal Hocko 

Thanks


Re: [PATCH v3 3/3] oom: decouple mems_allowed from oom_unkillable_task

2019-06-27 Thread Shakeel Butt
On Tue, Jun 25, 2019 at 11:55 PM Michal Hocko  wrote:
>
> On Mon 24-06-19 14:26:31, Shakeel Butt wrote:
> > The commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to
> > mem_exclusive cpuset") introduces a heuristic where a potential
> > oom-killer victim is skipped if the intersection of the potential victim
> > and the current (the process that triggered the oom) is empty based on the
> > reason that killing such victim most probably will not help the current
> > allocating process. However the commit 7887a3da753e ("[PATCH] oom:
> > cpuset hint") changed the heuristic to just decrease the oom_badness
> > scores of such potential victim based on the reason that the cpuset of
> > such processes might have changed and previously they might have
> > allocated memory on mems where the current allocating process can
> > allocate from.
> >
> > Unintentionally commit 7887a3da753e ("[PATCH] oom: cpuset hint")
> > introduced a side effect as the oom_badness is also exposed to the
> > user space through /proc/[pid]/oom_score, so, readers with different
> > cpusets can read different oom_score of the same process.
> >
> > Later the commit 6cf86ac6f36b ("oom: filter tasks not sharing the same
> > cpuset") fixed the side effect introduced by 7887a3da753e by moving the
> > cpuset intersection back to only oom-killer context and out of
> > oom_badness. However the combination of the commit ab290adbaf8f ("oom:
> > make oom_unkillable_task() helper function") and commit 26ebc984913b
> > ("oom: /proc//oom_score treat kernel thread honestly")
> > unintentionally brought back the cpuset intersection check into the
> > oom_badness calculation function.
>
> Thanks for this excursion into the history. I think it is very useful.
>
> > Other than doing cpuset/mempolicy intersection from oom_badness, the
> > memcg oom context is also doing cpuset/mempolicy intersection which is
> > quite wrong and is caught by syzcaller with the following report:
> >
> > kasan: CONFIG_KASAN_INLINE enabled
> > kasan: GPF could be caused by NULL-ptr deref or user memory access
> > general protection fault:  [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
> > RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
> > RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
> > RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
> > Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
> > 00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
> > 85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
> > RSP: 0018:888000127490 EFLAGS: 00010a03
> > RAX: dc00 RBX: 8880a4cd5438 RCX: 818dae9c
> > RDX: 1c3cc602 RSI: 818dac8d RDI: 0001
> > RBP: 8880001274d0 R08: 88886180 R09: ed1015d26be0
> > R10: ed1015d26bdf R11: 8880ae935efb R12: 800061e63007
> > R13:  R14: 800061e63017 R15: 11124ea6
> > FS:  561f5940() GS:8880ae80() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 00607304 CR3: 9237e000 CR4: 001426f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0600
> > Call Trace:
> >   oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
> >   mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
> >   select_bad_process mm/oom_kill.c:374 [inline]
> >   out_of_memory mm/oom_kill.c:1088 [inline]
> >   out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
> >   mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
> >   mem_cgroup_oom mm/memcontrol.c:1905 [inline]
> >   try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
> >   mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
> >   mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
> >   do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
> >   do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
> >   wp_huge_pmd mm/memory.c:3793 [inline]
> >   __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
> >   handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
> >   do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
> >   __do_page_fault+0x5ef/0xda0 arch/x86/mm/fa

[PATCH v4 1/3] mm, oom: refactor dump_tasks for memcg OOMs

2019-06-28 Thread Shakeel Butt
dump_tasks() traverses all the existing processes even for the memcg OOM
context which is not only unnecessary but also wasteful.  This imposes a
long RCU critical section even from a contained context which can be quite
disruptive.

Change dump_tasks() to be aligned with select_bad_process and use
mem_cgroup_scan_tasks to selectively traverse only processes of the target
memcg hierarchy during memcg OOM.

Signed-off-by: Shakeel Butt 
Acked-by: Michal Hocko 
Acked-by: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: Vladimir Davydov 
Cc: David Rientjes 
Cc: KOSAKI Motohiro 
Cc: Paul Jackson 
Cc: Nick Piggin 
Cc: Andrew Morton 
---
Changelog since v3:
- None

Changelog since v2:
- Updated the commit message.

Changelog since v1:
- Divide the patch into two patches.

 mm/oom_kill.c | 68 ++-
 1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 085abc91024d..a940d2aa92d6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -380,10 +380,38 @@ static void select_bad_process(struct oom_control *oc)
}
 }
 
+static int dump_task(struct task_struct *p, void *arg)
+{
+   struct oom_control *oc = arg;
+   struct task_struct *task;
+
+   if (oom_unkillable_task(p, NULL, oc->nodemask))
+   return 0;
+
+   task = find_lock_task_mm(p);
+   if (!task) {
+   /*
+* This is a kthread or all of p's threads have already
+* detached their mm's.  There's no need to report
+* them; they can't be oom killed anyway.
+*/
+   return 0;
+   }
+
+   pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
+   task->pid, from_kuid(&init_user_ns, task_uid(task)),
+   task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
+   mm_pgtables_bytes(task->mm),
+   get_mm_counter(task->mm, MM_SWAPENTS),
+   task->signal->oom_score_adj, task->comm);
+   task_unlock(task);
+
+   return 0;
+}
+
 /**
  * dump_tasks - dump current memory state of all system tasks
- * @memcg: current's memory controller, if constrained
- * @nodemask: nodemask passed to page allocator for mempolicy ooms
+ * @oc: pointer to struct oom_control
  *
  * Dumps the current memory state of all eligible tasks.  Tasks not in the same
  * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
@@ -391,37 +419,21 @@ static void select_bad_process(struct oom_control *oc)
  * State information includes task's pid, uid, tgid, vm size, rss,
  * pgtables_bytes, swapents, oom_score_adj value, and name.
  */
-static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
+static void dump_tasks(struct oom_control *oc)
 {
-   struct task_struct *p;
-   struct task_struct *task;
-
pr_info("Tasks state (memory values in pages):\n");
pr_info("[  pid  ]   uid  tgid total_vm  rss pgtables_bytes 
swapents oom_score_adj name\n");
-   rcu_read_lock();
-   for_each_process(p) {
-   if (oom_unkillable_task(p, memcg, nodemask))
-   continue;
 
-   task = find_lock_task_mm(p);
-   if (!task) {
-   /*
-* This is a kthread or all of p's threads have already
-* detached their mm's.  There's no need to report
-* them; they can't be oom killed anyway.
-*/
-   continue;
-   }
+   if (is_memcg_oom(oc))
+   mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+   else {
+   struct task_struct *p;
 
-   pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
-   task->pid, from_kuid(&init_user_ns, task_uid(task)),
-   task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
-   mm_pgtables_bytes(task->mm),
-   get_mm_counter(task->mm, MM_SWAPENTS),
-   task->signal->oom_score_adj, task->comm);
-   task_unlock(task);
+   rcu_read_lock();
+   for_each_process(p)
+   dump_task(p, oc);
+   rcu_read_unlock();
}
-   rcu_read_unlock();
 }
 
 static void dump_oom_summary(struct oom_control *oc, struct task_struct 
*victim)
@@ -453,7 +465,7 @@ static void dump_header(struct oom_control *oc, struct 
task_struct *p)
dump_unreclaimable_slab();
}
if (sysctl_oom_dump_tasks)
-   dump_tasks(oc->memcg, oc->nodemask);
+   dump_tasks(oc);
if (p)
dump_oom_summary(oc, p);
 }
-- 
2.22.0.410.gd8fdbe21b5-goog



[PATCH v4 2/3] mm, oom: remove redundant task_in_mem_cgroup() check

2019-06-28 Thread Shakeel Butt
oom_unkillable_task() can be called from three different contexts, i.e.
global OOM, memcg OOM and the oom_score procfs interface.  At the moment
oom_unkillable_task() does a task_in_mem_cgroup() check on the given
process.  Since there is no reason to perform the task_in_mem_cgroup()
check for global OOM and the oom_score procfs interface, those contexts
provide a NULL memcg and skip the task_in_mem_cgroup() check.  However,
for the memcg OOM context, oom_unkillable_task() is always called from
mem_cgroup_scan_tasks() and thus the task_in_mem_cgroup() check becomes
redundant and effectively dead code. So, just remove the
task_in_mem_cgroup() check altogether.

Signed-off-by: Shakeel Butt 
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Acked-by: Roman Gushchin 
Cc: David Rientjes 
Cc: Johannes Weiner 
Cc: KOSAKI Motohiro 
Cc: Nick Piggin 
Cc: Paul Jackson 
Cc: Vladimir Davydov 
Cc: Andrew Morton 
---
Changelog since v3:
- Update commit message.

Changelog since v2:
- Further divided the patch into two patches.
- Incorporated the task_in_mem_cgroup() from Tetsuo.

Changelog since v1:
- Divide the patch into two patches.

 fs/proc/base.c |  2 +-
 include/linux/memcontrol.h |  7 ---
 include/linux/oom.h|  2 +-
 mm/memcontrol.c| 26 --
 mm/oom_kill.c  | 19 +++
 5 files changed, 9 insertions(+), 47 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b8d5d100ed4a..5eacce5e924a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -532,7 +532,7 @@ static int proc_oom_score(struct seq_file *m, struct 
pid_namespace *ns,
unsigned long totalpages = totalram_pages() + total_swap_pages;
unsigned long points = 0;
 
-   points = oom_badness(task, NULL, NULL, totalpages) *
+   points = oom_badness(task, NULL, totalpages) *
1000 / totalpages;
seq_printf(m, "%lu\n", points);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9abf31bbe53a..2cbce1fe7780 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -407,7 +407,6 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
pglist_data *pgdat,
 
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 
-bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -896,12 +895,6 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return true;
 }
 
-static inline bool task_in_mem_cgroup(struct task_struct *task,
- const struct mem_cgroup *memcg)
-{
-   return true;
-}
-
 static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
return NULL;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index d07992009265..b75104690311 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -108,7 +108,7 @@ static inline vm_fault_t check_stable_address_space(struct 
mm_struct *mm)
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
-   struct mem_cgroup *memcg, const nodemask_t *nodemask,
+   const nodemask_t *nodemask,
unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7532ddcf31b2..b3f67a6b6527 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1259,32 +1259,6 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, 
enum lru_list lru,
*lru_size += nr_pages;
 }
 
-bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
-{
-   struct mem_cgroup *task_memcg;
-   struct task_struct *p;
-   bool ret;
-
-   p = find_lock_task_mm(task);
-   if (p) {
-   task_memcg = get_mem_cgroup_from_mm(p->mm);
-   task_unlock(p);
-   } else {
-   /*
-* All threads may have already detached their mm's, but the oom
-* killer still needs to detect if they have already been oom
-* killed to prevent needlessly killing additional tasks.
-*/
-   rcu_read_lock();
-   task_memcg = mem_cgroup_from_task(task);
-   css_get(&task_memcg->css);
-   rcu_read_unlock();
-   }
-   ret = mem_cgroup_is_descendant(task_memcg, memcg);
-   css_put(&task_memcg->css);
-   return ret;
-}
-
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
  * @memcg: the memory cgroup
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a940d2aa92d6..eff879acc886 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -153,17 +153,13 @@ static inline bool is_memcg_oom(struct oom_control *oc)
 
 /* return true if the task is not adequate as candidate victim ta

[PATCH v4 3/3] oom: decouple mems_allowed from oom_unkillable_task

2019-06-28 Thread Shakeel Butt
RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
RSP: 0018:888000127490 EFLAGS: 00010a03
RAX: dc00 RBX: 8880a4cd5438 RCX: 818dae9c
RDX: 1c3cc602 RSI: 818dac8d RDI: 0001
RBP: 8880001274d0 R08: 88886180 R09: ed1015d26be0
R10: ed1015d26bdf R11: 8880ae935efb R12: 800061e63007
R13:  R14: 800061e63017 R15: 11124ea6
FS:  561f5940() GS:8880ae80() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 001b2f823000 CR3: 9237e000 CR4: 001426f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0600

The fix is to decouple the cpuset/mempolicy intersection check from
oom_unkillable_task() and make sure the cpuset/mempolicy intersection check
is only done in the global OOM context.

Signed-off-by: Shakeel Butt 
Reported-by: syzbot+d0fc9d3c166bc5e4a...@syzkaller.appspotmail.com
Acked-by: Michal Hocko 
Acked-by: Roman Gushchin 
Cc: David Rientjes 
Cc: Johannes Weiner 
Cc: KOSAKI Motohiro 
Cc: Nick Piggin 
Cc: Paul Jackson 
Cc: Tetsuo Handa 
Cc: Vladimir Davydov 
Cc: Andrew Morton 
---
Changelog since v3:
- Changed function name and update comment.

Changelog since v2:
- Further divided the patch into two patches.
- More cleaned version.

Changelog since v1:
- Divide the patch into two patches.

 fs/proc/base.c  |  3 +--
 include/linux/oom.h |  1 -
 mm/oom_kill.c   | 57 +
 3 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5eacce5e924a..57b7a0d75ef5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -532,8 +532,7 @@ static int proc_oom_score(struct seq_file *m, struct 
pid_namespace *ns,
unsigned long totalpages = totalram_pages() + total_swap_pages;
unsigned long points = 0;
 
-   points = oom_badness(task, NULL, totalpages) *
-   1000 / totalpages;
+   points = oom_badness(task, totalpages) * 1000 / totalpages;
seq_printf(m, "%lu\n", points);
 
return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b75104690311..c696c265f019 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -108,7 +108,6 @@ static inline vm_fault_t check_stable_address_space(struct 
mm_struct *mm)
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
-   const nodemask_t *nodemask,
unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eff879acc886..95872bdfec4e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -64,21 +64,33 @@ int sysctl_oom_dump_tasks = 1;
  */
 DEFINE_MUTEX(oom_lock);
 
+static inline bool is_memcg_oom(struct oom_control *oc)
+{
+   return oc->memcg != NULL;
+}
+
 #ifdef CONFIG_NUMA
 /**
- * has_intersects_mems_allowed() - check task eligiblity for kill
+ * oom_cpuset_eligible() - check task eligiblity for kill
  * @start: task struct of which task to consider
  * @mask: nodemask passed to page allocator for mempolicy ooms
  *
  * Task eligibility is determined by whether or not a candidate task, @tsk,
  * shares the same mempolicy nodes as current if it is bound by such a policy
  * and whether or not it has the same set of allowed cpuset nodes.
+ *
+ * This function is assuming oom-killer context and 'current' has triggered
+ * the oom-killer.
  */
-static bool has_intersects_mems_allowed(struct task_struct *start,
-   const nodemask_t *mask)
+static bool oom_cpuset_eligible(struct task_struct *start,
+   struct oom_control *oc)
 {
struct task_struct *tsk;
bool ret = false;
+   const nodemask_t *mask = oc->nodemask;
+
+   if (is_memcg_oom(oc))
+   return true;
 
rcu_read_lock();
for_each_thread(start, tsk) {
@@ -105,8 +117,7 @@ static bool has_intersects_mems_allowed(struct task_struct 
*start,
return ret;
 }
 #else
-static bool has_intersects_mems_allowed(struct task_struct *tsk,
-   const nodemask_t *mask)
+static bool oom_cpuset_eligible(struct task_struct *tsk, struct oom_control 
*oc)
 {
return true;
 }
@@ -146,24 +157,13 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
return oc->order == -1;
 }
 
-static inline bool is_memcg_oom(struct oom_control *oc)
-{
-   return oc->memcg != NULL;
-}
-
 /* return true if the task is not adequate as candidate victim task. */
-static bool oom_unkillable_

Re: [PATCH] mm, vmscan: prevent useless kswapd loops

2019-06-28 Thread Shakeel Butt
On Fri, Jun 28, 2019 at 11:53 AM Yang Shi  wrote:
>
>
>
> On 6/27/19 6:55 PM, Shakeel Butt wrote:
> > On production we have noticed hard lockups on large machines running
> > large jobs due to kswaps hoarding lru lock within isolate_lru_pages when
> > sc->reclaim_idx is 0 which is a small zone. The lru was couple hundred
> > GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in
> > isolate_lru_pages was basically skipping GiBs of pages while holding the
> > LRU spinlock with interrupt disabled.
> >
> > On further inspection, it seems like there are two issues:
> >
> > 1) If the kswapd on the return from balance_pgdat() could not sleep
> > (maybe all zones are still unbalanced), the classzone_idx is set to 0,
> > unintentionally, and the whole reclaim cycle of kswapd will try to reclaim
> > only the lowest and smallest zone while traversing the whole memory.
> >
> > 2) Fundamentally isolate_lru_pages() is really bad when the allocation
> > has woken kswapd for a smaller zone on a very large machine running very
> > large jobs. It can hoard the LRU spinlock while skipping over 100s of
> > GiBs of pages.
> >
> > This patch only fixes the (1). The (2) needs a more fundamental solution.
> >
> > Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely
> > due to mismatched classzone_idx")
> > Signed-off-by: Shakeel Butt 
> > ---
> >   mm/vmscan.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9e3292ee5c7c..786dacfdfe29 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3908,7 +3908,7 @@ static int kswapd(void *p)
> >
> >   /* Read the new order and classzone_idx */
> >   alloc_order = reclaim_order = pgdat->kswapd_order;
> > - classzone_idx = kswapd_classzone_idx(pgdat, 0);
> > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>
> I'm a little bit confused by the fix. What happens if kswapd is woken for
> a lower zone? It looks like kswapd may just reclaim the higher zone instead
> of the lower zone?
>
> For example, after bootup, classzone_idx should be (MAX_NR_ZONES - 1),
> if GFP_DMA is used for allocation and kswapd is waken up for ZONE_DMA,
> kswapd_classzone_idx would still return (MAX_NR_ZONES - 1) since
> kswapd_classzone_idx(pgdat, classzone_idx) returns the max classzone_idx.
>

Indeed you are right. I think kswapd_classzone_idx() is too convoluted.
It has different semantics when called from the wakers than when called
from kswapd(). Let me see if we can decouple the logic in this function
based on the context (or have two separate functions for the two
contexts).
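For illustration, one possible shape of that decoupling (roughly what the
v2 patch further down ends up doing; the waker-side helper name here is
hypothetical):

/* kswapd side: MAX_NR_ZONES means "no pending request", keep the old index. */
static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
                                           enum zone_type prev_classzone_idx)
{
        if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                return prev_classzone_idx;
        return pgdat->kswapd_classzone_idx;
}

/* Waker side: only ever raise the requested index. */
static void kswapd_update_classzone_idx(pg_data_t *pgdat,
                                        enum zone_type classzone_idx)
{
        if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                pgdat->kswapd_classzone_idx = classzone_idx;
        else
                pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx,
                                                  classzone_idx);
}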

thanks,
Shakeel


Re: slub: don't panic for memcg kmem cache creation failure

2019-06-29 Thread Shakeel Butt
On Sat, Jun 29, 2019 at 7:05 AM Alexey Dobriyan  wrote:
>
> > -   if (flags & SLAB_PANIC)
> > -   panic("Cannot create slab %s size=%u realsize=%u order=%u 
> > offset=%u flags=%lx\n",
> > - s->name, s->size, s->size,
> > - oo_order(s->oo), s->offset, (unsigned long)flags);
>
> This is wrong. Without SLAB_PANIC people will start to implement error
> checking out of habit and add all slightly different error messages.
> This simply increases text and rodata size.
>
> If memcg kmem caches creation failure is OK, then SLAB_PANIC should not
> be passed.
>
> The fact that SLAB doesn't implement SLAB_PANIC is SLAB bug.

I do not agree with you. IMHO kmem_cache_create_usercopy() is the right
place to handle SLAB_PANIC, and it already does so. If you want extra
info here, you can add a pr_warn() for SLAB_PANIC here and the caller
will still, rightfully, do the panic().
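For reference, the SLAB_PANIC handling referred to above is the error path
at the end of kmem_cache_create_usercopy() in mm/slab_common.c; roughly
(sketch, details may differ):

        if (err) {
                if (flags & SLAB_PANIC)
                        panic("kmem_cache_create: Failed to create slab '%s'. Error %d\n",
                              name, err);

                /* Non-SLAB_PANIC callers just get a warning and a NULL cache. */
                pr_warn("kmem_cache_create(%s) failed with error %d\n",
                        name, err);
                dump_stack();
                return NULL;
        }
        return s;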

thanks,
Shakeel


[PATCH v2] mm, vmscan: prevent useless kswapd loops

2019-07-01 Thread Shakeel Butt
In production we have noticed hard lockups on large machines running
large jobs, due to kswapd hoarding the LRU lock within isolate_lru_pages()
when sc->reclaim_idx is 0, which is a small zone. The LRU was a couple
hundred GiB and the condition (page_zonenum(page) > sc->reclaim_idx) in
isolate_lru_pages() was basically skipping GiBs of pages while holding the
LRU spinlock with interrupts disabled.

On further inspection, it seems like there are two issues:

1) If kswapd, on the return from balance_pgdat(), could not sleep
(i.e. the node is still unbalanced), the classzone_idx is unintentionally
set to 0, and the whole reclaim cycle of kswapd will try to reclaim
only the lowest and smallest zone while traversing the whole memory.

2) Fundamentally isolate_lru_pages() is really bad when the allocation
has woken kswapd for a smaller zone on a very large machine running very
large jobs. It can hoard the LRU spinlock while skipping over 100s of
GiBs of pages.

This patch only fixes (1). Issue (2) needs a more fundamental solution.
To fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
invalid, use the classzone_idx of the previous kswapd loop; otherwise use
the one the waker has requested.

Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely
due to mismatched classzone_idx")

Signed-off-by: Shakeel Butt 
---
Changelog since v1:
- fixed the patch based on Yang Shi's comment.

 mm/vmscan.c | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9e3292ee5c7c..eacf87f07afe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3760,19 +3760,18 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
 }
 
 /*
- * pgdat->kswapd_classzone_idx is the highest zone index that a recent
- * allocation request woke kswapd for. When kswapd has not woken recently,
- * the value is MAX_NR_ZONES which is not a valid index. This compares a
- * given classzone and returns it or the highest classzone index kswapd
- * was recently woke for.
+ * The pgdat->kswapd_classzone_idx is used to pass the highest zone index to be
+ * reclaimed by kswapd from the waker. If the value is MAX_NR_ZONES which is not
+ * a valid index then either kswapd runs for first time or kswapd couldn't sleep
+ * after previous reclaim attempt (node is still unbalanced). In that case
+ * return the zone index of the previous kswapd reclaim cycle.
  */
 static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
-  enum zone_type classzone_idx)
+  enum zone_type prev_classzone_idx)
 {
if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
-   return classzone_idx;
-
-   return max(pgdat->kswapd_classzone_idx, classzone_idx);
+   return prev_classzone_idx;
+   return pgdat->kswapd_classzone_idx;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int 
reclaim_order,
@@ -3908,7 +3907,7 @@ static int kswapd(void *p)
 
/* Read the new order and classzone_idx */
alloc_order = reclaim_order = pgdat->kswapd_order;
-   classzone_idx = kswapd_classzone_idx(pgdat, 0);
+   classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
pgdat->kswapd_order = 0;
pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 
@@ -3961,8 +3960,12 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, 
int order,
if (!cpuset_zone_allowed(zone, gfp_flags))
return;
pgdat = zone->zone_pgdat;
-   pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat,
-  classzone_idx);
+
+   if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
+   pgdat->kswapd_classzone_idx = classzone_idx;
+   else
+   pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx,
+ classzone_idx);
pgdat->kswapd_order = max(pgdat->kswapd_order, order);
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
-- 
2.22.0.410.gd8fdbe21b5-goog



Re: [PATCH v2] mm/z3fold.c: Lock z3fold page before __SetPageMovable()

2019-07-01 Thread Shakeel Butt
On Mon, Jul 1, 2019 at 5:51 PM Henry Burns  wrote:
>
> __SetPageMovable() expects its page to be locked, but z3fold.c doesn't
> lock the page. Following zsmalloc.c's example we call trylock_page() and
> unlock_page(). Also makes z3fold_page_migrate() assert that newpage is
> passed in locked, as documentation.
>
> Signed-off-by: Henry Burns 
> Suggested-by: Vitaly Wool 
> ---
>  Changelog since v1:
>  - Added an if statement around WARN_ON(trylock_page(page)) to avoid
>unlocking a page locked by a someone else.
>
>  mm/z3fold.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index e174d1549734..6341435b9610 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -918,7 +918,10 @@ static int z3fold_alloc(struct z3fold_pool *pool, size_t 
> size, gfp_t gfp,
> set_bit(PAGE_HEADLESS, &page->private);
> goto headless;
> }
> -   __SetPageMovable(page, pool->inode->i_mapping);
> +   if (!WARN_ON(!trylock_page(page))) {
> +   __SetPageMovable(page, pool->inode->i_mapping);
> +   unlock_page(page);
> +   }

Can you please comment why lock_page() is not used here?

> z3fold_page_lock(zhdr);
>
>  found:
> @@ -1325,6 +1328,7 @@ static int z3fold_page_migrate(struct address_space 
> *mapping, struct page *newpa
>
> VM_BUG_ON_PAGE(!PageMovable(page), page);
> VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +   VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
>
> zhdr = page_address(page);
> pool = zhdr_to_pool(zhdr);
> --
> 2.22.0.410.gd8fdbe21b5-goog
>


Re: [PATCH v3] mm/z3fold.c: Lock z3fold page before __SetPageMovable()

2019-07-02 Thread Shakeel Butt
On Tue, Jul 2, 2019 at 4:35 PM Henry Burns  wrote:
>
> Following zsmalloc.c's example we call trylock_page() and unlock_page().
> Also make z3fold_page_migrate() assert that newpage is passed in locked,
> as per the documentation.
>
> Link: http://lkml.kernel.org/r/20190702005122.41036-1-henrybu...@google.com
> Signed-off-by: Henry Burns 
> Suggested-by: Vitaly Wool 
> Acked-by: Vitaly Wool 
> Acked-by: David Rientjes 
> Cc: Shakeel Butt 
> Cc: Vitaly Vul 
> Cc: Mike Rapoport 
> Cc: Xidong Wang 
> Cc: Jonathan Adams 
> Cc: 
> Signed-off-by: Andrew Morton 

We need a "Fixes" tag.

Reviewed-by: Shakeel Butt 


Re: [PATCH V3 1/2] zpool: Add malloc_support_movable to zpool_driver

2019-06-05 Thread Shakeel Butt
On Wed, Jun 5, 2019 at 3:06 AM Hui Zhu  wrote:
>
> As a zpool_driver, zsmalloc can allocate movable memory because it
> support migate pages.
> But zbud and z3fold cannot allocate movable memory.
>

Cc: Vitaly

It seems like z3fold does support page migration, but z3fold's malloc
is rejecting __GFP_HIGHMEM. Vitaly, is there a reason to keep
rejecting __GFP_HIGHMEM after 1f862989b04a ("mm/z3fold.c: support page
migration")?

thanks,
Shakeel


Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer

2019-06-05 Thread Shakeel Butt
On Wed, Jun 5, 2019 at 10:14 AM Roman Gushchin  wrote:
>
> On Tue, Jun 04, 2019 at 09:35:02PM -0700, Shakeel Butt wrote:
> > On Tue, Jun 4, 2019 at 7:45 PM Roman Gushchin  wrote:
> > >
> > > Johannes noticed that reading the memcg kmem_cache pointer in
> > > cache_from_memcg_idx() is performed using READ_ONCE() macro,
> > > which doesn't implement a SMP barrier, which is required
> > > by the logic.
> > >
> > > Add a proper smp_rmb() to be paired with smp_wmb() in
> > > memcg_create_kmem_cache().
> > >
> > > The same applies to memcg_create_kmem_cache() itself,
> > > which reads the same value without barriers and READ_ONCE().
> > >
> > > Suggested-by: Johannes Weiner 
> > > Signed-off-by: Roman Gushchin 
> >
> > Reviewed-by: Shakeel Butt 
> >
> > This seems like independent to the series. Shouldn't this be Cc'ed stable?
>
> It is independent, but let's keep it here to avoid merge conflicts.
>
> It has been so for a long time, and nobody complained, so I'm not sure
> if we really need a stable backport. Do you have a different opinion?
>

Nah, it's fine as it is.


Re: [PATCH V3 2/2] zswap: Use movable memory if zpool support allocate movable memory

2019-06-05 Thread Shakeel Butt
On Wed, Jun 5, 2019 at 3:06 AM Hui Zhu  wrote:
>
> This is the third version that was updated according to the comments
> from Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and
> Shakeel Butt https://lkml.org/lkml/2019/6/4/973
>
> zswap compresses swap pages into a dynamically allocated RAM-based
> memory pool.  The memory pool should be zbud, z3fold or zsmalloc.
> All of them will allocate unmovable pages.  It will increase the
> number of unmovable page blocks that will bad for anti-fragment.
>
> zsmalloc support page migration if request movable page:
> handle = zs_malloc(zram->mem_pool, comp_len,
> GFP_NOIO | __GFP_HIGHMEM |
> __GFP_MOVABLE);
>
> And commit "zpool: Add malloc_support_movable to zpool_driver" add
> zpool_malloc_support_movable check malloc_support_movable to make
> sure if a zpool support allocate movable memory.
>
> This commit let zswap allocate block with gfp
> __GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.
>
> Following part is test log in a pc that has 8G memory and 2G swap.
>
> Without this commit:
> ~# echo lz4 > /sys/module/zswap/parameters/compressor
> ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
> ~# echo 1 > /sys/module/zswap/parameters/enabled
> ~# swapon /swapfile
> ~# cd /home/teawater/kernel/vm-scalability/
> /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 
> 1024))
> /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
> 2717908992 bytes / 4826062 usecs = 549973 KB/s
> 2717908992 bytes / 4864201 usecs = 545661 KB/s
> 2717908992 bytes / 4867015 usecs = 545346 KB/s
> 2717908992 bytes / 4915485 usecs = 539968 KB/s
> 397853 usecs to free memory
> 357820 usecs to free memory
> 421333 usecs to free memory
> 420454 usecs to free memory
> /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
> Page block order: 9
> Pages per block:  512
>
> Free pages count per migrate type at order   0  1  2  3  
> 4  5  6  7  8  9 10
> Node0, zone  DMA, typeUnmovable  1  1  1  0  
> 2  1  1  0  1  0  0
> Node0, zone  DMA, type  Movable  0  0  0  0  
> 0  0  0  0  0  1  3
> Node0, zone  DMA, type  Reclaimable  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone  DMA, type   HighAtomic  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone  DMA, type  CMA  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone  DMA, type  Isolate  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, typeUnmovable  6  5  8  6  
> 6  5  4  1  1  1  0
> Node0, zoneDMA32, type  Movable 25 20 20 19 
> 22 15 14 11 11  5767
> Node0, zoneDMA32, type  Reclaimable  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, type   HighAtomic  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, type  CMA  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, type  Isolate  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone   Normal, typeUnmovable   4753   5588   5159   4613   
> 3712   2520   1448594188 11  0
> Node0, zone   Normal, type  Movable 16  3457   2648   
> 2143   1435860459223224296
> Node0, zone   Normal, type  Reclaimable  0  0 44 38 
> 11  2  0  0  0  0  0
> Node0, zone   Normal, type   HighAtomic  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone   Normal, type  CMA  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone   Normal, type  Isolate  0  0  0  0  
> 0  0  0  0  0  0  0
>
> Number of blocks type Unmovable  Movable  Reclaimable   HighAtomic
>   CMA  Isolate
> Node 0, zone  DMA1700 
>00
> Node 0, zoneDMA324 165200 
>00
> Node 0, zone   Normal  931 1485   15

Re: [PATCH V3 1/2] zpool: Add malloc_support_movable to zpool_driver

2019-06-05 Thread Shakeel Butt
On Wed, Jun 5, 2019 at 3:06 AM Hui Zhu  wrote:
>
> As a zpool_driver, zsmalloc can allocate movable memory because it
> support migate pages.
> But zbud and z3fold cannot allocate movable memory.
>
> This commit adds malloc_support_movable to zpool_driver.
> If a zpool_driver support allocate movable memory, set it to true.
> And add zpool_malloc_support_movable check malloc_support_movable
> to make sure if a zpool support allocate movable memory.
>
> Signed-off-by: Hui Zhu 

Reviewed-by: Shakeel Butt 

IMHO no need to block this series on z3fold query.

> ---
>  include/linux/zpool.h |  3 +++
>  mm/zpool.c| 16 
>  mm/zsmalloc.c | 19 ++-
>  3 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/zpool.h b/include/linux/zpool.h
> index 7238865e75b0..51bf43076165 100644
> --- a/include/linux/zpool.h
> +++ b/include/linux/zpool.h
> @@ -46,6 +46,8 @@ const char *zpool_get_type(struct zpool *pool);
>
>  void zpool_destroy_pool(struct zpool *pool);
>
> +bool zpool_malloc_support_movable(struct zpool *pool);
> +
>  int zpool_malloc(struct zpool *pool, size_t size, gfp_t gfp,
> unsigned long *handle);
>
> @@ -90,6 +92,7 @@ struct zpool_driver {
> struct zpool *zpool);
> void (*destroy)(void *pool);
>
> +   bool malloc_support_movable;
> int (*malloc)(void *pool, size_t size, gfp_t gfp,
> unsigned long *handle);
> void (*free)(void *pool, unsigned long handle);
> diff --git a/mm/zpool.c b/mm/zpool.c
> index a2dd9107857d..863669212070 100644
> --- a/mm/zpool.c
> +++ b/mm/zpool.c
> @@ -238,6 +238,22 @@ const char *zpool_get_type(struct zpool *zpool)
> return zpool->driver->type;
>  }
>
> +/**
> + * zpool_malloc_support_movable() - Check if the zpool support
> + * allocate movable memory
> + * @zpool: The zpool to check
> + *
> + * This returns if the zpool support allocate movable memory.
> + *
> + * Implementations must guarantee this to be thread-safe.
> + *
> + * Returns: true if if the zpool support allocate movable memory, false if 
> not
> + */
> +bool zpool_malloc_support_movable(struct zpool *zpool)
> +{
> +   return zpool->driver->malloc_support_movable;
> +}
> +
>  /**
>   * zpool_malloc() - Allocate memory
>   * @zpool: The zpool to allocate from.
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 0787d33b80d8..8f3d9a4d46f4 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -437,15 +437,16 @@ static u64 zs_zpool_total_size(void *pool)
>  }
>
>  static struct zpool_driver zs_zpool_driver = {
> -   .type = "zsmalloc",
> -   .owner =THIS_MODULE,
> -   .create =   zs_zpool_create,
> -   .destroy =  zs_zpool_destroy,
> -   .malloc =   zs_zpool_malloc,
> -   .free = zs_zpool_free,
> -   .map =  zs_zpool_map,
> -   .unmap =zs_zpool_unmap,
> -   .total_size =   zs_zpool_total_size,
> +   .type =   "zsmalloc",
> +   .owner =  THIS_MODULE,
> +   .create = zs_zpool_create,
> +   .destroy =zs_zpool_destroy,
> +   .malloc_support_movable = true,
> +   .malloc = zs_zpool_malloc,
> +   .free =   zs_zpool_free,
> +   .map =zs_zpool_map,
> +   .unmap =  zs_zpool_unmap,
> +   .total_size = zs_zpool_total_size,
>  };
>
>  MODULE_ALIAS("zpool-zsmalloc");
> --
> 2.21.0 (Apple Git-120)
>


Re: [PATCH V2 2/2] zswap: Add module parameter malloc_movable_if_support

2019-06-04 Thread Shakeel Butt
On Sun, Jun 2, 2019 at 2:47 AM Hui Zhu  wrote:
>
> This is the second version that was updated according to the comments
> from Sergey Senozhatsky in https://lkml.org/lkml/2019/5/29/73
>
> zswap compresses swap pages into a dynamically allocated RAM-based
> memory pool.  The memory pool should be zbud, z3fold or zsmalloc.
> All of them will allocate unmovable pages.  It will increase the
> number of unmovable page blocks that will bad for anti-fragment.
>
> zsmalloc support page migration if request movable page:
> handle = zs_malloc(zram->mem_pool, comp_len,
> GFP_NOIO | __GFP_HIGHMEM |
> __GFP_MOVABLE);
>
> And commit "zpool: Add malloc_support_movable to zpool_driver" add
> zpool_malloc_support_movable check malloc_support_movable to make
> sure if a zpool support allocate movable memory.
>
> This commit adds module parameter malloc_movable_if_support to enable
> or disable zpool allocate block with gfp __GFP_HIGHMEM | __GFP_MOVABLE
> if it support allocate movable memory (disabled by default).
>
> Following part is test log in a pc that has 8G memory and 2G swap.
>
> When it disabled:
>  echo lz4 > /sys/module/zswap/parameters/compressor
>  echo zsmalloc > /sys/module/zswap/parameters/zpool
>  echo 1 > /sys/module/zswap/parameters/enabled
>  swapon /swapfile
>  cd /home/teawater/kernel/vm-scalability/
> /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 
> 1024))
> /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
> 2717908992 bytes / 3977932 usecs = 667233 KB/s
> 2717908992 bytes / 4160702 usecs = 637923 KB/s
> 2717908992 bytes / 4354611 usecs = 609516 KB/s
> 293359 usecs to free memory
> 340304 usecs to free memory
> 205781 usecs to free memory
> 2717908992 bytes / 5588016 usecs = 474982 KB/s
> 166124 usecs to free memory
> /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
> Page block order: 9
> Pages per block:  512
>
> Free pages count per migrate type at order   0  1  2  3  
> 4  5  6  7  8  9 10
> Node0, zone  DMA, typeUnmovable  1  1  1  0  
> 2  1  1  0  1  0  0
> Node0, zone  DMA, type  Movable  0  0  0  0  
> 0  0  0  0  0  1  3
> Node0, zone  DMA, type  Reclaimable  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone  DMA, type   HighAtomic  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone  DMA, type  CMA  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone  DMA, type  Isolate  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, typeUnmovable  5 10  9  8  
> 8  5  1  2  3  0  0
> Node0, zoneDMA32, type  Movable 15 16 14 12 
> 14 10  9  6  6  5776
> Node0, zoneDMA32, type  Reclaimable  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, type   HighAtomic  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, type  CMA  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zoneDMA32, type  Isolate  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone   Normal, typeUnmovable   7097   6914   6473   5642   
> 4373   2664   1220319 78  4  0
> Node0, zone   Normal, type  Movable   2092   3216   2820   2266   
> 1585946559359237258378
> Node0, zone   Normal, type  Reclaimable 47 88122 80 
> 34  9  5  4  2  1  2
> Node0, zone   Normal, type   HighAtomic  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone   Normal, type  CMA  0  0  0  0  
> 0  0  0  0  0  0  0
> Node0, zone   Normal, type  Isolate  0  0  0  0  
> 0  0  0  0  0  0  0
>
> Number of blocks type Unmovable  Movable  Reclaimable   HighAtomic
>   CMA  Isolate
> Node 0, zone  DMA1700 
>00
> Node 0, zoneDMA324 165200 
>00
> Node 0, zone   Normal  834 1572   250 
>00
>
> When it enabled:
>  echo lz4 > /sys/module/zswap/parameters/compressor
>  echo zsmalloc > /sys/module/zswap/parameters/zpool
>  echo 1 > /sys/module/zswap/parameters/enabled
>  echo 1 > 

Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer

2019-06-04 Thread Shakeel Butt
On Tue, Jun 4, 2019 at 7:45 PM Roman Gushchin  wrote:
>
> Johannes noticed that reading the memcg kmem_cache pointer in
> cache_from_memcg_idx() is performed using READ_ONCE() macro,
> which doesn't implement a SMP barrier, which is required
> by the logic.
>
> Add a proper smp_rmb() to be paired with smp_wmb() in
> memcg_create_kmem_cache().
>
> The same applies to memcg_create_kmem_cache() itself,
> which reads the same value without barriers and READ_ONCE().
>
> Suggested-by: Johannes Weiner 
> Signed-off-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 

This seems independent of the series. Shouldn't this be Cc'ed to stable?

> ---
>  mm/slab.h| 1 +
>  mm/slab_common.c | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 739099af6cbb..1176b61bb8fc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
>  * memcg_caches issues a write barrier to match this (see
>  * memcg_create_kmem_cache()).
>  */
> +   smp_rmb();
> cachep = READ_ONCE(arr->entries[idx]);
> rcu_read_unlock();
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..8092bdfc05d5 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -652,7 +652,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>  * allocation (see memcg_kmem_get_cache()), several threads can try to
>  * create the same cache, but only one of them may succeed.
>  */
> -   if (arr->entries[idx])
> +   smp_rmb();
> +   if (READ_ONCE(arr->entries[idx]))
> goto out_unlock;
>
> cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
> --
> 2.20.1
>


Re: [PATCH] mm/z3fold: Fix z3fold_buddy_slots use after free

2019-07-03 Thread Shakeel Butt
On Tue, Jul 2, 2019 at 11:03 PM Vitaly Wool  wrote:
>
> On Tue, Jul 2, 2019 at 6:57 PM Henry Burns  wrote:
> >
> > On Tue, Jul 2, 2019 at 12:45 AM Vitaly Wool  wrote:
> > >
> > > Hi Henry,
> > >
> > > On Mon, Jul 1, 2019 at 8:31 PM Henry Burns  wrote:
> > > >
> > > > Running z3fold stress testing with address sanitization
> > > > showed zhdr->slots was being used after it was freed.
> > > >
> > > > z3fold_free(z3fold_pool, handle)
> > > >   free_handle(handle)
> > > > kmem_cache_free(pool->c_handle, zhdr->slots)
> > > >   release_z3fold_page_locked_list(kref)
> > > > __release_z3fold_page(zhdr, true)
> > > >   zhdr_to_pool(zhdr)
> > > > slots_to_pool(zhdr->slots)  *BOOM*
> > >
> > > Thanks for looking into this. I'm not entirely sure I'm all for
> > > splitting free_handle() but let me think about it.
> > >
> > > > Instead we split free_handle into two functions, release_handle()
> > > > and free_slots(). We use release_handle() in place of free_handle(),
> > > > and use free_slots() to call kmem_cache_free() after
> > > > __release_z3fold_page() is done.
> > >
> > > A little less intrusive solution would be to move backlink to pool
> > > from slots back to z3fold_header. Looks like it was a bad idea from
> > > the start.
> > >
> > > Best regards,
> > >Vitaly
> >
> > We still want z3fold pages to be movable though. Wouldn't moving
> > the backlink to the pool from slots to z3fold_header prevent us from
> > enabling migration?
>
> That is a valid point but we can just add back pool pointer to
> z3fold_header. The thing here is, there's another patch in the
> pipeline that allows for a better (inter-page) compaction and it will
> somewhat complicate things, because sometimes slots will have to be
> released after z3fold page is released (because they will hold a
> handle to another z3fold page). I would prefer that we just added back
> pool to z3fold_header and changed zhdr_to_pool to just return
> zhdr->pool, then had the compaction patch valid again, and then we
> could come back to size optimization.
>

By adding the pool pointer back to z3fold_header, will we still be able to
move/migrate/compact the z3fold pages?


Re: [PATCH] mm, memcg: Add a memcg_slabinfo debugfs file

2019-06-19 Thread Shakeel Butt
On Wed, Jun 19, 2019 at 7:46 AM Waiman Long  wrote:
>
> There are concerns about memory leaks from extensive use of memory
> cgroups as each memory cgroup creates its own set of kmem caches. There
> is a possiblity that the memcg kmem caches may remain even after the
> memory cgroup removal. Therefore, it will be useful to show how many
> memcg caches are present for each of the kmem caches.
>
> This patch introduces a new /memcg_slabinfo file which is
> somewhat similar to /proc/slabinfo in format, but lists only slabs that
> are in memcg kmem caches. Information available in /proc/slabinfo are
> not repeated in memcg_slabinfo.
>

At Google, we have an interface, /proc/slabinfo_full, which shows each
kmem cache (root and memcg) on a separate line, i.e. no accumulation.
This interface has helped us a lot in debugging zombies and memory
leaks. The names of the memcg kmem caches include the memcg name, css
id and "dead" for offlined memcgs. I think this extra information is
much more useful for debugging. What do you think?

Shakeel


Re: [PATCH] mm, memcg: Add a memcg_slabinfo debugfs file

2019-06-19 Thread Shakeel Butt
On Wed, Jun 19, 2019 at 8:30 AM Waiman Long  wrote:
>
> On 6/19/19 11:18 AM, Shakeel Butt wrote:
> > On Wed, Jun 19, 2019 at 7:46 AM Waiman Long  wrote:
> >> There are concerns about memory leaks from extensive use of memory
> >> cgroups as each memory cgroup creates its own set of kmem caches. There
> >> is a possiblity that the memcg kmem caches may remain even after the
> >> memory cgroup removal. Therefore, it will be useful to show how many
> >> memcg caches are present for each of the kmem caches.
> >>
> >> This patch introduces a new /memcg_slabinfo file which is
> >> somewhat similar to /proc/slabinfo in format, but lists only slabs that
> >> are in memcg kmem caches. Information available in /proc/slabinfo are
> >> not repeated in memcg_slabinfo.
> >>
> > At Google, we have an interface /proc/slabinfo_full which shows each
> > kmem cache (root and memcg) on a separate line i.e. no accumulation.
> > This interface has helped us a lot for debugging zombies and memory
> > leaks. The name of the memcg kmem caches include the memcg name, css
> > id and "dead" for offlined memcgs. I think these extra information is
> > much more useful for debugging. What do you think?
> >
> > Shakeel
>
> Yes, I think that can be a good idea. My only concern is that it can be
> very verbose. Will work on a v2 patch.
>

Yes, it is very verbose but it is only for debugging and normal users
should not be (continuously) reading that interface.

Shakeel


Re: memcg/kmem panics

2019-06-19 Thread Shakeel Butt
On Wed, Jun 19, 2019 at 3:50 PM Dave Hansen  wrote:
>
> I have a bit of a grievance to file.  :)
>
> I'm seeing "Cannot create slab..." panic()s coming from
> kmem_cache_open() when trying to create memory cgroups on a Fedora
> system running 5.2-rc's.  The panic()s happen when failing to create
> memcg-specific slabs because the memcg code passes through the
> root_cache->flags, which can include SLAB_PANIC.
>
> I haven't tracked down the root cause yet, or where this behavior
> started.  But, the end-user experience is that systemd tries to create a
> cgroup and ends up with a kernel panic.  That's rather sad, especially
> for the poor sod that's trying to debug it.
>
> Should memcg_create_kmem_cache() be, perhaps filtering out SLAB_PANIC
> from root_cache->flags, for instance?  That might make the system a bit
> less likely to turn into a doorstop if and when something goes mildly
> wrong.  I've hacked out the panic()s and the system actually seems to
> boot OK.

You must be using CONFIG_SLUB; I see that kmem_cache_open() in SLUB
does a SLAB_PANIC check. I think we should remove that altogether from
SLUB, as SLAB does not do this and a failure in memcg kmem cache
creation can and should be handled gracefully. I can send a patch to
remove that check.

Shakeel


[PATCH] slub: Don't panic for memcg kmem cache creation failure

2019-06-19 Thread Shakeel Butt
Currently for CONFIG_SLUB, if memcg kmem cache creation fails and the
corresponding root kmem cache has the SLAB_PANIC flag, the kernel will
crash. This is unnecessary as the kernel can handle the creation
failures of memcg kmem caches. Additionally, CONFIG_SLAB does not
implement this behavior. So, to keep the behavior consistent between
SLAB and SLUB, remove the panic for memcg kmem cache creation
failures. Root kmem cache creation failure with SLAB_PANIC correctly
panics for both SLAB and SLUB.

Reported-by: Dave Hansen 
Signed-off-by: Shakeel Butt 
---
 mm/slub.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 6a5174b51cd6..84c6508e360d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3640,10 +3640,6 @@ static int kmem_cache_open(struct kmem_cache *s, 
slab_flags_t flags)
 
free_kmem_cache_nodes(s);
 error:
-   if (flags & SLAB_PANIC)
-   panic("Cannot create slab %s size=%u realsize=%u order=%u 
offset=%u flags=%lx\n",
- s->name, s->size, s->size,
- oo_order(s->oo), s->offset, (unsigned long)flags);
return -EINVAL;
 }
 
-- 
2.22.0.410.gd8fdbe21b5-goog



Re: [PATCH v2] mm, memcg: Add a memcg_slabinfo debugfs file

2019-06-19 Thread Shakeel Butt
Hi Waiman,

On Wed, Jun 19, 2019 at 10:16 AM Waiman Long  wrote:
>
> There are concerns about memory leaks from extensive use of memory
> cgroups as each memory cgroup creates its own set of kmem caches. There
> is a possiblity that the memcg kmem caches may remain even after the
> memory cgroups have been offlined. Therefore, it will be useful to show
> the status of each of memcg kmem caches.
>
> This patch introduces a new /memcg_slabinfo file which is
> somewhat similar to /proc/slabinfo in format, but lists only information
> about kmem caches that have child memcg kmem caches. Information
> available in /proc/slabinfo are not repeated in memcg_slabinfo.
>
> A portion of a sample output of the file was:
>
>   #  
>   rpc_inode_cache   root  13 51  1  1
>   rpc_inode_cache 48   0  0  0  0
>   fat_inode_cache   root   1 45  1  1
>   fat_inode_cache 41   2 45  1  1
>   xfs_inode root 770816 24 24
>   xfs_inode   92  22 34  1  1
>   xfs_inode   88:dead  1 34  1  1
>   xfs_inode   89:dead 23 34  1  1
>   xfs_inode   85   4 34  1  1
>   xfs_inode   84   9 34  1  1
>
> The css id of the memcg is also listed. If a memcg is not online,
> the tag ":dead" will be attached as shown above.
>
> Suggested-by: Shakeel Butt 
> Signed-off-by: Waiman Long 
> ---
>  mm/slab_common.c | 57 
>  1 file changed, 57 insertions(+)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..2bca1558a722 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1498,6 +1499,62 @@ static int __init slab_proc_init(void)
> return 0;
>  }
>  module_init(slab_proc_init);
> +
> +#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
> +/*
> + * Display information about kmem caches that have child memcg caches.
> + */
> +static int memcg_slabinfo_show(struct seq_file *m, void *unused)
> +{
> +   struct kmem_cache *s, *c;
> +   struct slabinfo sinfo;
> +
> +   mutex_lock(&slab_mutex);

On large machines there can be thousands of memcgs, and potentially
each memcg can have hundreds of kmem caches, so the slab_mutex can be
held for a very long time.

Our internal implementation traverses the memcg tree and then
traverses 'memcg->kmem_caches' under slab_mutex (with a cond_resched()
after unlocking).
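A rough sketch of that pattern, written as a hypothetical
memcg_slabinfo_show() (it uses the memcg->kmem_caches list that exists at
this point, but it is not the actual internal code):

static int memcg_slabinfo_show(struct seq_file *m, void *unused)
{
        struct mem_cgroup *memcg;

        /* Walk the memcg tree; take slab_mutex only per memcg. */
        for (memcg = mem_cgroup_iter(NULL, NULL, NULL); memcg;
             memcg = mem_cgroup_iter(NULL, memcg, NULL)) {
                struct kmem_cache *c;

                mutex_lock(&slab_mutex);
                list_for_each_entry(c, &memcg->kmem_caches,
                                    memcg_params.kmem_caches_node) {
                        struct slabinfo sinfo;

                        memset(&sinfo, 0, sizeof(sinfo));
                        get_slabinfo(c, &sinfo);
                        seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
                                   cache_name(c), mem_cgroup_css(memcg)->id,
                                   sinfo.active_objs, sinfo.num_objs,
                                   sinfo.active_slabs, sinfo.num_slabs);
                }
                mutex_unlock(&slab_mutex);
                cond_resched();
        }
        return 0;
}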

> +   seq_puts(m, "#");
> +   seq_puts(m, "  \n");
> +   list_for_each_entry(s, &slab_root_caches, root_caches_node) {
> +   /*
> +* Skip kmem caches that don't have any memcg children.
> +*/
> +   if (list_empty(&s->memcg_params.children))
> +   continue;
> +
> +   memset(&sinfo, 0, sizeof(sinfo));
> +   get_slabinfo(s, &sinfo);
> +   seq_printf(m, "%-17s root  %6lu %6lu %6lu %6lu\n",
> +  cache_name(s), sinfo.active_objs, sinfo.num_objs,
> +  sinfo.active_slabs, sinfo.num_slabs);
> +
> +   for_each_memcg_cache(c, s) {
> +   struct cgroup_subsys_state *css;
> +   char *dead = "";
> +
> +   css = &c->memcg_params.memcg->css;
> +   if (!(css->flags & CSS_ONLINE))
> +   dead = ":dead";

Please note that Roman's kmem cache reparenting patch series has made
kmem caches of zombie memcgs a bit tricky. On memcg offlining the memcg
kmem caches are reparented and the css->id can get recycled. So, we
want to know whether a kmem cache has been reparented and which memcg
it belonged to initially. To determine whether a kmem cache has been
reparented, we can store a flag on the kmem cache, and for the previous
memcg we can use fhandle. However, to not make this more complicated,
for now we can just record that the kmem cache was reparented, i.e.
that it belongs to an offlined memcg. A rough sketch of that idea is
below.
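
Something along these lines (just an illustration; the 'reparented'
flag in memcg_cache_params is a hypothetical field that the
reparenting path would have to set, not existing code):

static const char *memcg_cache_tag(struct kmem_cache *c,
				   struct cgroup_subsys_state *css)
{
	/* hypothetical flag set when the cache is moved to the parent */
	if (c->memcg_params.reparented)
		return ":reparented";
	/* original memcg is still known but already offline */
	if (!(css->flags & CSS_ONLINE))
		return ":dead";
	return "";
}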

> +
> +   memset(&sinfo, 0, sizeof(sinfo));
> +   get_slabinfo(c, );
> +   seq_printf(m, "%-17s %4d%5s %6lu %6lu %6lu %6lu\n",
> +  cache_name(c), css->id, dead,
> +  sinfo.active_objs, sinfo.num_objs,
> +  sinfo.active_slabs, sinfo.num_slabs);
> +

Re: [PATCH v2] mm, memcg: Add a memcg_slabinfo debugfs file

2019-06-20 Thread Shakeel Butt
On Thu, Jun 20, 2019 at 7:24 AM Waiman Long  wrote:
>
> On 6/19/19 7:48 PM, Shakeel Butt wrote:
> > Hi Waiman,
> >
> > On Wed, Jun 19, 2019 at 10:16 AM Waiman Long  wrote:
> >> There are concerns about memory leaks from extensive use of memory
> >> cgroups as each memory cgroup creates its own set of kmem caches. There
> >> is a possiblity that the memcg kmem caches may remain even after the
> >> memory cgroups have been offlined. Therefore, it will be useful to show
> >> the status of each of memcg kmem caches.
> >>
> >> This patch introduces a new <debugfs>/memcg_slabinfo file which is
> >> somewhat similar to /proc/slabinfo in format, but lists only information
> >> about kmem caches that have child memcg kmem caches. Information
> >> available in /proc/slabinfo are not repeated in memcg_slabinfo.
> >>
> >> A portion of a sample output of the file was:
> >>
> >>   #<name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
> >>   rpc_inode_cache   root  13 51  1  1
> >>   rpc_inode_cache 48   0  0  0  0
> >>   fat_inode_cache   root   1 45  1  1
> >>   fat_inode_cache 41   2 45  1  1
> >>   xfs_inode root 770816 24 24
> >>   xfs_inode   92  22 34  1  1
> >>   xfs_inode   88:dead  1 34  1  1
> >>   xfs_inode   89:dead 23 34  1  1
> >>   xfs_inode   85   4 34  1  1
> >>   xfs_inode   84   9 34  1  1
> >>
> >> The css id of the memcg is also listed. If a memcg is not online,
> >> the tag ":dead" will be attached as shown above.
> >>
> >> Suggested-by: Shakeel Butt 
> >> Signed-off-by: Waiman Long 
> >> ---
> >>  mm/slab_common.c | 57 
> >>  1 file changed, 57 insertions(+)
> >>
> >> diff --git a/mm/slab_common.c b/mm/slab_common.c
> >> index 58251ba63e4a..2bca1558a722 100644
> >> --- a/mm/slab_common.c
> >> +++ b/mm/slab_common.c
> >> @@ -17,6 +17,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -1498,6 +1499,62 @@ static int __init slab_proc_init(void)
> >> return 0;
> >>  }
> >>  module_init(slab_proc_init);
> >> +
> >> +#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
> >> +/*
> >> + * Display information about kmem caches that have child memcg caches.
> >> + */
> >> +static int memcg_slabinfo_show(struct seq_file *m, void *unused)
> >> +{
> >> +   struct kmem_cache *s, *c;
> >> +   struct slabinfo sinfo;
> >> +
> >> +   mutex_lock(&slab_mutex);
> > On large machines there can be thousands of memcgs and potentially
> > each memcg can have hundreds of kmem caches. So, the slab_mutex can be
> > held for a very long time.
>
> But that is also what /proc/slabinfo does by doing mutex_lock() at
> slab_start() and mutex_unlock() at slab_stop(). So the same problem will
> happen when /proc/slabinfo is being read.
>
> When you are in a situation that reading /proc/slabinfo take a long time
> because of the large number of memcg's, the system is in some kind of
> trouble anyway. I am saying that we should not improve the scalability
> of this patch. It is just that some nasty race conditions may pop up if
> we release the lock and re-acquire it latter. That will greatly
> complicate the code to handle all those edge cases.
>

We have been using that interface and implementation for a couple of
years and have not seen any race conditions. However, I am fine with
what you have here for now. We can always come back if we think we
need to improve it.

> > Our internal implementation traverses the memcg tree and then
> > traverses 'memcg->kmem_caches' within the slab_mutex (and
> > cond_resched() after unlock).
> For cgroup v1, the setting of the CONFIG_SLUB_DEBUG option will allow
> you to iterate and display slabinfo just for that particular memcg. I am
> thinking of extending the debug controller to do similar thing for
> cgroup v2.

I was also planning to look into that, and it seems like you are
already on it. Do CC me on the patches.

> >> +   seq_puts(m, "#");
> >> +   seq_puts(m, "  \n");
> >> +   list_for_each_entry(s, &slab_root_caches, 

Re: [PATCH] slub: Don't panic for memcg kmem cache creation failure

2019-06-20 Thread Shakeel Butt
On Wed, Jun 19, 2019 at 10:50 PM Michal Hocko  wrote:
>
> On Wed 19-06-19 16:25:14, Shakeel Butt wrote:
> > Currently for CONFIG_SLUB, if a memcg kmem cache creation is failed and
> > the corresponding root kmem cache has SLAB_PANIC flag, the kernel will
> > be crashed. This is unnecessary as the kernel can handle the creation
> > failures of memcg kmem caches.
>
> AFAICS it will handle those by simply not accounting those objects
> right?
>

The memcg kmem cache creation is async. The allocation that triggers
the creation has already been decided to not be accounted. If the memcg
kmem cache creation fails, it fails silently and the next allocation
will trigger the creation process again.
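
For context, this is roughly what that fallback looks like in the
era's memcg_kmem_get_cache() (heavily simplified sketch; refcounting,
RCU locking and the root/enabled checks are elided, so treat it as an
approximation rather than the exact code):

/* caller is assumed to hold rcu_read_lock() */
static struct kmem_cache *memcg_cache_lookup(struct kmem_cache *cachep,
					     struct mem_cgroup *memcg,
					     int kmemcg_id)
{
	struct memcg_cache_array *arr;
	struct kmem_cache *memcg_cachep;

	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
	if (unlikely(!memcg_cachep)) {
		/*
		 * Not created yet, or an earlier creation attempt failed:
		 * queue the async creation work and fall back to the root
		 * cache, i.e. this particular allocation goes unaccounted.
		 */
		memcg_schedule_kmem_cache_create(memcg, cachep);
		return cachep;
	}
	return memcg_cachep;
}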

> > Additionally CONFIG_SLAB does not
> > implement this behavior. So, to keep the behavior consistent between
> > SLAB and SLUB, removing the panic for memcg kmem cache creation
> > failures. The root kmem cache creation failure for SLAB_PANIC correctly
> > panics for both SLAB and SLUB.
>
> I do agree that panicing is really dubious especially because it opens
> doors to shut the system down from a restricted environment. So the
> patch makes sesne to me.
>
> I am wondering whether SLAB_PANIC makes sense in general though. Why is
> it any different from any other essential early allocations? We tend to
> not care about allocation failures for those on bases that the system
> must be in a broken state to fail that early already. Do you think it is
> time to remove SLAB_PANIC altogether?
>

That would need some investigation into the history of SLAB_PANIC. I
will look into it.

> > Reported-by: Dave Hansen 
> > Signed-off-by: Shakeel Butt 
>
> Acked-by: Michal Hocko 

Thanks.


Re: [PATCH] mm: memcg/slab: properly handle kmem_caches reparented to root_mem_cgroup

2019-06-20 Thread Shakeel Butt
On Wed, Jun 19, 2019 at 6:57 PM Roman Gushchin  wrote:
>
> As a result of reparenting a kmem_cache might belong to the root
> memory cgroup. It happens when a top-level memory cgroup is removed,
> and all associated kmem_caches are reparented to the root memory
> cgroup.
>
> The root memory cgroup is special, and requires a special handling.
> Let's make sure that we don't try to charge or uncharge it,
> and we handle system-wide vmstats exactly as for root kmem_caches.
>
> Note, that we still need to alter the kmem_cache reference counter,
> so that the kmem_cache can be released properly.
>
> The issue was discovered by running CRIU tests; the following warning
> did appear:
>
> [  381.345960] WARNING: CPU: 0 PID: 11655 at mm/page_counter.c:62
> page_counter_cancel+0x26/0x30
> [  381.345992] Modules linked in:
> [  381.345998] CPU: 0 PID: 11655 Comm: kworker/0:8 Not tainted
> 5.2.0-rc5-next-20190618+ #1
> [  381.346001] Hardware name: Google Google Compute Engine/Google
> Compute Engine, BIOS Google 01/01/2011
> [  381.346010] Workqueue: memcg_kmem_cache kmemcg_workfn
> [  381.346013] RIP: 0010:page_counter_cancel+0x26/0x30
> [  381.346017] Code: 1f 44 00 00 0f 1f 44 00 00 48 89 f0 53 48 f7 d8
> f0 48 0f c1 07 48 29 f0 48 89 c3 48 89 c6 e8 61 ff ff ff 48 85 db 78
> 02 5b c3 <0f> 0b 5b c3 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 41
> 41 55
> [  381.346019] RSP: 0018:b3b34319f990 EFLAGS: 00010086
> [  381.346022] RAX: fffc RBX: fffc RCX: 
> 0004
> [  381.346024] RDX:  RSI: fffc RDI: 
> 9c2cd7165270
> [  381.346026] RBP: 0004 R08:  R09: 
> 0001
> [  381.346028] R10: 00c8 R11: 9c2cd684e660 R12: 
> fffc
> [  381.346030] R13: 0002 R14: 0006 R15: 
> 9c2c8ce1f200
> [  381.346033] FS:  () GS:9c2cd820()
> knlGS:
> [  381.346039] CS:  0010 DS:  ES:  CR0: 80050033
> [  381.346041] CR2: 007be000 CR3: 0001cdbfc005 CR4: 
> 001606f0
> [  381.346043] DR0:  DR1:  DR2: 
> 
> [  381.346045] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  381.346047] Call Trace:
> [  381.346054]  page_counter_uncharge+0x1d/0x30
> [  381.346065]  __memcg_kmem_uncharge_memcg+0x39/0x60
> [  381.346071]  __free_slab+0x34c/0x460
> [  381.346079]  deactivate_slab.isra.80+0x57d/0x6d0
> [  381.346088]  ? add_lock_to_list.isra.36+0x9c/0xf0
> [  381.346095]  ? __lock_acquire+0x252/0x1410
> [  381.346106]  ? cpumask_next_and+0x19/0x20
> [  381.346110]  ? slub_cpu_dead+0xd0/0xd0
> [  381.346113]  flush_cpu_slab+0x36/0x50
> [  381.346117]  ? slub_cpu_dead+0xd0/0xd0
> [  381.346125]  on_each_cpu_mask+0x51/0x70
> [  381.346131]  ? ksm_migrate_page+0x60/0x60
> [  381.346134]  on_each_cpu_cond_mask+0xab/0x100
> [  381.346143]  __kmem_cache_shrink+0x56/0x320
> [  381.346150]  ? ret_from_fork+0x3a/0x50
> [  381.346157]  ? unwind_next_frame+0x73/0x480
> [  381.346176]  ? __lock_acquire+0x252/0x1410
> [  381.346188]  ? kmemcg_workfn+0x21/0x50
> [  381.346196]  ? __mutex_lock+0x99/0x920
> [  381.346199]  ? kmemcg_workfn+0x21/0x50
> [  381.346205]  ? kmemcg_workfn+0x21/0x50
> [  381.346216]  __kmemcg_cache_deactivate_after_rcu+0xe/0x40
> [  381.346220]  kmemcg_cache_deactivate_after_rcu+0xe/0x20
> [  381.346223]  kmemcg_workfn+0x31/0x50
> [  381.346230]  process_one_work+0x23c/0x5e0
> [  381.346241]  worker_thread+0x3c/0x390
> [  381.346248]  ? process_one_work+0x5e0/0x5e0
> [  381.346252]  kthread+0x11d/0x140
> [  381.346255]  ? kthread_create_on_node+0x60/0x60
> [  381.346261]  ret_from_fork+0x3a/0x50
> [  381.346275] irq event stamp: 10302
> [  381.346278] hardirqs last  enabled at (10301): []
> _raw_spin_unlock_irq+0x29/0x40
> [  381.346282] hardirqs last disabled at (10302): []
> on_each_cpu_mask+0x49/0x70
> [  381.346287] softirqs last  enabled at (10262): []
> cgroup_idr_replace+0x3a/0x50
> [  381.346290] softirqs last disabled at (10260): []
> cgroup_idr_replace+0x1d/0x50
> [  381.346293] ---[ end trace b324ba73eb3659f0 ]---
>
> Reported-by: Andrei Vagin 
> Signed-off-by: Roman Gushchin 
> Cc: Christoph Lameter 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Shakeel Butt 
> Cc: Vladimir Davydov 
> Cc: Waiman Long 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Pekka Enberg 
> ---
>  mm/slab.h | 17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index a4c9b9d042de..c02e7f44268b 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @

Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-04 Thread Shakeel Butt
On Mon, May 4, 2020 at 9:06 AM Michal Hocko  wrote:
>
> On Mon 04-05-20 08:35:57, Shakeel Butt wrote:
> > On Mon, May 4, 2020 at 8:00 AM Michal Hocko  wrote:
> > >
> > > On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
> [...]
> > > > I am trying to see if "no eligible task" is really an issue and should
> > > > be warned for the "other use cases". The only real use-case I can
> > > > think of are resource managers adjusting the limit dynamically. I
> > > > don't see "no eligible task" a concerning reason for such use-case.
> > >
> > > It is very much a concerning reason to notify about like any other OOM
> > > situation due to hard limit breach. In this case it is worse in some
> > > sense because the limit cannot be trimmed down because there is no
> > > directly reclaimable memory at all. Such an oom situation is
> > > effectivelly conserved.
> > > --
> >
> > Let me make a more precise statement and tell me if you agree. The "no
> > eligible task" is concerning for the charging path but not for the
> > writer of memory.max. The writer can read the usage and
> > cgroup.[procs|events] to figure out the situation if needed.
>
> I really hate to repeat myself but this is no different from a regular
> oom situation.

Conceptually, yes, there is no difference, but there is no *divine
restriction* against making a difference if there is a real-world
use-case which would benefit from it.

> Admin sets the hard limit and the kernel tries to act
> upon that.
>
> You cannot make any assumption about what admin wanted or didn't want
> to see.

Actually we always make assumptions about how the feature we implement
will be used, and as new use-cases come along the assumptions evolve.

> We simply trigger the oom killer on memory.max and this is a
> documented behavior. No eligible task or no task at all is a simply a
> corner case

For "sweep before tear down" use-case this is not a corner case.

> when the kernel cannot act and mentions that along with the
> oom report so that whoever consumes that information can debug or act on
> that fact.
>
> Silencing the oom report is simply removing a potentially useful
> aid to debug further a potential problem.

*Potentially* useful for debugging versus actually beneficial for the
"sweep before tear down" use-case. Also I am not saying to make "no
dumps for memory.max when no eligible tasks" a set-in-stone rule. We
can always reevaluate when such information will actually be useful.

Johannes/Andrew, what's your opinion?


Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-05-05 Thread Shakeel Butt
On Mon, May 4, 2020 at 11:30 PM Dave Chinner  wrote:
>
> On Tue, Apr 28, 2020 at 10:27:32PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > > This patch series does some
> > > > minor modification to the loop driver so that each cgroup can make
> > > > forward progress independently to avoid this inversion.
> > > >
> > > > With this patch series applied, the above script triggers OOM kills
> > > > when writing through the loop device as expected.
> > >
> > > NACK!
> > >
> > > The IO that is disallowed should fail with ENOMEM or some similar
> > > error, not trigger an OOM kill that shoots some innocent bystander
> > > in the head. That's worse than using BUG() to report errors...
> >
> > Did you actually read the script?
>

Before responding, I want to make it clear that the OOM behavior which
you are objecting to is already present in the kernel and is
independent of this patch series. This patch series is only connecting
the charging links which were missing for the loop device.

> Of course I did. You specifically mean this bit:
>
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
>
> And with this patch set the dd gets OOM killed because the
> /tmp/backing_file usage accounted to the cgroup goes over 64MB and
> so tmpfs OOM kills the IO...
>

A few queries to better understand your objection:

Let's forget about the cgroup for a second. Let's suppose I am running
this script on a system/VM having 64 MiB of memory. In your opinion,
what should happen?

Next let's add the swap to the 64 MiB system/VM/cgroup and re-run the
script, what should be the correct behavior?

Next replace the tmpfs with any other disk backed file system and
re-run the script in a 64 MiB system/VM/cgroup, what should be the
correct behavior?

> As I said: that's very broken behaviour from a storage stack
> perspective.
>
> > It's OOMing because it's creating 256M worth of tmpfs pages inside a
> > 64M cgroup. It's not killing an innocent bystander, it's killing in
> > the cgroup that is allocating all that memory - after Dan makes sure
> > that memory is accounted to its rightful owner.
>
> What this example does is turn /tmp into thinly provisioned storage
> for $CGROUP via a restricted quota. It has a device size of 512MB,
> but only has physical storage capacity of 64MB, as constrained by
> the cgroup memory limit.  You're dealing with management of -storage
> devices- and -IO error reporting- here, not memory management.
>
> When thin provisioned storage runs out of space - for whatever
> reason - and it cannot resolve the issue by blocking, then it *must*
> return ENOSPC to the IO submitter to tell it the IO has failed. This
> is no different to if we set a quota on /tmp/backing_file and it is
> exhausted at 64MB of data written - we fail the IO with ENOSPC or
> EDQUOT, depending on which quota we used.
>
> IOWs, we have solid expectations on how block devices report
> unsolvable resource shortages during IO because those errors have to
> be handled correctly by whatever is submitting the IO. We do not use
> the OOM-killer to report or resolve ENOSPC errors.
>
> Indeed, once you've killed the dd, that CGROUP still consumes 64MB
> of tmpfs space in /tmp/backing_file, in which case any further IO to
> $LOOP_DEV is also going to OOM kill. These are horrible semantics
> for reporting errors to userspace.
>
> And if the OOM-killer actually frees the 64MB of data written to
> /tmp/backing_file through the $LOOP_DEV, then you're actually
> corrupting the storage and ensuring that nobody can read the data
> that was written to $LOOP_DEV.
>
> So now lets put a filesystem on $LOOP_DEV in the above example, and
> write out data to the filesystem which does IO to $LOOP_DEV. Just by
> chance, the filesystem does some metadata writes iin the context of
> the user process doing the writes (because journalling, etc) and
> that metadata IO is what pushes the loop device over the cgroup's
> memory limit.
>
> You kill the user application even though it wasn't directly
> responsible for going over the 64MB limit of space in $LOOP_DEV.
> What happens now to the filesystem's metadata IO?  Did $LOOP_DEV
> return an error, or after the OOM kill did the IO succeed?  What
> happens if all the IO being generated from the user application is
> metadata and that starts failing - killing the user application
> doesn't help anything - the filesystem IO is failing and that's
> where the errors need to be reported.
>
> And if the answer is "metadata IO isn't accounted to the $CGROUP"
> then what happens when the tmpfs actually runs out of it's 512MB of
> space because of all the metadata the filesystem wrote (e.g. create
> lots of zero length files)? That's an ENOSPC error, and we'll get
> that from 

Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-05 Thread Shakeel Butt
On Tue, May 5, 2020 at 12:13 AM Michal Hocko  wrote:
>
> On Mon 04-05-20 12:23:51, Shakeel Butt wrote:
> [...]
> > *Potentially* useful for debugging versus actually beneficial for
> > "sweep before tear down" use-case.
>
> I definitely do not want to prevent you from achieving what you
> want/need. Let's get back to your argument on why you cannot use
> memory.high for this purpose and what is the actual difference from
> memory.max on the "sweep before removal". You've said
>
> : Yes that would work but remote charging concerns me. Remote charging
> : can still happen after the memcg is offlined and at the moment, high
> : reclaim does not work for remote memcg and the usage can go till max
> : or global pressure. This is most probably a misconfiguration and we
> : might not receive the warnings in the log ever. Setting memory.max to
> : 0 will definitely give such warnings.
>
> So essentially the only reason you are not using memory.high which would
> effectively achieve the same level of reclaim for your usecase is that
> potential future remote charges could get unnoticed.

Yes.

> I have proposed to
> warn when charging to an offline memcg because that looks like a sign of
> bug to me.

Instead of a bug, I would say misconfiguration, but there is at least
one genuine user, i.e. buffer_head. It can be allocated during reclaim
and trigger remote charging, but that should be fine as the page it is
attached to will likely get freed soon. So, I don't think we want to
warn for all remote charges to an offlined memcg.

> Having the hard limit clamped to 0 (or some other small
> value) would complain loud by the oom report and no eligible tasks
> message but it will unlikely help to stop such a usage because, well,
> there is nothing reclaimable and we force the charge in that case. So
> you are effectively in the memory.high like situation.

Yes, effectively it will be similar to memory.high but at least we
will get early warnings.

Now, rethinking the remote charging of buffer_head to an offlined
memcg with memory.max=0: it seems special in the sense that it uses
__GFP_NOFAIL and will skip the oom-killer and thus the warnings. Maybe
the right approach is, as you suggested, to always warn when charging
an offlined memcg unless the allocation is __GFP_NOFAIL or
__GFP_RETRY_MAYFAIL, as in the rough sketch below. Though I am not
sure if this is doable without code duplication.
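
A minimal sketch of that check somewhere in the charge path (the
placement and the exact gfp test are assumptions, not a tested patch):

	/*
	 * Hypothetical: complain about charges landing in an offlined
	 * memcg, except for allocations like __GFP_NOFAIL that bypass
	 * the OOM path (and hence its warnings) by design.
	 */
	if (unlikely(!mem_cgroup_online(memcg)) &&
	    !(gfp_mask & (__GFP_NOFAIL | __GFP_RETRY_MAYFAIL)))
		pr_warn_once("charging an offlined memcg\n");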

>
> So instead of potentially removing a useful information can we focus on
> the remote charging side of the problem and deal with it in a sensible
> way? That would make memory.high usable for your usecase and I still
> believe that this is what you should be using in the first place.

We talked about this at LSFMM'19 and I think the decision was to not
fix high reclaim for remote memcgs until it becomes an actual issue. I
suppose now we can treat it as an actual issue.

There are a couple of open questions:
1) Should the remote chargers be throttled and do the high reclaim?
2) There can be multiple remote charges to multiple memcgs in a single
kernel entry. Should we handle such scenarios?

Shakeel


Re: [PATCH] memcg: oom: ignore oom warnings from memory.max

2020-05-05 Thread Shakeel Butt
On Tue, May 5, 2020 at 8:27 AM Johannes Weiner  wrote:
>
> On Mon, May 04, 2020 at 12:23:51PM -0700, Shakeel Butt wrote:
> > On Mon, May 4, 2020 at 9:06 AM Michal Hocko  wrote:
> > > I really hate to repeat myself but this is no different from a regular
> > > oom situation.
> >
> > Conceptually yes there is no difference but there is no *divine
> > restriction* to not make a difference if there is a real world
> > use-case which would benefit from it.
>
> I would wholeheartedly agree with this in general.
>
> However, we're talking about the very semantics that set memory.max
> apart from memory.high: triggering OOM kills to enforce the limit.
>
> > > when the kernel cannot act and mentions that along with the
> > > oom report so that whoever consumes that information can debug or act on
> > > that fact.
> > >
> > > Silencing the oom report is simply removing a potentially useful
> > > aid to debug further a potential problem.
> >
> > *Potentially* useful for debugging versus actually beneficial for
> > "sweep before tear down" use-case. Also I am not saying to make "no
> > dumps for memory.max when no eligible tasks" a set in stone rule. We
> > can always reevaluate when such information will actually be useful.
> >
> > Johannes/Andrew, what's your opinion?
>
> I still think that if you want to sweep without triggering OOMs,
> memory.high has the matching semantics.
>
> As you pointed out, it doesn't work well for foreign charges, but that
> is more of a limitation in the implementation than in the semantics:
>
> /*
>  * If the hierarchy is above the normal consumption range, schedule
>  * reclaim on returning to userland.  We can perform reclaim here
>  * if __GFP_RECLAIM but let's always punt for simplicity and so that
>  * GFP_KERNEL can consistently be used during reclaim.  @memcg is
>  * not recorded as it most likely matches current's and won't
>  * change in the meantime.  As high limit is checked again before
>  * reclaim, the cost of mismatch is negligible.
>  */
>
> Wouldn't it be more useful to fix that instead? It shouldn't be much
> of a code change to do sync reclaim in try_charge().

Sync reclaim would really simplify the remote charging case. Though,
should sync reclaim be done only for remote charging or for all
charges?

>
> Then you could express all things that you asked for without changing
> any user-visible semantics: sweep an empty cgroup as well as possible,
> do not oom on remaining charges that continue to be used by processes
> outside the cgroup, do trigger oom on new foreign charges appearing
> due to a misconfiguration.
>
> echo 0 > memory.high
> cat memory.current > memory.max
>
> Would this work for you?

Yes that would work. I will work on a patch.
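
The rough direction I have in mind for try_charge(), purely as an
illustration (the "mem_high" condition and the exact remoteness test
are assumptions about the surrounding code, not a tested patch):

	/*
	 * If the high limit was breached and the charge is remote
	 * (current's mm does not belong to the target memcg), reclaim
	 * synchronously here instead of punting to
	 * mem_cgroup_handle_over_high() on return to userspace, which
	 * only ever runs for current's own memcg.
	 */
	if (mem_high && gfpflags_allow_blocking(gfp_mask) &&
	    !(current->mm && mm_match_cgroup(current->mm, memcg)))
		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
					     true);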


Re: [PATCH] memcg: expose root cgroup's memory.stat

2020-05-15 Thread Shakeel Butt
On Fri, May 15, 2020 at 6:24 AM Johannes Weiner  wrote:
>
> On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner  wrote:
> > > >
> > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > > not updated consistently at the system level and the ratio of these 
> > > > > are
> > > > > not very meaningful. The pgsteal and pgscan are updated for only 
> > > > > global
> > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > reclaim.
> > > > >
> > > > > Please note that this difference is only for system level vmstats. The
> > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > cgroup's pgsteal contains number of reclaimed pages for global as well
> > > > > as cgroup reclaim. So, one way to get the system level stats is to get
> > > > > these stats from root's memory.stat, so, expose memory.stat for the 
> > > > > root
> > > > > cgroup.
> > > > >
> > > > >   from Johannes Weiner:
> > > > >   There are subtle differences between /proc/vmstat and
> > > > >   memory.stat, and cgroup-aware code that wants to watch the full
> > > > >   hierarchy currently has to know about these intricacies and
> > > > >   translate semantics back and forth.
> >
> > Can we have those subtle differences documented please?
> >
> > > > >
> > > > >   Generally having the fully recursive memory.stat at the root
> > > > >   level could help a broader range of usecases.
> > > >
> > > > The changelog begs the question why we don't just "fix" the
> > > > system-level stats. It may be useful to include the conclusions from
> > > > that discussion, and why there is value in keeping the stats this way.
> > > >
> > >
> > > Right. Andrew, can you please add the following para to the changelog?
> > >
> > > Why not fix the stats by including both the global and cgroup reclaim
> > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > the benefit of having metrics exposing the activity that happens
> > > purely due to machine capacity rather than localized activity that
> > > happens due to the limits throughout the cgroup tree. Additionally
> > > there are userspace tools like sysstat(sar) which reads these stats to
> > > inform about the system level reclaim activity. So, we should not
> > > break such use-cases.
> > >
> > > > > Signed-off-by: Shakeel Butt 
> > > > > Suggested-by: Johannes Weiner 
> > > >
> > > > Acked-by: Johannes Weiner 
> > >
> > > Thanks a lot.
> >
> > I was quite surprised that the patch is so simple TBH. For some reason
> > I've still had memories that we do not account for root memcg (likely
> > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > are slightly different here.
>
> Yep, we skip the page_counter for root, but keep in mind that cgroup1
> *does* have a root-level memory.stat, so (for the most part) we've
> been keeping consumer stats for the root level the whole time.
>
> > counters because they are not really all the same. E.g.
> > - mem_cgroup_charge_statistics accounts for each memcg
>
> Yep, that's heritage from cgroup1.
>
> > - memcg_charge_kernel_stack relies on pages being associated with a
> >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> >   out on root memcg
>
> You're right. It should only bypass the page_counter, but still set
> page->mem_cgroup = root_mem_cgroup, just like user pages.
>
> This counter also doesn't get exported on cgroup1, so it would indeed
> be a new bug. It needs to be fixed before this patch here.
>
> > - memcg_charge_slab (NR_SLAB*) skips over root memcg as well
>
> Same thing with these two.

Yes, we skip page_counter for root but not the stats. I will go over
all the stats and make sure we are not skipping the stats for root.


Re: [PATCH] memcg: expose root cgroup's memory.stat

2020-05-15 Thread Shakeel Butt
On Fri, May 15, 2020 at 8:00 AM Roman Gushchin  wrote:
>
> On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner  wrote:
> > >
> > > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner  
> > > > > wrote:
> > > > > >
> > > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > > One way to measure the efficiency of memory reclaim is to look at 
> > > > > > > the
> > > > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these 
> > > > > > > stats are
> > > > > > > not updated consistently at the system level and the ratio of 
> > > > > > > these are
> > > > > > > not very meaningful. The pgsteal and pgscan are updated for only 
> > > > > > > global
> > > > > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > > > > reclaim.
> > > > > > >
> > > > > > > Please note that this difference is only for system level 
> > > > > > > vmstats. The
> > > > > > > cgroup stats returned by memory.stat are actually consistent. The
> > > > > > > cgroup's pgsteal contains number of reclaimed pages for global as 
> > > > > > > well
> > > > > > > as cgroup reclaim. So, one way to get the system level stats is 
> > > > > > > to get
> > > > > > > these stats from root's memory.stat, so, expose memory.stat for 
> > > > > > > the root
> > > > > > > cgroup.
> > > > > > >
> > > > > > >   from Johannes Weiner:
> > > > > > >   There are subtle differences between /proc/vmstat and
> > > > > > >   memory.stat, and cgroup-aware code that wants to watch the 
> > > > > > > full
> > > > > > >   hierarchy currently has to know about these intricacies and
> > > > > > >   translate semantics back and forth.
> > > >
> > > > Can we have those subtle differences documented please?
> > > >
> > > > > > >
> > > > > > >   Generally having the fully recursive memory.stat at the root
> > > > > > >   level could help a broader range of usecases.
> > > > > >
> > > > > > The changelog begs the question why we don't just "fix" the
> > > > > > system-level stats. It may be useful to include the conclusions from
> > > > > > that discussion, and why there is value in keeping the stats this 
> > > > > > way.
> > > > > >
> > > > >
> > > > > Right. Andrew, can you please add the following para to the changelog?
> > > > >
> > > > > Why not fix the stats by including both the global and cgroup reclaim
> > > > > activity instead of exposing root cgroup's memory.stat? The reason is
> > > > > the benefit of having metrics exposing the activity that happens
> > > > > purely due to machine capacity rather than localized activity that
> > > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > > there are userspace tools like sysstat(sar) which reads these stats to
> > > > > inform about the system level reclaim activity. So, we should not
> > > > > break such use-cases.
> > > > >
> > > > > > > Signed-off-by: Shakeel Butt 
> > > > > > > Suggested-by: Johannes Weiner 
> > > > > >
> > > > > > Acked-by: Johannes Weiner 
> > > > >
> > > > > Thanks a lot.
> > > >
> > > > I was quite surprised that the patch is so simple TBH. For some reason
> > > > I've still had memories that we do not account for root memcg (likely
> > > > because mem_cgroup_is_root(memcg) bail out in the try_charge. But stats
> > > > are slightly different here.
> > >
> > > Yep, we skip the page_counter for root, but keep in mind that cgroup1
> > > *does* have a root-level memory.stat, so (for the most part) we've
> > > been keeping consumer stats for the root level the whole time.
> > >
> > > > counters because they are not really all the same. E.g.
> > > > - mem_cgroup_charge_statistics accounts for each memcg
> > >
> > > Yep, that's heritage from cgroup1.
> > >
> > > > - memcg_charge_kernel_stack relies on pages being associated with a
> > > >   memcg and that in turn relies on __memcg_kmem_charge_page which bails
> > > >   out on root memcg
> > >
> > > You're right. It should only bypass the page_counter, but still set
> > > page->mem_cgroup = root_mem_cgroup, just like user pages.
>
> What about kernel threads? We consider them belonging to the root memory
> cgroup. Should their memory consumption being considered in root-level stats?
>
> I'm not sure we really want it, but I guess we need to document how
> kernel threads are handled.
>

What will be the cons of updating root-level stats for kthreads?

> > >
> > > This counter also doesn't get exported on cgroup1, so it would indeed
> > > be a new bug. It needs to be fixed before this patch here.
> > >
> > > > - memcg_charge_slab (NR_SLAB*) skips over root memcg as well
> > >
> > > Same thing with these two.
> >
> > Yes, we skip page_counter for root but not the stats. I will go over
> > all the stats and make sure we are not skipping the stats for root.


Re: [PATCH] memcg: expose root cgroup's memory.stat

2020-05-15 Thread Shakeel Butt
On Fri, May 15, 2020 at 11:09 AM Roman Gushchin  wrote:
>
> On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2020 at 8:00 AM Roman Gushchin  wrote:
> > >
> > > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner  
> > > > wrote:
> > > > >
> > > > > On Fri, May 15, 2020 at 10:29:55AM +0200, Michal Hocko wrote:
> > > > > > On Sat 09-05-20 07:06:38, Shakeel Butt wrote:
> > > > > > > On Fri, May 8, 2020 at 2:44 PM Johannes Weiner 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > > > > > > > > One way to measure the efficiency of memory reclaim is to 
> > > > > > > > > look at the
> > > > > > > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these 
> > > > > > > > > stats are
> > > > > > > > > not updated consistently at the system level and the ratio of 
> > > > > > > > > these are
> > > > > > > > > not very meaningful. The pgsteal and pgscan are updated for 
> > > > > > > > > only global
> > > > > > > > > reclaim while pgrefill gets updated for global as well as 
> > > > > > > > > cgroup
> > > > > > > > > reclaim.
> > > > > > > > >
> > > > > > > > > Please note that this difference is only for system level 
> > > > > > > > > vmstats. The
> > > > > > > > > cgroup stats returned by memory.stat are actually consistent. 
> > > > > > > > > The
> > > > > > > > > cgroup's pgsteal contains number of reclaimed pages for 
> > > > > > > > > global as well
> > > > > > > > > as cgroup reclaim. So, one way to get the system level stats 
> > > > > > > > > is to get
> > > > > > > > > these stats from root's memory.stat, so, expose memory.stat 
> > > > > > > > > for the root
> > > > > > > > > cgroup.
> > > > > > > > >
> > > > > > > > >   from Johannes Weiner:
> > > > > > > > >   There are subtle differences between /proc/vmstat and
> > > > > > > > >   memory.stat, and cgroup-aware code that wants to watch 
> > > > > > > > > the full
> > > > > > > > >   hierarchy currently has to know about these intricacies 
> > > > > > > > > and
> > > > > > > > >   translate semantics back and forth.
> > > > > >
> > > > > > Can we have those subtle differences documented please?
> > > > > >
> > > > > > > > >
> > > > > > > > >   Generally having the fully recursive memory.stat at the 
> > > > > > > > > root
> > > > > > > > >   level could help a broader range of usecases.
> > > > > > > >
> > > > > > > > The changelog begs the question why we don't just "fix" the
> > > > > > > > system-level stats. It may be useful to include the conclusions 
> > > > > > > > from
> > > > > > > > that discussion, and why there is value in keeping the stats 
> > > > > > > > this way.
> > > > > > > >
> > > > > > >
> > > > > > > Right. Andrew, can you please add the following para to the 
> > > > > > > changelog?
> > > > > > >
> > > > > > > Why not fix the stats by including both the global and cgroup 
> > > > > > > reclaim
> > > > > > > activity instead of exposing root cgroup's memory.stat? The 
> > > > > > > reason is
> > > > > > > the benefit of having metrics exposing the activity that happens
> > > > > > > purely due to machine capacity rather than localized activity that
> > > > > > > happens due to the limits throughout the cgroup tree. Additionally
> > > > >

Re: [PATCH] memcg: expose root cgroup's memory.stat

2020-05-15 Thread Shakeel Butt
On Fri, May 15, 2020 at 11:09 AM Johannes Weiner  wrote:
>
> On Fri, May 15, 2020 at 10:49:22AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2020 at 8:00 AM Roman Gushchin  wrote:
> > > On Fri, May 15, 2020 at 06:44:44AM -0700, Shakeel Butt wrote:
> > > > On Fri, May 15, 2020 at 6:24 AM Johannes Weiner  
> > > > wrote:
> > > > > You're right. It should only bypass the page_counter, but still set
> > > > > page->mem_cgroup = root_mem_cgroup, just like user pages.
> > >
> > > What about kernel threads? We consider them belonging to the root memory
> > > cgroup. Should their memory consumption being considered in root-level 
> > > stats?
> > >
> > > I'm not sure we really want it, but I guess we need to document how
> > > kernel threads are handled.
> >
> > What will be the cons of updating root-level stats for kthreads?
>
> Should kernel threads be doing GFP_ACCOUNT allocations without
> memalloc_use_memcg()? GFP_ACCOUNT implies that the memory consumption
> can be significant and should be attributed to userspace activity.
>
> If the kernel thread has no userspace entity to blame, it seems to
> imply the same thing as a !GFP_ACCOUNT allocation: shared public
> infrastructure, not interesting to account to any specific cgroup.
>
> I'm not sure if we have such allocations right now. But IMO we should
> not account anything from kthreads, or interrupts for that matter,
> /unless/ there is a specific active_memcg that was set by the kthread
> or the interrupt.

I totally agree with you, but I think your response is about memory
charging in IRQ/kthread context, a topic of a parallel patch from
Zefan Li at [1].

Here we are discussing the stats update for kthreads, e.g. should we
update the root memcg's MEMCG_KERNEL_STACK_KB stat when we allocate
the stack for a kernel thread?

[1] http://lkml.kernel.org/r/3a721f62-5a66-8bc5-247b-5c8b7c51c...@huawei.com


[PATCH] net/packet: simply allocations in alloc_one_pg_vec_page

2020-05-15 Thread Shakeel Butt
Currently the initial allocation for pg_vec buffers is done through the
page allocator with __GFP_NORETRY, the first fallback is vzalloc and
the second fallback is the page allocator without __GFP_NORETRY.

First, there is no need to do vzalloc if the order is 0, and second,
vzalloc internally uses GFP_KERNEL for each individual page allocation
and thus there is no need to have any fallback after vzalloc.

Signed-off-by: Shakeel Butt 
---
 net/packet/af_packet.c | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 29bd405adbbd..d6f96b9f5b01 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -4243,19 +4243,12 @@ static char *alloc_one_pg_vec_page(unsigned long order)
if (buffer)
return buffer;
 
-   /* __get_free_pages failed, fall back to vmalloc */
-   buffer = vzalloc(array_size((1 << order), PAGE_SIZE));
-   if (buffer)
-   return buffer;
+   /* __get_free_pages failed, fall back to vmalloc for high order. */
+   if (order)
+   return vzalloc(array_size((1 << order), PAGE_SIZE));
 
-   /* vmalloc failed, lets dig into swap here */
gfp_flags &= ~__GFP_NORETRY;
-   buffer = (char *) __get_free_pages(gfp_flags, order);
-   if (buffer)
-   return buffer;
-
-   /* complete and utter failure */
-   return NULL;
+   return (char *)__get_free_pages(gfp_flags, order);
 }
 
 static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH 1/3] mm: swap: fix vmstats for huge pages

2020-05-08 Thread Shakeel Butt
Many of the callbacks called by pagevec_lru_move_fn() do not correctly
update the vmstats for huge pages. Fix that. Also make
__pagevec_lru_add_fn() use the irq-unsafe alternative to update the
stats as irqs are already disabled.

Signed-off-by: Shakeel Butt 
---
 mm/swap.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index a37bd7b202ac..3dbef6517cac 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -225,7 +225,7 @@ static void pagevec_move_tail_fn(struct page *page, struct 
lruvec *lruvec,
del_page_from_lru_list(page, lruvec, page_lru(page));
ClearPageActive(page);
add_page_to_lru_list_tail(page, lruvec, page_lru(page));
-   (*pgmoved)++;
+   (*pgmoved) += hpage_nr_pages(page);
}
 }
 
@@ -285,7 +285,7 @@ static void __activate_page(struct page *page, struct 
lruvec *lruvec,
add_page_to_lru_list(page, lruvec, lru);
trace_mm_lru_activate(page);
 
-   __count_vm_event(PGACTIVATE);
+   __count_vm_events(PGACTIVATE, hpage_nr_pages(page));
update_page_reclaim_stat(lruvec, file, 1);
}
 }
@@ -503,6 +503,7 @@ static void lru_deactivate_file_fn(struct page *page, 
struct lruvec *lruvec,
 {
int lru, file;
bool active;
+   int nr_pages = hpage_nr_pages(page);
 
if (!PageLRU(page))
return;
@@ -536,11 +537,11 @@ static void lru_deactivate_file_fn(struct page *page, 
struct lruvec *lruvec,
 * We moves tha page into tail of inactive.
 */
add_page_to_lru_list_tail(page, lruvec, lru);
-   __count_vm_event(PGROTATED);
+   __count_vm_events(PGROTATED, nr_pages);
}
 
if (active)
-   __count_vm_event(PGDEACTIVATE);
+   __count_vm_events(PGDEACTIVATE, nr_pages);
update_page_reclaim_stat(lruvec, file, 0);
 }
 
@@ -929,6 +930,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct 
lruvec *lruvec,
 {
enum lru_list lru;
int was_unevictable = TestClearPageUnevictable(page);
+   int nr_pages = hpage_nr_pages(page);
 
VM_BUG_ON_PAGE(PageLRU(page), page);
 
@@ -966,13 +968,13 @@ static void __pagevec_lru_add_fn(struct page *page, 
struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, page_is_file_lru(page),
 PageActive(page));
if (was_unevictable)
-   count_vm_event(UNEVICTABLE_PGRESCUED);
+   __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
} else {
lru = LRU_UNEVICTABLE;
ClearPageActive(page);
SetPageUnevictable(page);
if (!was_unevictable)
-   count_vm_event(UNEVICTABLE_PGCULLED);
+   __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
}
 
add_page_to_lru_list(page, lruvec, lru);
-- 
2.26.2.645.ge9eca65c58-goog



[PATCH 2/3] mm: swap: memcg: fix memcg stats for huge pages

2020-05-08 Thread Shakeel Butt
The commit 2262185c5b28 ("mm: per-cgroup memory reclaim stats") added
PGLAZYFREE, PGACTIVATE & PGDEACTIVATE stats for cgroups but missed a
couple of places, and PGLAZYFREE missed huge page handling. Fix that.
Also for PGLAZYFREE use the irq-unsafe function to update the stat as
irqs are already disabled.

Fixes: 2262185c5b28 ("mm: per-cgroup memory reclaim stats")
Signed-off-by: Shakeel Butt 
---
 mm/swap.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 3dbef6517cac..4eb179ee0b72 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -278,6 +278,7 @@ static void __activate_page(struct page *page, struct 
lruvec *lruvec,
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
int file = page_is_file_lru(page);
int lru = page_lru_base_type(page);
+   int nr_pages = hpage_nr_pages(page);
 
del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
@@ -285,7 +286,8 @@ static void __activate_page(struct page *page, struct 
lruvec *lruvec,
add_page_to_lru_list(page, lruvec, lru);
trace_mm_lru_activate(page);
 
-   __count_vm_events(PGACTIVATE, hpage_nr_pages(page));
+   __count_vm_events(PGACTIVATE, nr_pages);
+   __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, 
nr_pages);
update_page_reclaim_stat(lruvec, file, 1);
}
 }
@@ -540,8 +542,10 @@ static void lru_deactivate_file_fn(struct page *page, 
struct lruvec *lruvec,
__count_vm_events(PGROTATED, nr_pages);
}
 
-   if (active)
+   if (active) {
__count_vm_events(PGDEACTIVATE, nr_pages);
+   __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 
nr_pages);
+   }
update_page_reclaim_stat(lruvec, file, 0);
 }
 
@@ -551,13 +555,15 @@ static void lru_deactivate_fn(struct page *page, struct 
lruvec *lruvec,
if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
int file = page_is_file_lru(page);
int lru = page_lru_base_type(page);
+   int nr_pages = hpage_nr_pages(page);
 
del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
ClearPageActive(page);
ClearPageReferenced(page);
add_page_to_lru_list(page, lruvec, lru);
 
-   __count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+   __count_vm_events(PGDEACTIVATE, nr_pages);
+   __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 
nr_pages);
update_page_reclaim_stat(lruvec, file, 0);
}
 }
@@ -568,6 +574,7 @@ static void lru_lazyfree_fn(struct page *page, struct 
lruvec *lruvec,
if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
!PageSwapCache(page) && !PageUnevictable(page)) {
bool active = PageActive(page);
+   int nr_pages = hpage_nr_pages(page);
 
del_page_from_lru_list(page, lruvec,
   LRU_INACTIVE_ANON + active);
@@ -581,8 +588,8 @@ static void lru_lazyfree_fn(struct page *page, struct 
lruvec *lruvec,
ClearPageSwapBacked(page);
add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
 
-   __count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
-   count_memcg_page_event(page, PGLAZYFREE);
+   __count_vm_events(PGLAZYFREE, nr_pages);
+   __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, 
nr_pages);
update_page_reclaim_stat(lruvec, 1, 0);
}
 }
-- 
2.26.2.645.ge9eca65c58-goog



[PATCH 3/3] mm: swap: fix update_page_reclaim_stat for huge pages

2020-05-08 Thread Shakeel Butt
Currently update_page_reclaim_stat() updates the lruvec.reclaim_stats
just once per page, irrespective of whether the page is huge or not.
Fix that by passing hpage_nr_pages(page) to it.

Signed-off-by: Shakeel Butt 
---
 mm/swap.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 4eb179ee0b72..dc7297cb76a0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -262,14 +262,14 @@ void rotate_reclaimable_page(struct page *page)
}
 }
 
-static void update_page_reclaim_stat(struct lruvec *lruvec,
-int file, int rotated)
+static void update_page_reclaim_stat(struct lruvec *lruvec, int file,
+int rotated, int nr_pages)
 {
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-   reclaim_stat->recent_scanned[file]++;
+   reclaim_stat->recent_scanned[file] += nr_pages;
if (rotated)
-   reclaim_stat->recent_rotated[file]++;
+   reclaim_stat->recent_rotated[file] += nr_pages;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -288,7 +288,7 @@ static void __activate_page(struct page *page, struct 
lruvec *lruvec,
 
__count_vm_events(PGACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, 
nr_pages);
-   update_page_reclaim_stat(lruvec, file, 1);
+   update_page_reclaim_stat(lruvec, file, 1, nr_pages);
}
 }
 
@@ -546,7 +546,7 @@ static void lru_deactivate_file_fn(struct page *page, 
struct lruvec *lruvec,
__count_vm_events(PGDEACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 
nr_pages);
}
-   update_page_reclaim_stat(lruvec, file, 0);
+   update_page_reclaim_stat(lruvec, file, 0, nr_pages);
 }
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
@@ -564,7 +564,7 @@ static void lru_deactivate_fn(struct page *page, struct 
lruvec *lruvec,
 
__count_vm_events(PGDEACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 
nr_pages);
-   update_page_reclaim_stat(lruvec, file, 0);
+   update_page_reclaim_stat(lruvec, file, 0, nr_pages);
}
 }
 
@@ -590,7 +590,7 @@ static void lru_lazyfree_fn(struct page *page, struct 
lruvec *lruvec,
 
__count_vm_events(PGLAZYFREE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, 
nr_pages);
-   update_page_reclaim_stat(lruvec, 1, 0);
+   update_page_reclaim_stat(lruvec, 1, 0, nr_pages);
}
 }
 
@@ -928,7 +928,7 @@ void lru_add_page_tail(struct page *page, struct page 
*page_tail,
}
 
if (!PageUnevictable(page))
-   update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
+   update_page_reclaim_stat(lruvec, file, PageActive(page_tail), 
1);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -973,7 +973,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct 
lruvec *lruvec,
if (page_evictable(page)) {
lru = page_lru(page);
update_page_reclaim_stat(lruvec, page_is_file_lru(page),
-PageActive(page));
+PageActive(page), nr_pages);
if (was_unevictable)
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
} else {
-- 
2.26.2.645.ge9eca65c58-goog



Re: [PATCH 3/3] mm: swap: fix update_page_reclaim_stat for huge pages

2020-05-09 Thread Shakeel Butt
On Fri, May 8, 2020 at 2:51 PM Johannes Weiner  wrote:
>
> On Fri, May 08, 2020 at 02:22:15PM -0700, Shakeel Butt wrote:
> > Currently update_page_reclaim_stat() updates the lruvec.reclaim_stats
> > just once for a page irrespective if a page is huge or not. Fix that by
> > passing the hpage_nr_pages(page) to it.
> >
> > Signed-off-by: Shakeel Butt 
>
> https://lore.kernel.org/patchwork/patch/685703/
>
> Laughs, then cries.
>

What happened to that patch? Fell through the cracks?

> > @@ -928,7 +928,7 @@ void lru_add_page_tail(struct page *page, struct page 
> > *page_tail,
> >   }
> >
> >   if (!PageUnevictable(page))
> > - update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
> > + update_page_reclaim_stat(lruvec, file, PageActive(page_tail), 
> > 1);
>
> The change to __pagevec_lru_add_fn() below makes sure the tail pages
> are already accounted. This would make them count twice.
>

Yes, you are right. I will just re-send your patch after rebase.



> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > @@ -973,7 +973,7 @@ static void __pagevec_lru_add_fn(struct page *page, 
> > struct lruvec *lruvec,
> >   if (page_evictable(page)) {
> >   lru = page_lru(page);
> >   update_page_reclaim_stat(lruvec, page_is_file_lru(page),
> > -  PageActive(page));
> > +  PageActive(page), nr_pages);
> >   if (was_unevictable)
> >   __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
> >   } else {


Re: [PATCH] memcg: expose root cgroup's memory.stat

2020-05-09 Thread Shakeel Butt
On Fri, May 8, 2020 at 2:44 PM Johannes Weiner  wrote:
>
> On Fri, May 08, 2020 at 10:06:30AM -0700, Shakeel Butt wrote:
> > One way to measure the efficiency of memory reclaim is to look at the
> > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > not updated consistently at the system level and the ratio of these are
> > not very meaningful. The pgsteal and pgscan are updated for only global
> > reclaim while pgrefill gets updated for global as well as cgroup
> > reclaim.
> >
> > Please note that this difference is only for system level vmstats. The
> > cgroup stats returned by memory.stat are actually consistent. The
> > cgroup's pgsteal contains number of reclaimed pages for global as well
> > as cgroup reclaim. So, one way to get the system level stats is to get
> > these stats from root's memory.stat, so, expose memory.stat for the root
> > cgroup.
> >
> >   from Johannes Weiner:
> >   There are subtle differences between /proc/vmstat and
> >   memory.stat, and cgroup-aware code that wants to watch the full
> >   hierarchy currently has to know about these intricacies and
> >   translate semantics back and forth.
> >
> >   Generally having the fully recursive memory.stat at the root
> >   level could help a broader range of usecases.
>
> The changelog begs the question why we don't just "fix" the
> system-level stats. It may be useful to include the conclusions from
> that discussion, and why there is value in keeping the stats this way.
>

Right. Andrew, can you please add the following para to the changelog?

Why not fix the stats by including both the global and cgroup reclaim
activity instead of exposing root cgroup's memory.stat? The reason is
the benefit of having metrics exposing the activity that happens
purely due to machine capacity rather than localized activity that
happens due to the limits throughout the cgroup tree. Additionally
there are userspace tools like sysstat(sar) which reads these stats to
inform about the system level reclaim activity. So, we should not
break such use-cases.

> > Signed-off-by: Shakeel Butt 
> > Suggested-by: Johannes Weiner 
>
> Acked-by: Johannes Weiner 

Thanks a lot.


[PATCH] mm: fix LRU balancing effect of new transparent huge pages

2020-05-09 Thread Shakeel Butt
From: Johannes Weiner 

Currently, THP are counted as single pages until they are split right
before being swapped out. However, at that point the VM is already in
the middle of reclaim, and adjusting the LRU balance then is useless.

Always account THP by the number of basepages, and remove the fixup
from the splitting path.

Signed-off-by: Johannes Weiner 
Signed-off-by: Shakeel Butt 
---
Revived the patch from https://lore.kernel.org/patchwork/patch/685703/

 mm/swap.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 4eb179ee0b72..b75c0ce90418 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -262,14 +262,14 @@ void rotate_reclaimable_page(struct page *page)
}
 }
 
-static void update_page_reclaim_stat(struct lruvec *lruvec,
-int file, int rotated)
+static void update_page_reclaim_stat(struct lruvec *lruvec, int file,
+int rotated, int nr_pages)
 {
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-   reclaim_stat->recent_scanned[file]++;
+   reclaim_stat->recent_scanned[file] += nr_pages;
if (rotated)
-   reclaim_stat->recent_rotated[file]++;
+   reclaim_stat->recent_rotated[file] += nr_pages;
 }
 
 static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -288,7 +288,7 @@ static void __activate_page(struct page *page, struct 
lruvec *lruvec,
 
__count_vm_events(PGACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, 
nr_pages);
-   update_page_reclaim_stat(lruvec, file, 1);
+   update_page_reclaim_stat(lruvec, file, 1, nr_pages);
}
 }
 
@@ -546,7 +546,7 @@ static void lru_deactivate_file_fn(struct page *page, 
struct lruvec *lruvec,
__count_vm_events(PGDEACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 
nr_pages);
}
-   update_page_reclaim_stat(lruvec, file, 0);
+   update_page_reclaim_stat(lruvec, file, 0, nr_pages);
 }
 
 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
@@ -564,7 +564,7 @@ static void lru_deactivate_fn(struct page *page, struct 
lruvec *lruvec,
 
__count_vm_events(PGDEACTIVATE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, 
nr_pages);
-   update_page_reclaim_stat(lruvec, file, 0);
+   update_page_reclaim_stat(lruvec, file, 0, nr_pages);
}
 }
 
@@ -590,7 +590,7 @@ static void lru_lazyfree_fn(struct page *page, struct 
lruvec *lruvec,
 
__count_vm_events(PGLAZYFREE, nr_pages);
__count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, 
nr_pages);
-   update_page_reclaim_stat(lruvec, 1, 0);
+   update_page_reclaim_stat(lruvec, 1, 0, nr_pages);
}
 }
 
@@ -899,8 +899,6 @@ EXPORT_SYMBOL(__pagevec_release);
 void lru_add_page_tail(struct page *page, struct page *page_tail,
   struct lruvec *lruvec, struct list_head *list)
 {
-   const int file = 0;
-
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -926,9 +924,6 @@ void lru_add_page_tail(struct page *page, struct page 
*page_tail,
add_page_to_lru_list_tail(page_tail, lruvec,
  page_lru(page_tail));
}
-
-   if (!PageUnevictable(page))
-   update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -973,7 +968,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct 
lruvec *lruvec,
if (page_evictable(page)) {
lru = page_lru(page);
update_page_reclaim_stat(lruvec, page_is_file_lru(page),
-PageActive(page));
+PageActive(page), nr_pages);
if (was_unevictable)
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
} else {
-- 
2.26.2.645.ge9eca65c58-goog



[PATCH] mm: memcontrol: fix NULL-ptr deref in percpu stats flush

2019-10-18 Thread Shakeel Butt
__mem_cgroup_free() can be called on the failure path in
mem_cgroup_alloc(). However memcg_flush_percpu_vmstats() and
memcg_flush_percpu_vmevents(), which are called from __mem_cgroup_free(),
access fields of the memcg which can potentially be NULL when coming
from the failure path of mem_cgroup_alloc(). Indeed syzbot has reported
the following crash:

R13: 004bf27d R14: 004db028 R15: 0003
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] PREEMPT SMP KASAN
CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 01/01/2011
RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 
44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 
85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
RSP: 0018:888095c27980 EFLAGS: 00010206
RAX: 0012 RBX: 888095c27b28 RCX: c90008192000
RDX: 0004 RSI: 8340fae7 RDI: 0007
RBP: 888095c27be0 R08:  R09: ed1013f0da33
R10: ed1013f0da32 R11: 88809f86d197 R12: fbfff138b760
R13: dc00 R14: 0090 R15: 0007
FS:  7f5027170700() GS:8880ae80() 
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 00710158 CR3: a7b18000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
mem_cgroup_free mm/memcontrol.c:5033 [inline]
mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
css_create kernel/cgroup/cgroup.c:5156 [inline]
cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
vfs_mkdir+0x42e/0x670 fs/namei.c:3807
do_mkdirat+0x234/0x2a0 fs/namei.c:3830
__do_sys_mkdir fs/namei.c:3846 [inline]
__se_sys_mkdir fs/namei.c:3844 [inline]
__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x459a59
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 
0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f502716fc78 EFLAGS: 0246 ORIG_RAX: 0053
RAX: ffda RBX: 7f502716fc90 RCX: 00459a59
RDX:  RSI:  RDI: 2180
RBP: 0075bf20 R08:  R09: 
R10:  R11: 0246 R12: 7f50271706d4
R13: 004bf27d R14: 004db028 R15: 0003

Fix this by moving the flush to mem_cgroup_free() as there is no need
to flush anything if mem_cgroup_alloc() failed.

Reported-by: syzbot+515d5bcfe179cdf04...@syzkaller.appspotmail.com
Fixes: bb65f89b7d3d ("mm: memcontrol: flush percpu vmevents before releasing 
memcg")
Fixes: c350a99ea2b1 ("mm: memcontrol: flush percpu vmstats before releasing 
memcg")
Signed-off-by: Shakeel Butt 
Cc: Roman Gushchin 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: 
Cc: Andrew Morton 

---
 mm/memcontrol.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdac56009a38..13cb4c1e9f2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5014,12 +5014,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 {
int node;
 
-   /*
-* Flush percpu vmstats and vmevents to guarantee the value correctness
-* on parent's and all ancestor levels.
-*/
-   memcg_flush_percpu_vmstats(memcg, false);
-   memcg_flush_percpu_vmevents(memcg);
for_each_node(node)
free_mem_cgroup_per_node_info(memcg, node);
free_percpu(memcg->vmstats_percpu);
@@ -5030,6 +5024,12 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 static void mem_cgroup_free(struct mem_cgroup *memcg)
 {
memcg_wb_domain_exit(memcg);
+   /*
+* Flush percpu vmstats and vmevents to guarantee the value correctness
+* on parent's and all ancestor levels.
+*/
+   memcg_flush_percpu_vmstats(memcg, false);
+   memcg_flush_percpu_vmevents(memcg);

Re: [PATCH] mm: fix LRU balancing effect of new transparent huge pages

2020-05-11 Thread Shakeel Butt
On Mon, May 11, 2020 at 2:11 PM Andrew Morton  wrote:
>
> On Sat,  9 May 2020 07:19:46 -0700 Shakeel Butt  wrote:
>
> > Currently, THP are counted as single pages until they are split right
> > before being swapped out. However, at that point the VM is already in
> > the middle of reclaim, and adjusting the LRU balance then is useless.
> >
> > Always account THP by the number of basepages, and remove the fixup
> > from the splitting path.
>
> Confused.  What kernel is this applicable to?

It is still applicable to the latest Linux kernel. Basically the
lruvec->reclaim_stat->recent_[scanned|rotated] counters are used as a
heuristic in get_scan_count() to decide how much of the file and anon
LRUs should be scanned by the current reclaim. Currently huge pages are
treated as a single page when these counters are updated in swap.c,
while vmscan.c correctly accounts them as huge pages.
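
For reference, the balancing heuristic in get_scan_count() looks roughly
like the sketch below (paraphrased, not the literal kernel code; details
vary across kernel versions):

	/*
	 * The rotated/scanned ratio of each LRU type scales its scan
	 * pressure, so counting a THP as a single page here skews the
	 * anon vs file balance.
	 */
	anon_prio = swappiness;
	file_prio = 200 - anon_prio;

	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
	ap /= reclaim_stat->recent_rotated[0] + 1;

	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
	fp /= reclaim_stat->recent_rotated[1] + 1;

	fraction[0] = ap;
	fraction[1] = fp;
	denominator = ap + fp + 1;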


Re: [PATCH] memcg: effective memory.high reclaim for remote charging

2020-05-11 Thread Shakeel Butt
On Mon, May 11, 2020 at 8:57 AM Johannes Weiner  wrote:
>
> On Thu, May 07, 2020 at 10:00:07AM -0700, Shakeel Butt wrote:
> > On Thu, May 7, 2020 at 9:47 AM Michal Hocko  wrote:
> > >
> > > On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> > > [...]
> > > > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, 
> > > > gfp_t gfp_mask,
> > > >   schedule_work(&memcg->high_work);
> > > >   break;
> > > >   }
> > > > - current->memcg_nr_pages_over_high += batch;
> > > > - set_notify_resume(current);
> > > > +
> > > > + if (gfpflags_allow_blocking(gfp_mask))
> > > > + reclaim_over_high(memcg, gfp_mask, batch);
> > > > +
> > > > + if (page_counter_read(&memcg->memory) <=
> > > > + READ_ONCE(memcg->high))
> > > > + break;
> > >
> > > I am half way to a long weekend so bear with me. Shouldn't this be 
> > > continue? The
> > > parent memcg might be still in excess even the child got reclaimed,
> > > right?
> > >
> >
> > The reclaim_high() actually already does this walk up to the root and
> > reclaim from ones who are still over their high limit. Though having
> > 'continue' here is correct too.
>
> If reclaim was weak and failed to bring the child back in line, we
> still do set_notify_resume(). We should do that for ancestors too.
>
> But it seems we keep adding hierarchy walks and it's getting somewhat
> convoluted: page_counter does it, then we check high overage
> recursively, and now we add the call to reclaim which itself is a walk
> up the ancestor line.
>
> Can we hitchhike on the page_counter_try_charge() walk, which already
> has the concept of identifying counters with overage? Rename the @fail
> to @limited and return the first counter that is in excess of its high
> as well, even when the function succeeds?
>
> Then we could ditch the entire high checking loop here and simply
> replace it with
>
> done_restock:
> ...
>
> if (*limited) {
> if (gfpflags_allow_blocking())
> reclaim_over_high(memcg_from_counter(limited));
> /* Reclaim may not be able to do much, ... */
> set_notify_resume(); // or schedule_work()
> };
>

I will try to code the above and will give a shot to the following
long-term suggestion as well.

> In the long-term, the best thing might be to integrate memory.high
> reclaim with the regular reclaim that try_charge() is already
> doing. Especially the part where it retries several times - we
> currently give up on memory.high unnecessarily early. Make
> page_counter_try_charge() fail on high and max equally, and after
> several reclaim cycles, instead of invoking the OOM killer, inject the
> penalty sleep and force the charges. OOM killing and throttling is
> supposed to be the only difference between the two, anyway, and yet
> the code diverges far more than that for no apparent reason.
>
> But I also appreciate that this is a cleanup beyond the scope of this
> patch here, so it's up to you how far you want to take it.

Thanks,
Shakeel


Re: [PATCH] mm: fix LRU balancing effect of new transparent huge pages

2020-05-11 Thread Shakeel Butt
On Mon, May 11, 2020 at 2:58 PM Andrew Morton  wrote:
>
> On Mon, 11 May 2020 14:38:23 -0700 Shakeel Butt  wrote:
>
> > On Mon, May 11, 2020 at 2:11 PM Andrew Morton  
> > wrote:
> > >
> > > On Sat,  9 May 2020 07:19:46 -0700 Shakeel Butt  
> > > wrote:
> > >
> > > > Currently, THP are counted as single pages until they are split right
> > > > before being swapped out. However, at that point the VM is already in
> > > > the middle of reclaim, and adjusting the LRU balance then is useless.
> > > >
> > > > Always account THP by the number of basepages, and remove the fixup
> > > > from the splitting path.
> > >
> > > Confused.  What kernel is this applicable to?
> >
> > It is still applicable to the latest Linux kernel.
>
> The patch has
>
> > @@ -288,7 +288,7 @@ static void __activate_page(struct page *page, struct 
> > lruvec *lruvec,
> >
> >   __count_vm_events(PGACTIVATE, nr_pages);
> >   __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, 
> > nr_pages);
> > - update_page_reclaim_stat(lruvec, file, 1);
> > + update_page_reclaim_stat(lruvec, file, 1, nr_pages);
> >   }
> >  }
>
> but current mainline is quite different:
>
> static void __activate_page(struct page *page, struct lruvec *lruvec,
> void *arg)
> {
> if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> int file = page_is_file_lru(page);
> int lru = page_lru_base_type(page);
>
> del_page_from_lru_list(page, lruvec, lru);
> SetPageActive(page);
> lru += LRU_ACTIVE;
> add_page_to_lru_list(page, lruvec, lru);
> trace_mm_lru_activate(page);
>
> __count_vm_event(PGACTIVATE);
> update_page_reclaim_stat(lruvec, file, 1);
> }
> }
>
> q:/usr/src/linux-5.7-rc5> patch -p1 --dry-run < ~/x.txt
> checking file mm/swap.c
> Hunk #2 FAILED at 288.
> Hunk #3 FAILED at 546.
> Hunk #4 FAILED at 564.
> Hunk #5 FAILED at 590.
> Hunk #6 succeeded at 890 (offset -9 lines).
> Hunk #7 succeeded at 915 (offset -9 lines).
> Hunk #8 succeeded at 958 with fuzz 2 (offset -10 lines).
> 4 out of 8 hunks FAILED
>

Oh sorry, my mistake. It is dependent on the first two patches at [1].
Basically I replaced the third patch of the series with this one. I
should have re-sent them all together.

[1] http://lkml.kernel.org/r/20200508212215.181307-1-shake...@google.com


[PATCH] memcg: effective memory.high reclaim for remote charging

2020-05-07 Thread Shakeel Butt
Currently the reclaim of excessive usage over memory.high is scheduled
to run on return to userland. The main reason behind this approach was
simplicity i.e. always reclaim with GFP_KERNEL context. However the
underlying assumptions behind this approach are: the current task shares
the memcg hierarchy with the given memcg and the memcg of the current
task most probably will not change on return to userland.

With remote charging, the first assumption breaks and it allows the
usage to grow way beyond memory.high as the reclaim and the throttling
become ineffective.

This patch forces synchronous reclaim, and potentially throttling, for
callers whose context allows blocking. For callers that cannot block,
or whose synchronous high reclaim was not successful, the high reclaim
is scheduled either on the return-to-userland path if the current task
shares the hierarchy with the given memcg, or on the system workqueue
otherwise.

Signed-off-by: Shakeel Butt 
---
 mm/memcontrol.c | 63 +
 1 file changed, 37 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 317dbbaac603..7abb762f26cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2387,23 +2387,13 @@ static unsigned long calculate_high_delay(struct 
mem_cgroup *memcg,
return min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
 }
 
-/*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
- */
-void mem_cgroup_handle_over_high(void)
+static void reclaim_over_high(struct mem_cgroup *memcg, gfp_t gfp_mask,
+ unsigned long nr_pages)
 {
unsigned long penalty_jiffies;
unsigned long pflags;
-   unsigned int nr_pages = current->memcg_nr_pages_over_high;
-   struct mem_cgroup *memcg;
 
-   if (likely(!nr_pages))
-   return;
-
-   memcg = get_mem_cgroup_from_mm(current->mm);
-   reclaim_high(memcg, nr_pages, GFP_KERNEL);
-   current->memcg_nr_pages_over_high = 0;
+   reclaim_high(memcg, nr_pages, gfp_mask);
 
/*
 * memory.high is breached and reclaim is unable to keep up. Throttle
@@ -2418,7 +2408,7 @@ void mem_cgroup_handle_over_high(void)
 * been aggressively reclaimed enough yet.
 */
if (penalty_jiffies <= HZ / 100)
-   goto out;
+   return;
 
/*
 * If we exit early, we're guaranteed to die (since
@@ -2428,8 +2418,23 @@ void mem_cgroup_handle_over_high(void)
psi_memstall_enter();
schedule_timeout_killable(penalty_jiffies);
psi_memstall_leave();
+}
 
-out:
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+   unsigned int nr_pages = current->memcg_nr_pages_over_high;
+   struct mem_cgroup *memcg;
+
+   if (likely(!nr_pages))
+   return;
+
+   memcg = get_mem_cgroup_from_mm(current->mm);
+   reclaim_over_high(memcg, GFP_KERNEL, nr_pages);
+   current->memcg_nr_pages_over_high = 0;
css_put(&memcg->css);
 }
 
@@ -2584,15 +2589,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
 
-   /*
-* If the hierarchy is above the normal consumption range, schedule
-* reclaim on returning to userland.  We can perform reclaim here
-* if __GFP_RECLAIM but let's always punt for simplicity and so that
-* GFP_KERNEL can consistently be used during reclaim.  @memcg is
-* not recorded as it most likely matches current's and won't
-* change in the meantime.  As high limit is checked again before
-* reclaim, the cost of mismatch is negligible.
-*/
do {
if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) 
{
/* Don't bother a random interrupted task */
@@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
schedule_work(&memcg->high_work);
break;
}
-   current->memcg_nr_pages_over_high += batch;
-   set_notify_resume(current);
+
+   if (gfpflags_allow_blocking(gfp_mask))
+   reclaim_over_high(memcg, gfp_mask, batch);
+
+   if (page_counter_read(&memcg->memory) <=
+   READ_ONCE(memcg->high))
+   break;
+   /*
+* The above reclaim might not be able to do much. Punt
+* the high reclaim to return to userland if the current
+* task shares the hierarchy.
+*/
+   if (current->mm && mm_match_cgroup(current->mm, memcg)) {
+   current->memcg_nr_pages_over_high += batch;
+   set_notify_resume(current);
+   } else
+   schedule_work(&memcg->high_work);
break;
}
} while ((memcg = parent_mem_cgroup(memcg)));
-- 
2.26.2.526.g744177e7f7-goog

Re: [PATCH] memcg: effective memory.high reclaim for remote charging

2020-05-07 Thread Shakeel Butt
On Thu, May 7, 2020 at 9:47 AM Michal Hocko  wrote:
>
> On Thu 07-05-20 09:33:01, Shakeel Butt wrote:
> [...]
> > @@ -2600,8 +2596,23 @@ static int try_charge(struct mem_cgroup *memcg, 
> > gfp_t gfp_mask,
> >   schedule_work(&memcg->high_work);
> >   break;
> >   }
> > - current->memcg_nr_pages_over_high += batch;
> > - set_notify_resume(current);
> > +
> > + if (gfpflags_allow_blocking(gfp_mask))
> > + reclaim_over_high(memcg, gfp_mask, batch);
> > +
> > + if (page_counter_read(&memcg->memory) <=
> > + READ_ONCE(memcg->high))
> > + break;
>
> I am half way to a long weekend so bear with me. Shouldn't this be continue? 
> The
> parent memcg might be still in excess even the child got reclaimed,
> right?
>

The reclaim_high() actually already does this walk up to the root and
reclaim from ones who are still over their high limit. Though having
'continue' here is correct too.
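
For context, reclaim_high() is roughly the following walk (paraphrased
from mm/memcontrol.c of that time frame; details may differ):

	static void reclaim_high(struct mem_cgroup *memcg,
				 unsigned int nr_pages, gfp_t gfp_mask)
	{
		do {
			if (page_counter_read(&memcg->memory) <=
			    READ_ONCE(memcg->high))
				continue;
			memcg_memory_event(memcg, MEMCG_HIGH);
			try_to_free_mem_cgroup_pages(memcg, nr_pages,
						     gfp_mask, true);
		} while ((memcg = parent_mem_cgroup(memcg)));
	}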

> > + /*
> > +  * The above reclaim might not be able to do much. 
> > Punt
> > +  * the high reclaim to return to userland if the 
> > current
> > +  * task shares the hierarchy.
> > +  */
> > + if (current->mm && mm_match_cgroup(current->mm, 
> > memcg)) {
> > + current->memcg_nr_pages_over_high += batch;
> > + set_notify_resume(current);
> > + } else
> > + schedule_work(&memcg->high_work);
> >   break;
> >   }
> >   } while ((memcg = parent_mem_cgroup(memcg)));
> > --
> > 2.26.2.526.g744177e7f7-goog
> >
>
> --
> Michal Hocko
> SUSE Labs


[PATCH] mm: vmscan: consistent update to pgsteal and pgscan

2020-05-07 Thread Shakeel Butt
One way to measure the efficiency of memory reclaim is to look at the
ratio (pgscan+pgrefill)/pgsteal. However at the moment these stats are
not updated consistently at the system level and the ratio of these is
not very meaningful. The pgsteal and pgscan are updated only for global
reclaim while pgrefill gets updated for global as well as cgroup
reclaim.

Please note that this difference is only for system level vmstats. The
cgroup stats returned by memory.stat are actually consistent. The
cgroup's pgsteal contains number of reclaimed pages for global as well
as cgroup reclaim. So, one way to get the system level stats is to get
these stats from root's memory.stat but root does not expose that
interface. Also for !CONFIG_MEMCG machines /proc/vmstat is the only way
to get these stats. So, make these stats consistent.
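
As an illustration (not part of this patch), once the counters are
consistent the ratio can be computed directly from /proc/vmstat with a
small userspace sketch like the one below (counter names as exposed by
the kernels discussed here):

	/* reclaim_ratio.c: print (pgscan + pgrefill) / pgsteal */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		unsigned long long pgscan = 0, pgsteal = 0, pgrefill = 0, val;
		char name[64];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fscanf(f, "%63s %llu", name, &val) == 2) {
			if (!strcmp(name, "pgscan_kswapd") ||
			    !strcmp(name, "pgscan_direct"))
				pgscan += val;
			else if (!strcmp(name, "pgsteal_kswapd") ||
				 !strcmp(name, "pgsteal_direct"))
				pgsteal += val;
			else if (!strcmp(name, "pgrefill"))
				pgrefill = val;
		}
		fclose(f);
		if (pgsteal)
			printf("(pgscan + pgrefill) / pgsteal = %.2f\n",
			       (double)(pgscan + pgrefill) / pgsteal);
		return 0;
	}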

Signed-off-by: Shakeel Butt 
---
 mm/vmscan.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..51f7d1efc912 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
reclaim_stat->recent_scanned[file] += nr_taken;
 
item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
-   if (!cgroup_reclaim(sc))
-   __count_vm_events(item, nr_scanned);
+   __count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
spin_unlock_irq(&pgdat->lru_lock);
 
@@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
 
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
-   if (!cgroup_reclaim(sc))
-   __count_vm_events(item, nr_reclaimed);
+   __count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
-- 
2.26.2.526.g744177e7f7-goog



Re: [PATCH] mm: vmscan: consistent update to pgsteal and pgscan

2020-05-08 Thread Shakeel Butt
On Fri, May 8, 2020 at 3:34 AM Yafang Shao  wrote:
>
> On Fri, May 8, 2020 at 4:49 AM Shakeel Butt  wrote:
> >
> > One way to measure the efficiency of memory reclaim is to look at the
> > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > not updated consistently at the system level and the ratio of these are
> > not very meaningful. The pgsteal and pgscan are updated for only global
> > reclaim while pgrefill gets updated for global as well as cgroup
> > reclaim.
> >
>
> Hi Shakeel,
>
> We always use pgscan and pgsteal for monitoring the system level
> memory pressure, for example, by using sysstat(sar) or some other
> monitor tools.

Don't you need pgrefill in addition to pgscan and pgsteal to get the
full picture of the reclaim activity?

> But with this change, these two counters include the memcg pressure as
> well. It is not easy to know whether the pgscan and pgsteal are caused
> by system level pressure or only some specific memcgs reaching their
> memory limit.
>
> How about adding  cgroup_reclaim() to pgrefill as well ?
>

I am looking for all the reclaim activity on the system. Adding
!cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.

> > Please note that this difference is only for system level vmstats. The
> > cgroup stats returned by memory.stat are actually consistent. The
> > cgroup's pgsteal contains number of reclaimed pages for global as well
> > as cgroup reclaim. So, one way to get the system level stats is to get
> > these stats from root's memory.stat but root does not expose that
> > interface. Also for !CONFIG_MEMCG machines /proc/vmstat is the only way
> > to get these stats. So, make these stats consistent.
> >
> > Signed-off-by: Shakeel Butt 
> > ---
> >  mm/vmscan.c | 6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc555903a332..51f7d1efc912 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1943,8 +1943,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> > lruvec *lruvec,
> > reclaim_stat->recent_scanned[file] += nr_taken;
> >
> > item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> > -   if (!cgroup_reclaim(sc))
> > -   __count_vm_events(item, nr_scanned);
> > +   __count_vm_events(item, nr_scanned);
> > __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> > spin_unlock_irq(&pgdat->lru_lock);
> >
> > @@ -1957,8 +1956,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> > lruvec *lruvec,
> > spin_lock_irq(&pgdat->lru_lock);
> >
> > item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> > -   if (!cgroup_reclaim(sc))
> > -   __count_vm_events(item, nr_reclaimed);
> > +   __count_vm_events(item, nr_reclaimed);
> > __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> > reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
> > reclaim_stat->recent_rotated[1] += stat.nr_activate[1];
> > --
> > 2.26.2.526.g744177e7f7-goog
> >
> >
>
>
> --
> Thanks
> Yafang


Re: [PATCH] mm: vmscan: consistent update to pgsteal and pgscan

2020-05-08 Thread Shakeel Butt
On Fri, May 8, 2020 at 6:38 AM Johannes Weiner  wrote:
>
> On Fri, May 08, 2020 at 06:25:14AM -0700, Shakeel Butt wrote:
> > On Fri, May 8, 2020 at 3:34 AM Yafang Shao  wrote:
> > >
> > > On Fri, May 8, 2020 at 4:49 AM Shakeel Butt  wrote:
> > > >
> > > > One way to measure the efficiency of memory reclaim is to look at the
> > > > ratio (pgscan+pfrefill)/pgsteal. However at the moment these stats are
> > > > not updated consistently at the system level and the ratio of these are
> > > > not very meaningful. The pgsteal and pgscan are updated for only global
> > > > reclaim while pgrefill gets updated for global as well as cgroup
> > > > reclaim.
> > > >
> > >
> > > Hi Shakeel,
> > >
> > > We always use pgscan and pgsteal for monitoring the system level
> > > memory pressure, for example, by using sysstat(sar) or some other
> > > monitor tools.
>
> I'm in the same boat. It's useful to have activity that happens purely
> due to machine capacity rather than localized activity that happens
> due to the limits throughout the cgroup tree.
>
> > Don't you need pgrefill in addition to pgscan and pgsteal to get the
> > full picture of the reclaim activity?
>
> I actually almost never look at pgrefill.
>

Nowadays we are looking at the reclaim cost on highly utilized
machines/devices and have noticed that the rmap walk takes more than
60-70% of the CPU cost of the reclaim. The kernel does rmap walks in
shrink_active_list() and shrink_page_list(), and pgscan and pgrefill
are good approximations of the number of rmap walks during a reclaim.

> > > But with this change, these two counters include the memcg pressure as
> > > well. It is not easy to know whether the pgscan and pgsteal are caused
> > > by system level pressure or only some specific memcgs reaching their
> > > memory limit.
> > >
> > > How about adding  cgroup_reclaim() to pgrefill as well ?
> > >
> >
> > I am looking for all the reclaim activity on the system. Adding
> > !cgroup_reclaim to pgrefill will skip the cgroup reclaim activity.
> > Maybe adding pgsteal_cgroup and pgscan_cgroup would be better.
>
> How would you feel about adding memory.stat at the root cgroup level?
>

Actually I would prefer adding memory.stat at the root cgroup level
since, as you noted below, a broader range of use-cases would benefit
from it.

> There are subtle differences between /proc/vmstat and memory.stat, and
> cgroup-aware code that wants to watch the full hierarchy currently has
> to know about these intricacies and translate semantics back and forth.
>
> Generally having the fully recursive memory.stat at the root level
> could help a broader range of usecases.

Thanks for the feedback. I will send the patch with the additional motivation.


[PATCH] memcg: expose root cgroup's memory.stat

2020-05-08 Thread Shakeel Butt
One way to measure the efficiency of memory reclaim is to look at the
ratio (pgscan+pgrefill)/pgsteal. However at the moment these stats are
not updated consistently at the system level and the ratio of these is
not very meaningful. The pgsteal and pgscan are updated only for global
reclaim while pgrefill gets updated for global as well as cgroup
reclaim.

Please note that this difference is only for system level vmstats. The
cgroup stats returned by memory.stat are actually consistent. The
cgroup's pgsteal contains number of reclaimed pages for global as well
as cgroup reclaim. So, one way to get the system level stats is to get
these stats from root's memory.stat, so, expose memory.stat for the root
cgroup.

from Johannes Weiner:
There are subtle differences between /proc/vmstat and
memory.stat, and cgroup-aware code that wants to watch the full
hierarchy currently has to know about these intricacies and
translate semantics back and forth.

Generally having the fully recursive memory.stat at the root
level could help a broader range of usecases.

Signed-off-by: Shakeel Butt 
Suggested-by: Johannes Weiner 
---
 mm/memcontrol.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05dcb72314b5..c300d52c07a5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6230,7 +6230,6 @@ static struct cftype memory_files[] = {
},
{
.name = "stat",
-   .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = memory_stat_show,
},
{
-- 
2.26.2.645.ge9eca65c58-goog



Re: [PATCH] net/packet: simply allocations in alloc_one_pg_vec_page

2020-05-16 Thread Shakeel Butt
On Sat, May 16, 2020 at 1:40 PM David Miller  wrote:
>
> From: Shakeel Butt 
> Date: Fri, 15 May 2020 19:17:36 -0700
>
> > and thus there is no need to have any fallback after vzalloc.
>
> This statement is false.
>
> The virtual mapping allocation or the page table allocations can fail.
>
> A fallback is therefore indeed necessary.

I am assuming that you at least agree that vzalloc should only be
called for non-zero order allocations. So, my argument is that if a
non-zero order vzalloc has failed (the allocations internal to vzalloc,
including the virtual mapping and page table allocations, are order 0
and use GFP_KERNEL, i.e. they trigger reclaim and the oom-killer), then
the next non-zero order page allocation has a very low chance of
succeeding.
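
For context, the current fallback chain in alloc_one_pg_vec_page() looks
roughly like this (paraphrased from net/packet/af_packet.c, not the
proposed patch):

	static char *alloc_one_pg_vec_page(unsigned long order)
	{
		char *buffer;
		gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
				  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;

		buffer = (char *) __get_free_pages(gfp_flags, order);
		if (buffer)
			return buffer;

		/* __get_free_pages failed, fall back on vmalloc */
		buffer = vzalloc(array_size((1 << order), PAGE_SIZE));
		if (buffer)
			return buffer;

		/* vmalloc failed, try the buddy allocator again, harder */
		gfp_flags &= ~__GFP_NORETRY;
		buffer = (char *) __get_free_pages(gfp_flags, order);
		if (buffer)
			return buffer;

		return NULL;
	}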


Re: [PATCH] net/packet: simply allocations in alloc_one_pg_vec_page

2020-05-16 Thread Shakeel Butt
On Sat, May 16, 2020 at 3:45 PM Eric Dumazet  wrote:
>
> On Sat, May 16, 2020 at 3:35 PM Shakeel Butt  wrote:
> >
> > On Sat, May 16, 2020 at 1:40 PM David Miller  wrote:
> > >
> > > From: Shakeel Butt 
> > > Date: Fri, 15 May 2020 19:17:36 -0700
> > >
> > > > and thus there is no need to have any fallback after vzalloc.
> > >
> > > This statement is false.
> > >
> > > The virtual mapping allocation or the page table allocations can fail.
> > >
> > > A fallback is therefore indeed necessary.
> >
> > I am assuming that you at least agree that vzalloc should only be
> > called for non-zero order allocations. So, my argument is if non-zero
> > order vzalloc has failed (allocations internal to vzalloc, including
> > virtual mapping allocation and page table allocations, are order 0 and
> > use GFP_KERNEL i.e. triggering reclaim and oom-killer) then the next
> > non-zero order page allocation has very low chance of succeeding.
>
>
> 32bit kernels might have exhausted their vmalloc space, yet they can
> still allocate order-0 pages.

Oh ok it makes sense.


Re: [PATCH] net/packet: simply allocations in alloc_one_pg_vec_page

2020-05-16 Thread Shakeel Butt
On Sat, May 16, 2020 at 4:39 PM David Miller  wrote:
>
> From: Shakeel Butt 
> Date: Sat, 16 May 2020 15:35:46 -0700
>
> > So, my argument is if non-zero order vzalloc has failed (allocations
> > internal to vzalloc, including virtual mapping allocation and page
> > table allocations, are order 0 and use GFP_KERNEL i.e. triggering
> > reclaim and oom-killer) then the next non-zero order page allocation
> > has very low chance of succeeding.
>
> Also not true.
>
> Page table allocation strategies and limits vary by architecture, they
> may even need virtual mappings themselves.  So they can fail in situations
> where a non-zero GFP_KERNEL page allocator allocation would succeed.

Thanks for the explanation. Do you think calling vzalloc only for
non-zero order here has any value?


Re: [PATCH] mm/z3fold.c: Fix z3fold_destroy_pool() ordering

2019-07-26 Thread Shakeel Butt
On Fri, Jul 26, 2019 at 3:48 PM Henry Burns  wrote:
>
> The constraint from the zpool use of z3fold_destroy_pool() is there are no
> outstanding handles to memory (so no active allocations), but it is possible
> for there to be outstanding work on either of the two wqs in the pool.
>
> If there is work queued on pool->compact_workqueue when it is called,
> z3fold_destroy_pool() will do:
>
>z3fold_destroy_pool()
>  destroy_workqueue(pool->release_wq)
>  destroy_workqueue(pool->compact_wq)
>drain_workqueue(pool->compact_wq)
>  do_compact_page(zhdr)
>kref_put(&zhdr->refcount)
>  __release_z3fold_page(zhdr, ...)
>queue_work_on(pool->release_wq, &zhdr->work) *BOOM*
>
> So compact_wq needs to be destroyed before release_wq.
>
> Fixes: 5d03a6613957 ("mm/z3fold.c: use kref to prevent page free/compact 
> race")
>
> Signed-off-by: Henry Burns 

Reviewed-by: Shakeel Butt 

> Cc: 
> ---
>  mm/z3fold.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index 1a029a7432ee..43de92f52961 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -818,8 +818,15 @@ static void z3fold_destroy_pool(struct z3fold_pool *pool)
>  {
> kmem_cache_destroy(pool->c_handle);
> z3fold_unregister_migration(pool);
> -   destroy_workqueue(pool->release_wq);
> +
> +   /*
> +* We need to destroy pool->compact_wq before pool->release_wq,
> +* as any pending work on pool->compact_wq will call
> +* queue_work(pool->release_wq, &zhdr->work).
> +*/
> +
> destroy_workqueue(pool->compact_wq);
> +   destroy_workqueue(pool->release_wq);
> kfree(pool);
>  }
>
> --
> 2.22.0.709.g102302147b-goog
>


Re: [PATCH] mm/z3fold.c: Fix z3fold_destroy_pool() race condition

2019-07-26 Thread Shakeel Butt
On Fri, Jul 26, 2019 at 3:48 PM Henry Burns  wrote:
>
> The constraint from the zpool use of z3fold_destroy_pool() is there are no
> outstanding handles to memory (so no active allocations), but it is possible
> for there to be outstanding work on either of the two wqs in the pool.
>
> Calling z3fold_deregister_migration() before the workqueues are drained
> means that there can be allocated pages referencing a freed inode,
> causing any thread in compaction to be able to trip over the bad
> pointer in PageMovable().
>
> Fixes: 1f862989b04a ("mm/z3fold.c: support page migration")
>
> Signed-off-by: Henry Burns 

Reviewed-by: Shakeel Butt 

> Cc: 
> ---
>  mm/z3fold.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/z3fold.c b/mm/z3fold.c
> index 43de92f52961..ed19d98c9dcd 100644
> --- a/mm/z3fold.c
> +++ b/mm/z3fold.c
> @@ -817,16 +817,19 @@ static struct z3fold_pool *z3fold_create_pool(const 
> char *name, gfp_t gfp,
>  static void z3fold_destroy_pool(struct z3fold_pool *pool)
>  {
> kmem_cache_destroy(pool->c_handle);
> -   z3fold_unregister_migration(pool);
>
> /*
>  * We need to destroy pool->compact_wq before pool->release_wq,
>  * as any pending work on pool->compact_wq will call
>  * queue_work(pool->release_wq, &zhdr->work).
> +*
> +* There are still outstanding pages until both workqueues are 
> drained,
> +* so we cannot unregister migration until then.
>  */
>
> destroy_workqueue(pool->compact_wq);
> destroy_workqueue(pool->release_wq);
> +   z3fold_unregister_migration(pool);
> kfree(pool);
>  }
>
> --
> 2.22.0.709.g102302147b-goog
>


[PATCH] mm: memcontrol: account kernel stack per node

2020-06-29 Thread Shakeel Butt
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item.

Signed-off-by: Shakeel Butt 
---
 drivers/base/node.c|  4 ++--
 fs/proc/meminfo.c  |  4 ++--
 include/linux/memcontrol.h |  2 --
 include/linux/mmzone.h |  8 
 kernel/fork.c  | 29 ++---
 kernel/scs.c   |  2 +-
 mm/memcontrol.c|  2 +-
 mm/page_alloc.c| 16 
 mm/vmstat.c|  8 
 9 files changed, 32 insertions(+), 43 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0cf13e31603c..508b80f6329b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -440,9 +440,9 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
   nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
   nid, K(i.sharedram),
-  nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
+  nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
 #ifdef CONFIG_SHADOW_CALL_STACK
-  nid, sum_zone_node_page_state(nid, NR_KERNEL_SCS_KB),
+  nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
 #endif
   nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
   nid, 0UL,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index f262bff3ca31..887a5532e449 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,10 +101,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "SReclaimable:   ", sreclaimable);
show_val_kb(m, "SUnreclaim: ", sunreclaim);
seq_printf(m, "KernelStack:%8lu kB\n",
-  global_zone_page_state(NR_KERNEL_STACK_KB));
+  global_node_page_state(NR_KERNEL_STACK_KB));
 #ifdef CONFIG_SHADOW_CALL_STACK
seq_printf(m, "ShadowCallStack:%8lu kB\n",
-  global_zone_page_state(NR_KERNEL_SCS_KB));
+  global_node_page_state(NR_KERNEL_SCS_KB));
 #endif
show_val_kb(m, "PageTables: ",
global_zone_page_state(NR_PAGETABLE));
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ba1e42715ecf..a3ddb236898e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -33,8 +33,6 @@ enum memcg_stat_item {
MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
MEMCG_PERCPU_B,
-   /* XXX: why are these zone and not node counters? */
-   MEMCG_KERNEL_STACK_KB,
MEMCG_NR_STAT,
 };
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e859444927a..b79f73ce8b57 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -153,10 +153,6 @@ enum zone_stat_item {
NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages 
*/
NR_MLOCK,   /* mlock()ed pages found and moved off LRU */
NR_PAGETABLE,   /* used for pagetables */
-   NR_KERNEL_STACK_KB, /* measured in KiB */
-#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
-   NR_KERNEL_SCS_KB,   /* measured in KiB */
-#endif
/* Second 128 byte cacheline */
NR_BOUNCE,
 #if IS_ENABLED(CONFIG_ZSMALLOC)
@@ -201,6 +197,10 @@ enum node_stat_item {
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED,   /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED,   /* pages returned via unpin_user_page() */
+   NR_KERNEL_STACK_KB, /* measured in KiB */
+#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
+   NR_KERNEL_SCS_KB,   /* measured in KiB */
+#endif
NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 73fdfa9674b5..ee5393350ef7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -278,7 +278,7 @@ static inline void free_thread_stack(struct task_struct 
*tsk)
 
for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
mod_memcg_page_state(vm->pages[i],
-MEMCG_KERNEL_STACK_KB,
+NR_KERNEL_STACK_KB,
 -(int)(PAGE_SIZE / 1024));
 
memcg_kmem_uncharge_page(vm->pages[i], 0);
@@ -381,32 +381,23 @@ static void account_kernel_stack(struct task_struct *tsk, 
int account)
 {
void *stack = task_stack_page(tsk);
struct vm_struct *vm = task_stack_vm_area(tsk);
+   struct page *page;
 
BUILD_BUG_ON(IS_ENABLED(CONFIG_VMAP_STACK) && PAGE_SIZE % 1024 != 0);
 
if 

[RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-02 Thread Shakeel Butt
This is a proposal to expose an interface to the user space to trigger
memory reclaim on a memory cgroup. The proposal contains potential use
cases, benefits of the user space interface and potential implementation
choices.

Use cases:
--

1) Per-memcg uswapd:

Usually applications consist of a combination of latency sensitive and
latency tolerant tasks. For example, tasks serving user requests vs
tasks doing data backup for a database application. At the moment the
kernel does not differentiate between such tasks when the application
hits the memcg limits. So, potentially a latency sensitive user facing
task can get stuck in memory reclaim and be throttled by the kernel.

This problem has been discussed before [1, 2].

One way to resolve this issue is to preemptively trigger the memory
reclaim from a latency tolerant task (uswapd) when the application is
near the limits. (Please note that detecting the 'near the limits'
situation is an orthogonal problem; we are exploring whether per-memcg
MemAvailable notifications can be useful [3].)

2) Proactive reclaim:

This is similar to the previous use-case, the difference being that
instead of waiting for the application to get near its limit to trigger
memory reclaim, the memcg is continuously pressured to reclaim a small
amount of memory. This gives a more accurate and up-to-date workingset
estimation as the LRUs are continuously sorted and can potentially
provide more deterministic memory overcommit behavior. The memory
overcommit controller can respond proactively to the changing workload
of the running applications instead of being reactive.

Benefit of user space solution:
---

1) More flexible on who should be charged for the cpu of the memory
reclaim. For proactive reclaim, it makes more sense to centralize the
overhead while for uswapd, it makes more sense for the application to
pay for the cpu of the memory reclaim.

2) More flexible on dedicating the resources (like cpu). The memory
overcommit controller can balance the cost between the cpu usage and
the memory reclaimed.

3) Provides a way to the applications to keep their LRUs sorted, so,
under memory pressure better reclaim candidates are selected.

Interface options:
--

1) memcg interface e.g. 'echo 10M > memory.reclaim'

+ simple
+ can be extended to target specific type of memory (anon, file, kmem).
- most probably restricted to cgroup v2.

2) fadvise(PAGEOUT) on cgroup_dir_fd

+ more general and applicable to other FSes (actually we are using
something similar for tmpfs).
+ can be extended in future to just age the LRUs instead of reclaim or
some new use cases.

[Or maybe a new fadvise2() syscall which can take FS specific options.]
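
A userspace sketch of how option 2 could be used (the advice value below
is a placeholder for the proposed PAGEOUT advice, not an existing uapi
constant, and the cgroup path is just an example):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	#define FADV_PAGEOUT_PROPOSED	100	/* placeholder value */

	int main(void)
	{
		/* reclaim ~10MiB from the memcg backing this cgroup dir */
		int fd = open("/sys/fs/cgroup/my-job", O_RDONLY | O_DIRECTORY);
		int ret;

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* offset is unused; len is the amount of memory to reclaim */
		ret = posix_fadvise(fd, 0, 10 << 20, FADV_PAGEOUT_PROPOSED);
		if (ret)
			fprintf(stderr, "fadvise failed: %d\n", ret);
		close(fd);
		return 0;
	}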

[1] https://lwn.net/Articles/753162/
[2] http://lkml.kernel.org/r/20200219181219.54356-1-han...@cmpxchg.org
[3] 
http://lkml.kernel.org/r/alpine.deb.2.22.394.2006281445210.855...@chino.kir.corp.google.com

The following patch is my attempt to implement option 2. Please ignore
the fine details as I am more interested in getting feedback on the
proposal and the interface options.

Signed-off-by: Shakeel Butt 
---
 fs/kernfs/dir.c | 20 +++
 include/linux/cgroup-defs.h |  2 ++
 include/linux/kernfs.h  |  2 ++
 include/uapi/linux/fadvise.h|  1 +
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup-v1.c   |  1 +
 kernel/cgroup/cgroup.c  | 43 +
 mm/memcontrol.c | 20 +++
 8 files changed, 91 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..96b3b67f3a85 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1698,9 +1698,29 @@ static int kernfs_fop_readdir(struct file *file, struct 
dir_context *ctx)
return 0;
 }
 
+static int kernfs_dir_fadvise(struct file *file, loff_t offset, loff_t len,
+ int advise)
+{
+   struct kernfs_node *kn  = kernfs_dentry_node(file->f_path.dentry);
+   struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
+   int ret;
+
+   if (!scops || !scops->fadvise)
+   return -EPERM;
+
+   if (!kernfs_get_active(kn))
+   return -ENODEV;
+
+   ret = scops->fadvise(kn, offset, len, advise);
+
+   kernfs_put_active(kn);
+   return ret;
+}
+
 const struct file_operations kernfs_dir_fops = {
.read   = generic_read_dir,
.iterate_shared = kernfs_fop_readdir,
.release= kernfs_dir_fop_release,
.llseek = generic_file_llseek,
+   .fadvise= kernfs_dir_fadvise,
 };
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..cbe46634875e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -628,6 +628,8 @@ struct cgroup_subsys {
void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
int (*css_extra_st

Re: BUG: Bad page state in process - page dumped because: page still charged to cgroup

2020-07-02 Thread Shakeel Butt
On Wed, Jul 1, 2020 at 11:46 AM Roman Gushchin  wrote:
...
> --
>
> From c97afecd32c0db5e024be9ba72f43d22974f5bcd Mon Sep 17 00:00:00 2001
> From: Roman Gushchin 
> Date: Wed, 1 Jul 2020 11:05:32 -0700
> Subject: [PATCH] mm: kmem: make memcg_kmem_enabled() irreversible
>
> Historically the kernel memory accounting was an opt-in feature, which
> could be enabled for individual cgroups. But now it's not true, and
> it's on by default both on cgroup v1 and cgroup v2.  And as long as a
> user has at least one non-root memory cgroup, the kernel memory
> accounting is on. So in most setups it's either always on (if memory
> cgroups are in use and kmem accounting is not disabled), either always
> off (otherwise).
>
> memcg_kmem_enabled() is used in many places to guard the kernel memory
> accounting code. If memcg_kmem_enabled() can reverse from returning
> true to returning false (as now), we can't rely on it on release paths
> and have to check if it was on before.
>
> If we'll make memcg_kmem_enabled() irreversible (always returning true
> after returning it for the first time), it'll make the general logic
> more simple and robust. It also will allow to guard some checks which
> otherwise would stay unguarded.
>
> Signed-off-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-03 Thread Shakeel Butt
On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
>
> On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> [...]
> > Interface options:
> > --
> >
> > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> >
> > + simple
> > + can be extended to target specific type of memory (anon, file, kmem).
> > - most probably restricted to cgroup v2.
> >
> > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> >
> > + more general and applicable to other FSes (actually we are using
> > something similar for tmpfs).
> > + can be extended in future to just age the LRUs instead of reclaim or
> > some new use cases.
>
> Could you explain why memory.high as an interface to trigger pro-active
> memory reclaim is not sufficient. Also memory.low limit to protect
> latency sensitve workloads?

Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
but note that it can also introduce stalls in the application running
in that memcg. Let's suppose the memory.current of a memcg is 100MiB
and we want to reclaim 20MiB from it: we can set memory.high to 80MiB,
but then any allocation attempt from the application running in that
memcg can get stalled/throttled. I want the functionality of the
reclaim without potential stalls.

The memory.min is for protection against the global reclaim and is
unrelated to this discussion.


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-03 Thread Shakeel Butt
On Fri, Jul 3, 2020 at 8:50 AM Roman Gushchin  wrote:
>
> On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote:
> > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > >
> > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > [...]
> > > > Interface options:
> > > > --
> > > >
> > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > >
> > > > + simple
> > > > + can be extended to target specific type of memory (anon, file, kmem).
> > > > - most probably restricted to cgroup v2.
> > > >
> > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > >
> > > > + more general and applicable to other FSes (actually we are using
> > > > something similar for tmpfs).
> > > > + can be extended in future to just age the LRUs instead of reclaim or
> > > > some new use cases.
> > >
> > > Could you explain why memory.high as an interface to trigger pro-active
> > > memory reclaim is not sufficient. Also memory.low limit to protect
> > > latency sensitve workloads?
>
> I initially liked the proposal, but after some thoughts I've realized
> that I don't know a good use case where memory.high is less useful.
> Shakeel, what's the typical use case you thinking of?
> Who and how will use the new interface?
>
> >
> > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > but note that it can also introduce stalls in the application running
> > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > and we want to reclaim 20MiB from it, we can set the memory.high to
> > 80MiB but any allocation attempt from the application running in that
> > memcg can get stalled/throttled. I want the functionality of the
> > reclaim without potential stalls.
>
> But reclaiming some pagecache/swapping out anon pages can always
> generate some stalls caused by pagefaults, no?
>

Thanks for looking into the proposal. Let me answer both of your
questions together. I have added the two use-cases but let me explain
the proactive reclaim a bit more as we actually use that in our
production.

We have defined tolerable refault rates for the applications based on
their type (latency sensitive or not). Proactive reclaim is triggered
for an application based on its current refault rate and usage. If the
current refault rate exceeds the tolerable refault rate then the
proactive reclaim is stopped or slowed down.

For the second question, yes, each individual refault can induce a
stall as well, but we have more control over that stall compared to
stalls due to reclaim. For us almost all the reclaimable memory is
anon and we use compression based swap, so the cost of each refault
is fixed at a couple of microseconds.

I think the next question is what about the refaults from disk or
source with highly variable cost. Usually the latency sensitive
applications remove such uncertainty by mlocking the pages backed by
such backends (e.g. mlocking the executable) or at least that is the
case for us.
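
As a rough illustration of the refault-driven policy described above
(paths, thresholds and the 'memory.reclaim' file are examples tied to
interface option 1 of the proposal, not existing kernel ABI):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static unsigned long long read_refaults(const char *memcg)
	{
		char path[256], key[64];
		unsigned long long val, refaults = 0;
		FILE *f;

		snprintf(path, sizeof(path), "%s/memory.stat", memcg);
		f = fopen(path, "r");
		if (!f)
			return 0;
		while (fscanf(f, "%63s %llu", key, &val) == 2)
			if (!strcmp(key, "workingset_refault"))
				refaults = val;
		fclose(f);
		return refaults;
	}

	int main(void)
	{
		const char *memcg = "/sys/fs/cgroup/my-job";
		const double tolerable_rate = 1000.0;	/* refaults/sec, example */
		unsigned long long prev = read_refaults(memcg);
		char path[256];

		snprintf(path, sizeof(path), "%s/memory.reclaim", memcg);
		for (;;) {
			sleep(10);
			unsigned long long cur = read_refaults(memcg);
			double rate = (cur - prev) / 10.0;
			FILE *f;

			prev = cur;
			if (rate > tolerable_rate)
				continue;	/* back off, workload is refaulting */
			f = fopen(path, "w");	/* proposed interface */
			if (f) {
				fputs("16M\n", f);
				fclose(f);
			}
		}
		return 0;
	}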

Thanks,
Shakeel


Re: [PATCH v2 1/2] mm, memcg: reclaim more aggressively before high allocator throttling

2020-07-13 Thread Shakeel Butt
On Mon, Jul 13, 2020 at 4:42 AM Chris Down  wrote:
>
> In Facebook production, we've seen cases where cgroups have been put
> into allocator throttling even when they appear to have a lot of slack
> file caches which should be trivially reclaimable.
>
> Looking more closely, the problem is that we only try a single cgroup
> reclaim walk for each return to usermode before calculating whether or
> not we should throttle. This single attempt doesn't produce enough
> pressure to shrink for cgroups with a rapidly growing amount of file
> caches prior to entering allocator throttling.
>
> As an example, we see that threads in an affected cgroup are stuck in
> allocator throttling:
>
> # for i in $(cat cgroup.threads); do
> > grep over_high "/proc/$i/stack"
> > done
> [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> [<0>] mem_cgroup_handle_over_high+0x10b/0x150
> [<0>] mem_cgroup_handle_over_high+0x10b/0x150
>
> ...however, there is no I/O pressure reported by PSI, despite a lot of
> slack file pages:
>
> # cat memory.pressure
> some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
> full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
> # cat io.pressure
> some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
> full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
> # grep _file memory.stat
> inactive_file 1370939392
> active_file 661635072
>
> This patch changes the behaviour to retry reclaim either until the
> current task goes below the 10ms grace period, or we are making no
> reclaim progress at all. In the latter case, we enter reclaim throttling
> as before.
>
> To a user, there's no intuitive reason for the reclaim behaviour to
> differ from hitting memory.high as part of a new allocation, as opposed
> to hitting memory.high because someone lowered its value. As such this
> also brings an added benefit: it unifies the reclaim behaviour between
> the two.
>
> There's precedent for this behaviour: we already do reclaim retries when
> writing to memory.{high,max}, in max reclaim, and in the page allocator
> itself.
>
> Signed-off-by: Chris Down 
> Cc: Andrew Morton 
> Cc: Johannes Weiner 
> Cc: Tejun Heo 
> Cc: Michal Hocko 

Reviewed-by: Shakeel Butt 


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-07 Thread Shakeel Butt
On Mon, Jul 6, 2020 at 2:38 PM Roman Gushchin  wrote:
>
> On Fri, Jul 03, 2020 at 09:27:19AM -0700, Shakeel Butt wrote:
> > On Fri, Jul 3, 2020 at 8:50 AM Roman Gushchin  wrote:
> > >
> > > On Fri, Jul 03, 2020 at 07:23:14AM -0700, Shakeel Butt wrote:
> > > > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > > > >
> > > > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > > > [...]
> > > > > > Interface options:
> > > > > > --
> > > > > >
> > > > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > > > >
> > > > > > + simple
> > > > > > + can be extended to target specific type of memory (anon, file, 
> > > > > > kmem).
> > > > > > - most probably restricted to cgroup v2.
> > > > > >
> > > > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > > > >
> > > > > > + more general and applicable to other FSes (actually we are using
> > > > > > something similar for tmpfs).
> > > > > > + can be extended in future to just age the LRUs instead of reclaim 
> > > > > > or
> > > > > > some new use cases.
> > > > >
> > > > > Could you explain why memory.high as an interface to trigger 
> > > > > pro-active
> > > > > memory reclaim is not sufficient. Also memory.low limit to protect
> > > > > latency sensitve workloads?
> > >
> > > I initially liked the proposal, but after some thoughts I've realized
> > > that I don't know a good use case where memory.high is less useful.
> > > Shakeel, what's the typical use case you thinking of?
> > > Who and how will use the new interface?
> > >
> > > >
> > > > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > > > but note that it can also introduce stalls in the application running
> > > > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > > > and we want to reclaim 20MiB from it, we can set the memory.high to
> > > > 80MiB but any allocation attempt from the application running in that
> > > > memcg can get stalled/throttled. I want the functionality of the
> > > > reclaim without potential stalls.
> > >
> > > But reclaiming some pagecache/swapping out anon pages can always
> > > generate some stalls caused by pagefaults, no?
> > >
> >
> > Thanks for looking into the proposal. Let me answer both of your
> > questions together. I have added the two use-cases but let me explain
> > the proactive reclaim a bit more as we actually use that in our
> > production.
> >
> > We have defined tolerable refault rates for the applications based on
> > their type (latency sensitive or not). Proactive reclaim is triggered
> > in the application based on their current refault rates and usage. If
> > the current refault rate exceeds the tolerable refault rate then
> > stop/slowdown the proactive reclaim.
> >
> > For the second question, yes, each individual refault can induce the
> > stall as well but we have more control on that stall as compared to
> > stalls due to reclaim. For us almost all the reclaimable memory is
> > anon and we use compression based swap, so, the cost of each refault
> > is fixed and a couple of microseconds.
> >
> > I think the next question is what about the refaults from disk or
> > source with highly variable cost. Usually the latency sensitive
> > applications remove such uncertainty by mlocking the pages backed by
> > such backends (e.g. mlocking the executable) or at least that is the
> > case for us.
>
> Got it.
>
> It feels like you're suggesting something similar to memory.high with
> something similar to a different gfp flags. In other words, the
> difference is only which pages can be reclaimed and which not. I don't
> have a definitive answer here, but I wonder if we can somehow
> generalize the existing interface? E.g. if the problem is with artificially
> induced delays, we can have a config option/sysctl/sysfs knob/something else
> which would disable it. Otherwise we risk ending up with many different kinds
> of soft memory limits.
>

It is possible to achieve this functionality with memory.high plus some
config options/sysctls etc. as you suggested, but that can bloat and
complicate the memory.high interface.

I understand your concern with different kinds of soft memory limits
but I see this as a simple interface (i.e. just trigger reclaim) that
gives users the flexibility to define and (soft) enforce their own
virtual limits on their jobs. More specifically this interface allows
reclaiming from a job to keep the usage below some virtual limit which
can correspond to some user defined metric. In my proactive reclaim
example, the user defined metric is refault rates. Keep the usage of
the job at a level where the refault rates are tolerable.


Re: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface

2020-07-07 Thread Shakeel Butt
On Tue, Jul 7, 2020 at 5:14 AM Michal Hocko  wrote:
>
> On Fri 03-07-20 07:23:14, Shakeel Butt wrote:
> > On Thu, Jul 2, 2020 at 11:35 PM Michal Hocko  wrote:
> > >
> > > On Thu 02-07-20 08:22:22, Shakeel Butt wrote:
> > > [...]
> > > > Interface options:
> > > > --
> > > >
> > > > 1) memcg interface e.g. 'echo 10M > memory.reclaim'
> > > >
> > > > + simple
> > > > + can be extended to target specific type of memory (anon, file, kmem).
> > > > - most probably restricted to cgroup v2.
> > > >
> > > > 2) fadvise(PAGEOUT) on cgroup_dir_fd
> > > >
> > > > + more general and applicable to other FSes (actually we are using
> > > > something similar for tmpfs).
> > > > + can be extended in future to just age the LRUs instead of reclaim or
> > > > some new use cases.
> > >
> > > Could you explain why memory.high as an interface to trigger pro-active
> > > memory reclaim is not sufficient. Also memory.low limit to protect
> > > latency sensitve workloads?
> >
> > Yes, we can use memory.high to trigger [proactive] reclaim in a memcg
> > but note that it can also introduce stalls in the application running
> > in that memcg. Let's suppose the memory.current of a memcg is 100MiB
> > and we want to reclaim 20MiB from it, we can set the memory.high to
> > 80MiB but any allocation attempt from the application running in that
> > memcg can get stalled/throttled. I want the functionality of the
> > reclaim without potential stalls.
>
> It would be great if the proposal mention this limitation.
>

Will do in the next version.


> > The memory.min is for protection against the global reclaim and is
> > unrelated to this discussion.
>
> Well, I was talkingg about memory.low. It is not meant only to protect
> from the global reclaim. It can be used for balancing memory reclaim
> from _any_ external memory pressure source. So it is somehow related to
> the usecase you have mentioned.
>

For the uswapd use-case, I am not concerned about the external memory
pressure source but the application hitting its own memory.high limit
and getting throttled.

> What you consider a latency sensitive workload could be protected from
> directly induced reclaim latencies. You could use low events to learn
> about the external memory pressure and update your protection to allow
> for some reclaim. I do understand that this wouldn't solve your problem
> who gets reclaimed and maybe that is the crux on why it is not
> applicable but that should really be mentioned explicitly.
>

The main aim of the proactive reclaim is to not cause external memory
pressure. The low events can be another source of information about the
system level situation for the 'Memory Overcommit Controller'. So, I see
the low events as complementary to, not a replacement for, the reclaim
interface.

BTW by "low events from external memory pressure" am I correct in
understanding that you meant an unrelated job reclaiming and
triggering low events on a job of interest. Or do you mean to
partition a job into sub-jobs and then use the low events between
these sub-jobs somehow?


Re: [PATCH 1/3] mm: memcg/slab: remove unused argument by charge_slab_page()

2020-07-07 Thread Shakeel Butt
On Tue, Jul 7, 2020 at 10:36 AM Roman Gushchin  wrote:
>
> charge_slab_page() is not using the gfp argument anymore,
> remove it.
>
> Signed-off-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [PATCH 2/3] mm: slab: rename (un)charge_slab_page() to (un)account_slab_page()

2020-07-07 Thread Shakeel Butt
On Tue, Jul 7, 2020 at 10:36 AM Roman Gushchin  wrote:
>
> charge_slab_page() and uncharge_slab_page() are not related anymore
> to memcg charging and uncharging. In order to make their names
> less confusing, let's rename them to account_slab_page() and
> unaccount_slab_page() respectively.
>
> Signed-off-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [PATCH 3/3] mm: kmem: switch to static_branch_likely() in memcg_kmem_enabled()

2020-07-07 Thread Shakeel Butt
On Tue, Jul 7, 2020 at 10:36 AM Roman Gushchin  wrote:
>
> Currently memcg_kmem_enabled() is optimized for the kernel memory
> accounting being off. It was so for a long time, and arguably the
> reason behind was that the kernel memory accounting was initially an
> opt-in feature. However, now it's on by default on both cgroup v1
> and cgroup v2, and it's on for all cgroups. So let's switch over
> to static_branch_likely() to reflect this fact.
>
> It's unlikely that there is a significant performance difference, as
> the cost of a memory allocation and its accounting significantly
> exceeds the cost of a jump. However, the conversion makes the code
> look more logical.
>
> Signed-off-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [PATCH] mm: memcontrol: account kernel stack per node

2020-06-30 Thread Shakeel Butt
On Mon, Jun 29, 2020 at 8:24 PM Roman Gushchin  wrote:
>
> On Mon, Jun 29, 2020 at 05:44:13PM -0700, Shakeel Butt wrote:
> > Currently the kernel stack is being accounted per-zone. There is no need
> > to do that. In addition due to being per-zone, memcg has to keep a
> > separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
> > MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
> > node_stat_item.
> >
> > Signed-off-by: Shakeel Butt 
> > ---
> >  drivers/base/node.c|  4 ++--
> >  fs/proc/meminfo.c  |  4 ++--
> >  include/linux/memcontrol.h |  2 --
> >  include/linux/mmzone.h |  8 
> >  kernel/fork.c  | 29 ++---
> >  kernel/scs.c   |  2 +-
> >  mm/memcontrol.c|  2 +-
> >  mm/page_alloc.c| 16 
> >  mm/vmstat.c|  8 
> >  9 files changed, 32 insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 0cf13e31603c..508b80f6329b 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -440,9 +440,9 @@ static ssize_t node_read_meminfo(struct device *dev,
> >  nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
> >  nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
> >  nid, K(i.sharedram),
> > -nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
> > +nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> > -nid, sum_zone_node_page_state(nid, NR_KERNEL_SCS_KB),
> > +nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
> >  #endif
> >  nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
> >  nid, 0UL,
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index f262bff3ca31..887a5532e449 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -101,10 +101,10 @@ static int meminfo_proc_show(struct seq_file *m, void 
> > *v)
> >   show_val_kb(m, "SReclaimable:   ", sreclaimable);
> >   show_val_kb(m, "SUnreclaim: ", sunreclaim);
> >   seq_printf(m, "KernelStack:%8lu kB\n",
> > -global_zone_page_state(NR_KERNEL_STACK_KB));
> > +global_node_page_state(NR_KERNEL_STACK_KB));
> >  #ifdef CONFIG_SHADOW_CALL_STACK
> >   seq_printf(m, "ShadowCallStack:%8lu kB\n",
> > -global_zone_page_state(NR_KERNEL_SCS_KB));
> > +global_node_page_state(NR_KERNEL_SCS_KB));
> >  #endif
> >   show_val_kb(m, "PageTables: ",
> >   global_zone_page_state(NR_PAGETABLE));
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index ba1e42715ecf..a3ddb236898e 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -33,8 +33,6 @@ enum memcg_stat_item {
> >   MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
> >   MEMCG_SOCK,
> >   MEMCG_PERCPU_B,
> > - /* XXX: why are these zone and not node counters? */
> > - MEMCG_KERNEL_STACK_KB,
> >   MEMCG_NR_STAT,
> >  };
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 8e859444927a..b79f73ce8b57 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -153,10 +153,6 @@ enum zone_stat_item {
> >   NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable 
> > pages */
> >   NR_MLOCK,   /* mlock()ed pages found and moved off LRU */
> >   NR_PAGETABLE,   /* used for pagetables */
> > - NR_KERNEL_STACK_KB, /* measured in KiB */
> > -#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
> > - NR_KERNEL_SCS_KB,   /* measured in KiB */
> > -#endif
> >   /* Second 128 byte cacheline */
> >   NR_BOUNCE,
> >  #if IS_ENABLED(CONFIG_ZSMALLOC)
> > @@ -201,6 +197,10 @@ enum node_stat_item {
> >   NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages 
> > */
> >   NR_FOLL_PIN_ACQUIRED,   /* via: pin_user_page(), gup flag: FOLL_PIN */
> >   NR_FOLL_PIN_RELEASED,   /* pages returned via unpin_user_page() */
> > + NR_KERNEL_STACK_KB, /* measured in KiB */
> > +#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
> > + NR_KERNEL_SCS_KB,   /* measured in KiB */
> > +#endif
> > 

[PATCH v2] mm: memcontrol: account kernel stack per node

2020-06-30 Thread Shakeel Butt
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item. In addition localize the kernel stack stats updates to
account_kernel_stack().

Signed-off-by: Shakeel Butt 
---
Changes since v1:
- Use lruvec based stat update functions based on Roman's suggestion.

 drivers/base/node.c|  4 +--
 fs/proc/meminfo.c  |  4 +--
 include/linux/memcontrol.h | 21 ++--
 include/linux/mmzone.h |  8 +++---
 kernel/fork.c  | 51 +++---
 kernel/scs.c   |  2 +-
 mm/memcontrol.c|  2 +-
 mm/page_alloc.c| 16 ++--
 mm/vmstat.c|  8 +++---
 9 files changed, 55 insertions(+), 61 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0cf13e31603c..508b80f6329b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -440,9 +440,9 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
   nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
   nid, K(i.sharedram),
-  nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
+  nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
 #ifdef CONFIG_SHADOW_CALL_STACK
-  nid, sum_zone_node_page_state(nid, NR_KERNEL_SCS_KB),
+  nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
 #endif
   nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
   nid, 0UL,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index f262bff3ca31..887a5532e449 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,10 +101,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "SReclaimable:   ", sreclaimable);
show_val_kb(m, "SUnreclaim: ", sunreclaim);
seq_printf(m, "KernelStack:%8lu kB\n",
-  global_zone_page_state(NR_KERNEL_STACK_KB));
+  global_node_page_state(NR_KERNEL_STACK_KB));
 #ifdef CONFIG_SHADOW_CALL_STACK
seq_printf(m, "ShadowCallStack:%8lu kB\n",
-  global_zone_page_state(NR_KERNEL_SCS_KB));
+  global_node_page_state(NR_KERNEL_SCS_KB));
 #endif
show_val_kb(m, "PageTables: ",
global_zone_page_state(NR_PAGETABLE));
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ba1e42715ecf..b8f52a3fed90 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -33,8 +33,6 @@ enum memcg_stat_item {
MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
MEMCG_PERCPU_B,
-   /* XXX: why are these zone and not node counters? */
-   MEMCG_KERNEL_STACK_KB,
MEMCG_NR_STAT,
 };
 
@@ -737,8 +735,19 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum 
node_stat_item idx,
 void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val);
+
 void mod_memcg_obj_state(void *p, int idx, int val);
 
+static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx,
+int val)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __mod_lruvec_slab_state(p, idx, val);
+   local_irq_restore(flags);
+}
+
 static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
  enum node_stat_item idx, int val)
 {
@@ -1159,6 +1168,14 @@ static inline void __mod_lruvec_slab_state(void *p, enum 
node_stat_item idx,
__mod_node_page_state(page_pgdat(page), idx, val);
 }
 
+static inline void mod_lruvec_slab_state(void *p, enum node_stat_item idx,
+int val)
+{
+   struct page *page = virt_to_head_page(p);
+
+   mod_node_page_state(page_pgdat(page), idx, val);
+}
+
 static inline void mod_memcg_obj_state(void *p, int idx, int val)
 {
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e859444927a..b79f73ce8b57 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -153,10 +153,6 @@ enum zone_stat_item {
NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages 
*/
NR_MLOCK,   /* mlock()ed pages found and moved off LRU */
NR_PAGETABLE,   /* used for pagetables */
-   NR_KERNEL_STACK_KB, /* measured in KiB */
-#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
-   NR_KERNEL_SCS_KB,   /* measured in KiB */
-#endif
/* Second 128 byte cacheline */
NR_BOUNCE,
 #if IS_ENABLED(CONFIG_

Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

2020-06-30 Thread Shakeel Butt
On Mon, Jun 29, 2020 at 4:48 PM Dave Hansen  wrote:
>
> I've been sitting on these for too long.  The main purpose of this
> post is to have a public discussion with the other folks who are
> interested in this functionality and converge on a single
> implementation.
>
> This set directly incorporates a statistics patch from Yang Shi and
> also includes one to ensure good behavior with cgroup reclaim which
> was very closely derived from this series:
>
> 
> https://lore.kernel.org/linux-mm/1560468577-101178-1-git-send-email-yang@linux.alibaba.com/
>
> Since the last post, the major changes are:
>  - Added patch to skip migration when doing cgroup reclaim
>  - Added stats patch from Yang Shi
>
> The full series is also available here:
>
> https://github.com/hansendc/linux/tree/automigrate-20200629
>
> --
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
> 
> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2...@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37a4a...@viggo.jf.intel.com
>

I have a high-level question. Given a reclaim request for a set of
nodes, if there is no demotion path out of that set, should the kernel
still consider migrations within the set of nodes? Basically, should the
decision to allow migrations within a reclaim request be taken at the
node level or at the reclaim-request (or allocation) level?


Re: [RFC][PATCH 0/8] Migrate Pages in lieu of discard

2020-06-30 Thread Shakeel Butt
On Tue, Jun 30, 2020 at 11:51 AM Dave Hansen  wrote:
>
> On 6/30/20 11:36 AM, Shakeel Butt wrote:
> >> This is part of a larger patch set.  If you want to apply these or
> >> play with them, I'd suggest using the tree from here.  It includes
> >> autonuma-based hot page promotion back to DRAM:
> >>
> >> 
> >> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2...@intel.com
> >>
> >> This is also all based on an upstream mechanism that allows
> >> persistent memory to be onlined and used as if it were volatile:
> >>
> >> http://lkml.kernel.org/r/20190124231441.37a4a...@viggo.jf.intel.com
> >>
> > I have a high level question. Given a reclaim request for a set of
> > nodes, if there is no demotion path out of that set, should the kernel
> > still consider the migrations within the set of nodes?
>
> OK, to be specific, we're talking about a case where we've arrived at
> try_to_free_pages()

Yes.

> and, say, all of the nodes on the system are set in
> sc->nodemask?  Isn't the common case that all nodes are set in
> sc->nodemask?

Depends on the workload but for normal users, yes.

> Since there is never a demotion path out of the set of
> all nodes, the common case would be that there is no demotion path out
> of a reclaim node set.
>
> If that's true, I'd say that the kernel still needs to consider
> migrations even within the set.

In my opinion it should be a user-defined policy, but I think that
discussion is orthogonal to this patch series. As I understand it, this
patch series aims to add the migration-within-reclaim infrastructure;
IMO the policies, optimizations, heuristics can come later.

BTW is this proposal only for systems having multiple tiers of memory?
Can a multi-node DRAM-only system take advantage of this proposal? For
example, I have a system with two DRAM nodes running two jobs
hardwalled to each node. For each job the other node is a kind of
low-tier memory. If I can describe the per-job demotion paths, then
these jobs can take advantage of this proposal during occasional
peaks.


Re: [PATCH] mm: memcontrol: avoid workload stalls when lowering memory.high

2020-07-10 Thread Shakeel Butt
On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko  wrote:
>
> On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > Memory.high limit is implemented in a way such that the kernel
> > penalizes all threads which are allocating a memory over the limit.
> > Forcing all threads into the synchronous reclaim and adding some
> > artificial delays allows to slow down the memory consumption and
> > potentially give some time for userspace oom handlers/resource control
> > agents to react.
> >
> > It works nicely if the memory usage is hitting the limit from below,
> > however it works sub-optimal if a user adjusts memory.high to a value
> > way below the current memory usage. It basically forces all workload
> > threads (doing any memory allocations) into the synchronous reclaim
> > and sleep. This makes the workload completely unresponsive for
> > a long period of time and can also lead to a system-wide contention on
> > lru locks. It can happen even if the workload is not actually tight on
> > memory and has, for example, a ton of cold pagecache.
> >
> > In the current implementation writing to memory.high causes an atomic
> > update of page counter's high value followed by an attempt to reclaim
> > enough memory to fit into the new limit. To fix the problem described
> > above, all we need is to change the order of execution: try to push
> > the memory usage under the limit first, and only then set the new
> > high limit.
>
> Shakeel would this help with your pro-active reclaim usecase? It would
> require to reset the high limit right after the reclaim returns which is
> quite ugly but it would at least not require a completely new interface.
> You would simply do
> high = current - to_reclaim
> echo $high > memory.high
> echo infinity > memory.high # To prevent direct reclaim
> # allocation stalls
>

This will reduce the chance of stalls but the interface is still
non-delegatable, i.e. applications cannot change their own memory.high
for use-cases like application-controlled proactive reclaim and uswapd.

One more ugly fix would be to add one more layer of cgroup and have the
application use the memory.high of that layer to fulfill such use-cases.

I think providing a new interface would allow us to have a much cleaner
solution than settling on a bunch of ugly hacks.

> The primary reason to set the high limit in advance was to catch
> potential runaways more effectively because they would just get
> throttled while memory_high_write is reclaiming. With this change
> the reclaim here might be just playing never ending catch up. On the
> plus side a break out from the reclaim loop would just enforce the limit
> so if the operation takes too long then the reclaim burden will move
> over to consumers eventually. So I do not see any real danger.
>
> > Signed-off-by: Roman Gushchin 
> > Reported-by: Domas Mituzas 
> > Cc: Johannes Weiner 
> > Cc: Michal Hocko 
> > Cc: Tejun Heo 
> > Cc: Shakeel Butt 
> > Cc: Chris Down 
>
> Acked-by: Michal Hocko 
>

This patch seems reasonable on its own.

Reviewed-by: Shakeel Butt 


Re: [PATCH] mm: memcontrol: avoid workload stalls when lowering memory.high

2020-07-10 Thread Shakeel Butt
On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin  wrote:
>
> On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko  wrote:
> > >
> > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > Memory.high limit is implemented in a way such that the kernel
> > > > penalizes all threads which are allocating a memory over the limit.
> > > > Forcing all threads into the synchronous reclaim and adding some
> > > > artificial delays allows to slow down the memory consumption and
> > > > potentially give some time for userspace oom handlers/resource control
> > > > agents to react.
> > > >
> > > > It works nicely if the memory usage is hitting the limit from below,
> > > > however it works sub-optimal if a user adjusts memory.high to a value
> > > > way below the current memory usage. It basically forces all workload
> > > > threads (doing any memory allocations) into the synchronous reclaim
> > > > and sleep. This makes the workload completely unresponsive for
> > > > a long period of time and can also lead to a system-wide contention on
> > > > lru locks. It can happen even if the workload is not actually tight on
> > > > memory and has, for example, a ton of cold pagecache.
> > > >
> > > > In the current implementation writing to memory.high causes an atomic
> > > > update of page counter's high value followed by an attempt to reclaim
> > > > enough memory to fit into the new limit. To fix the problem described
> > > > above, all we need is to change the order of execution: try to push
> > > > the memory usage under the limit first, and only then set the new
> > > > high limit.
> > >
> > > Shakeel would this help with your pro-active reclaim usecase? It would
> > > require to reset the high limit right after the reclaim returns which is
> > > quite ugly but it would at least not require a completely new interface.
> > > You would simply do
> > > high = current - to_reclaim
> > > echo $high > memory.high
> > > echo infinity > memory.high # To prevent direct reclaim
> > > # allocation stalls
> > >
> >
> > This will reduce the chance of stalls but the interface is still
> > non-delegatable i.e. applications can not change their own memory.high
> > for the use-cases like application controlled proactive reclaim and
> > uswapd.
>
> Can you, please, elaborate a bit more on this? I didn't understand
> why.
>

Sure. Do we want to make memory.high a CFTYPE_NS_DELEGATABLE file? I
don't think so; otherwise any job on a system could change its own
memory.high and adversely impact the isolation and memory scheduling of
the system.

Next we have to agree that there are valid use-cases for allowing
applications to reclaim from their own cgroups, and I think uswapd and
proactive reclaim are such use-cases. Let's suppose memory.high is the
only way to trigger reclaim but the application cannot write to its own
top-level memory.high: it then has to create a dummy cgroup whose
memory.high it does have write access to, and move itself into that
dummy cgroup in order to use memory.high to trigger reclaim for
uswapd/proactive reclaim. A rough sketch of that workaround is below.
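
For illustration only, assuming the application has been delegated a
directory under /sys/fs/cgroup/job, the dummy-cgroup workaround would
look roughly like the following. The paths, the 80M value and the lack
of error handling are assumptions; this just mirrors the steps
described above.

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define JOB   "/sys/fs/cgroup/job"
#define DUMMY JOB "/reclaimer"

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	char pid[32];

	/* Create the dummy child cgroup and move ourselves into it. */
	mkdir(DUMMY, 0755);
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_str(DUMMY "/cgroup.procs", pid);

	/* Lower memory.high to trigger reclaim, then lift it again. */
	write_str(DUMMY "/memory.high", "80M");
	write_str(DUMMY "/memory.high", "max");
	return 0;
}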


[PATCH] mm: vmscan: consistent update to pgrefill

2020-07-10 Thread Shakeel Butt
The vmstat pgrefill is useful together with the pgscan and pgsteal stats
to measure the reclaim efficiency. However vmstat's pgrefill is not
updated consistently at the system level: it gets updated for both global
and memcg reclaim, whereas pgscan and pgsteal are updated only for global
reclaim. So, update pgrefill only for global reclaim. If someone is
interested in the stats representing both system-level and memcg-level
reclaim, then consult the root memcg's memory.stat instead of
/proc/vmstat.

Signed-off-by: Shakeel Butt 
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5215840ee217..4167b0cc1784 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2030,7 +2030,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 
-   __count_vm_events(PGREFILL, nr_scanned);
+   if (!cgroup_reclaim(sc))
+   __count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
	spin_unlock_irq(&pgdat->lru_lock);
-- 
2.27.0.383.g050319c2ae-goog



Re: [PATCH] mm: vmscan: consistent update to pgrefill

2020-07-10 Thread Shakeel Butt
On Fri, Jul 10, 2020 at 7:32 PM Roman Gushchin  wrote:
>
> On Fri, Jul 10, 2020 at 06:14:59PM -0700, Shakeel Butt wrote:
> > The vmstat pgrefill is useful together with pgscan and pgsteal stats to
> > measure the reclaim efficiency. However vmstat's pgrefill is not updated
> > consistently at system level. It gets updated for both global and memcg
> > reclaim however pgscan and pgsteal are updated for only global reclaim.
> > So, update pgrefill only for global reclaim. If someone is interested in
> > the stats representing both system level as well as memcg level reclaim,
> > then consult the root memcg's memory.stat instead of /proc/vmstat.
> >
> > Signed-off-by: Shakeel Butt 
>
> So you went into the opposite direction from the "previous version"
> ( https://lkml.org/lkml/2020/5/7/1464 ) ?
>

Yes, because we already had those stats in the root memcg, and exposing
the root memcg's memory.stat resolved the issue.

> Acked-by: Roman Gushchin 

Thanks a lot.


Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting

2020-06-17 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> From: Johannes Weiner 
>
> The reference counting of a memcg is currently coupled directly to how
> many 4k pages are charged to it. This doesn't work well with Roman's
> new slab controller, which maintains pools of objects and doesn't want
> to keep an extra balance sheet for the pages backing those objects.
>
> This unusual refcounting design (reference counts usually track
> pointers to an object) is only for historical reasons: memcg used to
> not take any css references and simply stalled offlining until all
> charges had been reparented and the page counters had dropped to
> zero. When we got rid of the reparenting requirement, the simple
> mechanical translation was to take a reference for every charge.
>
> More historical context can be found in commit e8ea14cc6ead ("mm:
> memcontrol: take a css reference for each charged page"),
> commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
> tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
> reclaim from offlined groups").
>
> The new slab controller exposes the limitations in this scheme, so
> let's switch it to a more idiomatic reference counting model based on
> actual kernel pointers to the memcg:
>
> - The per-cpu stock holds a reference to the memcg it's caching
>
> - User pages hold a reference for their page->mem_cgroup. Transparent
>   huge pages will no longer acquire tail references in advance, we'll
>   get them if needed during the split.
>
> - Kernel pages hold a reference for their page->mem_cgroup
>
> - Pages allocated in the root cgroup will acquire and release css
>   references for simplicity. css_get() and css_put() optimize that.
>
> - The current memcg_charge_slab() already hacked around the per-charge
>   references; this change gets rid of that as well.
>
> Roman:
> 1) Rebased on top of the current mm tree: added css_get() in
>mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
> 2) I've reformatted commit references in the commit log to make
>checkpatch.pl happy.
>
> Signed-off-by: Johannes Weiner 
> Signed-off-by: Roman Gushchin 
> Acked-by: Roman Gushchin 
> ---
>  mm/memcontrol.c | 37 +
>  mm/slab.h   |  2 --
>  2 files changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d18bf93e0f19..80282b2e8b7f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2094,13 +2094,17 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  {
> struct mem_cgroup *old = stock->cached;
>
> +   if (!old)
> +   return;
> +
> if (stock->nr_pages) {
> page_counter_uncharge(&old->memory, stock->nr_pages);
> if (do_memsw_account())
> page_counter_uncharge(&old->memsw, stock->nr_pages);
> -   css_put_many(&old->css, stock->nr_pages);
> stock->nr_pages = 0;
> }
> +
> +   css_put(&old->css);
> stock->cached = NULL;
>  }
>
> @@ -2136,6 +2140,7 @@ static void refill_stock(struct mem_cgroup *memcg, 
> unsigned int nr_pages)
> stock = this_cpu_ptr(&memcg_stock);
> if (stock->cached != memcg) { /* reset if necessary */
> drain_stock(stock);
> +   css_get(&memcg->css);
> stock->cached = memcg;
> }
> stock->nr_pages += nr_pages;
> @@ -2594,12 +2599,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
> gfp_mask,
> page_counter_charge(&memcg->memory, nr_pages);
> if (do_memsw_account())
> page_counter_charge(&memcg->memsw, nr_pages);
> -   css_get_many(&memcg->css, nr_pages);
>
> return 0;
>
>  done_restock:
> -   css_get_many(&memcg->css, batch);
> if (batch > nr_pages)
> refill_stock(memcg, batch - nr_pages);
>
> @@ -2657,8 +2660,6 @@ static void cancel_charge(struct mem_cgroup *memcg, 
> unsigned int nr_pages)
> page_counter_uncharge(&memcg->memory, nr_pages);
> if (do_memsw_account())
> page_counter_uncharge(&memcg->memsw, nr_pages);
> -
> -   css_put_many(&memcg->css, nr_pages);
>  }
>  #endif
>
> @@ -2964,6 +2965,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t 
> gfp, int order)
> if (!ret) {
> page->mem_cgroup = memcg;
> __SetPageKmemcg(page);
> +   return 0;
> }
> }
> css_put(&memcg->css);
> @@ -2986,12 +2988,11 @@ void __memcg_kmem_uncharge_page(struct page *page, 
> int order)
> VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> __memcg_kmem_uncharge(memcg, nr_pages);
> page->mem_cgroup = NULL;
> +   css_put(&memcg->css);
>
> /* slab pages do not have PageKmemcg flag set */
> if (PageKmemcg(page))
> __ClearPageKmemcg(page);
> -
> -   css_put_many(&memcg->css, nr_pages);
>  }
>  #endif /* CONFIG_MEMCG_KMEM */
>
> @@ -3003,13 +3004,16 @@ void 

Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting

2020-06-18 Thread Shakeel Butt
Not sure if my email went through, so, re-sending.

On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> From: Johannes Weiner 
>
[...]
> @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page *page, 
> int order)
>   */
>  void mem_cgroup_split_huge_fixup(struct page *head)
>  {
> +   struct mem_cgroup *memcg = head->mem_cgroup;
> int i;
>
> if (mem_cgroup_disabled())
> return;
>

A memcg NULL check is needed here.

> -   for (i = 1; i < HPAGE_PMD_NR; i++)
> -   head[i].mem_cgroup = head->mem_cgroup;
> +   for (i = 1; i < HPAGE_PMD_NR; i++) {
> +   css_get(&memcg->css);
> +   head[i].mem_cgroup = memcg;
> +   }
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>


Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting

2020-06-18 Thread Shakeel Butt
On Thu, Jun 18, 2020 at 6:08 PM Roman Gushchin  wrote:
>
> On Thu, Jun 18, 2020 at 07:55:35AM -0700, Shakeel Butt wrote:
> > Not sure if my email went through, so, re-sending.
> >
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > >
> > > From: Johannes Weiner 
> > >
> > [...]
> > > @@ -3003,13 +3004,16 @@ void __memcg_kmem_uncharge_page(struct page 
> > > *page, int order)
> > >   */
> > >  void mem_cgroup_split_huge_fixup(struct page *head)
> > >  {
> > > +   struct mem_cgroup *memcg = head->mem_cgroup;
> > > int i;
> > >
> > > if (mem_cgroup_disabled())
> > > return;
> > >
> >
> > A memcg NULL check is needed here.
>
> Hm, it seems like the only way how it can be NULL is if mem_cgroup_disabled() 
> is true:
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
> unsigned int nr_pages = hpage_nr_pages(page);
> struct mem_cgroup *memcg = NULL;
> int ret = 0;
>
> if (mem_cgroup_disabled())
> goto out;
>
> <...>
>
> if (!memcg)
> memcg = get_mem_cgroup_from_mm(mm);
>
> ret = try_charge(memcg, gfp_mask, nr_pages);
> if (ret)
> goto out_put;
>
> css_get(&memcg->css);
> commit_charge(page, memcg);
>
>
> Did you hit this issue in reality? The only possible scenario I can imagine
> is if the page was allocated before enabling memory cgroups.
>
> Are you about this case?
>
> Otherwise we put root_mem_cgroup there.
>

Oh yes, you are right. I am confusing this with kmem pages for the root
memcg, where we don't set page->mem_cgroup, and this patch series
should be changing that.

Shakeel


Re: [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting

2020-06-18 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> From: Johannes Weiner 
>
> The reference counting of a memcg is currently coupled directly to how
> many 4k pages are charged to it. This doesn't work well with Roman's
> new slab controller, which maintains pools of objects and doesn't want
> to keep an extra balance sheet for the pages backing those objects.
>
> This unusual refcounting design (reference counts usually track
> pointers to an object) is only for historical reasons: memcg used to
> not take any css references and simply stalled offlining until all
> charges had been reparented and the page counters had dropped to
> zero. When we got rid of the reparenting requirement, the simple
> mechanical translation was to take a reference for every charge.
>
> More historical context can be found in commit e8ea14cc6ead ("mm:
> memcontrol: take a css reference for each charged page"),
> commit 64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning
> tricks") and commit b2052564e66d ("mm: memcontrol: continue cache
> reclaim from offlined groups").
>
> The new slab controller exposes the limitations in this scheme, so
> let's switch it to a more idiomatic reference counting model based on
> actual kernel pointers to the memcg:
>
> - The per-cpu stock holds a reference to the memcg it's caching
>
> - User pages hold a reference for their page->mem_cgroup. Transparent
>   huge pages will no longer acquire tail references in advance, we'll
>   get them if needed during the split.
>
> - Kernel pages hold a reference for their page->mem_cgroup
>
> - Pages allocated in the root cgroup will acquire and release css
>   references for simplicity. css_get() and css_put() optimize that.
>
> - The current memcg_charge_slab() already hacked around the per-charge
>   references; this change gets rid of that as well.
>
> Roman:
> 1) Rebased on top of the current mm tree: added css_get() in
>mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
> 2) I've reformatted commit references in the commit log to make
>checkpatch.pl happy.
>
> Signed-off-by: Johannes Weiner 
> Signed-off-by: Roman Gushchin 
> Acked-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> This is fairly big but mostly red patch, which makes all accounted
> slab allocations use a single set of kmem_caches instead of
> creating a separate set for each memory cgroup.
>
> Because the number of non-root kmem_caches is now capped by the number
> of root kmem_caches, there is no need to shrink or destroy them
> prematurely. They can be perfectly destroyed together with their
> root counterparts. This allows to dramatically simplify the
> management of non-root kmem_caches and delete a ton of code.
>
> This patch performs the following changes:
> 1) introduces memcg_params.memcg_cache pointer to represent the
>kmem_cache which will be used for all non-root allocations
> 2) reuses the existing memcg kmem_cache creation mechanism
>to create memcg kmem_cache on the first allocation attempt
> 3) memcg kmem_caches are named <kmemcache_name>-memcg,
>e.g. dentry-memcg
> 4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache
>or schedule it's creation and return the root cache
> 5) removes almost all non-root kmem_cache management code
>(separate refcounter, reparenting, shrinking, etc)
> 6) makes slab debugfs to display root_mem_cgroup css id and never
>show :dead and :deact flags in the memcg_slabinfo attribute.
>
> Following patches in the series will simplify the kmem_cache creation.
>
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 

This is a very satisfying patch.

Reviewed-by: Shakeel Butt 


Re: [PATCH v16 00/14] Introduce Data Access MONitor (DAMON)

2020-06-22 Thread Shakeel Butt
Hi SeongJae,

On Mon, Jun 22, 2020 at 1:42 AM SeongJae Park  wrote:
>
> Last week, this patchset received 5 'Reviewed-by' tags, but no further 
> comments
> for changes.  I updated the documentation, but the change is only small.  For
> that reason, I'm only asking for more reviews rather than posting the whole series
> again.  Any comment is welcome.
>
>
> Thanks,
> SeongJae Park


This is on my review TODO list and I will get to it in a couple of
days. No need to post new versions.

thanks,
Shakeel


Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Deprecate memory.kmem.slabinfo.
>
> An empty file will be presented if corresponding config options are
> enabled.
>
> The interface is implementation dependent, isn't present in cgroup v2,
> and is generally useful only for core mm debugging purposes. In other
> words, it doesn't provide any value for the absolute majority of users.
>
> A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> and also allows to get any additional information without a need
> to recompile the kernel.
>
> If a drgn-based solution is too slow for a task, a bpf-based tracing
> tool can be used, which can easily keep track of all slab allocations
> belonging to a memory cgroup.
>
> Signed-off-by: Roman Gushchin 
> Acked-by: Johannes Weiner 
> Reviewed-by: Vlastimil Babka 

Hi Roman,

I am not against removing the memory.kmem.slabinfo interface but I
would like to have an alternative solution more accessible than
tools/cgroup/slabinfo.py.

In our case, we don't have ssh access, and if we need something for
debugging, it is much preferable to give SREs a file to read. After
review, that file will be added to a whitelist and then we can read it
directly through automated tools without approval for each request.

I am just wondering if a file interface can be provided for whatever
tools/cgroup/slabinfo.py is providing.

Shakeel


Re: [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Because the number of non-root kmem_caches doesn't depend on the
> number of memory cgroups anymore and is generally not very big,
> there is no more need for a dedicated workqueue.
>
> Also, as there is no more need to pass any arguments to the
> memcg_create_kmem_cache() except the root kmem_cache, it's
> possible to just embed the work structure into the kmem_cache
> and avoid the dynamic allocation of the work structure.
>
> This will also simplify the synchronization: for each root kmem_cache
> there is only one work. So there will be no more concurrent attempts
> to create a non-root kmem_cache for a root kmem_cache: the second and
> all following attempts to queue the work will fail.
>
>
> On the kmem_cache destruction path there is no more need to call the
> expensive flush_workqueue() and wait for all pending works to be
> finished. Instead, cancel_work_sync() can be used to cancel/wait for
> only one work.
>
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 

Why not pre-allocate the non-root kmem_cache at the kmem_cache
creation time? No need for work_struct, queue_work() or
cancel_work_sync() at all.


Re: [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> memcg_accumulate_slabinfo() is never called with a non-root
> kmem_cache as a first argument, so the is_root_cache(s) check
> is redundant and can be removed without any functional change.
>
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 

Reviewed-by: Shakeel Butt 


Re: [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Currently there are two lists of kmem_caches:
> 1) slab_caches, which contains all kmem_caches,
> 2) slab_root_caches, which contains only root kmem_caches.
>
> And there is some preprocessor magic to have a single list
> if CONFIG_MEMCG_KMEM isn't enabled.
>
> It was required earlier because the number of non-root kmem_caches
> was proportional to the number of memory cgroups and could reach
> really big values. Now, when it cannot exceed the number of root
> kmem_caches, there is really no reason to maintain two lists.
>
> We never iterate over the slab_root_caches list on any hot paths,
> so it's perfectly fine to iterate over slab_caches and filter out
> non-root kmem_caches.
>
> It allows to remove a lot of config-dependent code and two pointers
> from the kmem_cache structure.
>
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 

Reviewed-by: Shakeel Butt 


Re: [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 10:40 AM Roman Gushchin  wrote:
>
> On Mon, Jun 22, 2020 at 10:29:29AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > >
> > > Because the number of non-root kmem_caches doesn't depend on the
> > > number of memory cgroups anymore and is generally not very big,
> > > there is no more need for a dedicated workqueue.
> > >
> > > Also, as there is no more need to pass any arguments to the
> > > memcg_create_kmem_cache() except the root kmem_cache, it's
> > > possible to just embed the work structure into the kmem_cache
> > > and avoid the dynamic allocation of the work structure.
> > >
> > > This will also simplify the synchronization: for each root kmem_cache
> > > there is only one work. So there will be no more concurrent attempts
> > > to create a non-root kmem_cache for a root kmem_cache: the second and
> > > all following attempts to queue the work will fail.
> > >
> > >
> > > On the kmem_cache destruction path there is no more need to call the
> > > expensive flush_workqueue() and wait for all pending works to be
> > > finished. Instead, cancel_work_sync() can be used to cancel/wait for
> > > only one work.
> > >
> > > Signed-off-by: Roman Gushchin 
> > > Reviewed-by: Vlastimil Babka 
> >
> > Why not pre-allocate the non-root kmem_cache at the kmem_cache
> > creation time? No need for work_struct, queue_work() or
> > cancel_work_sync() at all.
>
> Simply because some kmem_caches are created very early, so we don't
> even know at that time if we will need memcg slab caches. But this
> code is likely going away if we're going with a single set for all
> allocations.
>

LGTM.

Reviewed-by: Shakeel Butt 


Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 11:02 AM Roman Gushchin  wrote:
>
> On Mon, Jun 22, 2020 at 10:12:46AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > >
> > > Deprecate memory.kmem.slabinfo.
> > >
> > > An empty file will be presented if corresponding config options are
> > > enabled.
> > >
> > > The interface is implementation dependent, isn't present in cgroup v2,
> > > and is generally useful only for core mm debugging purposes. In other
> > > words, it doesn't provide any value for the absolute majority of users.
> > >
> > > A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> > > It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> > > and also allows to get any additional information without a need
> > > to recompile the kernel.
> > >
> > > If a drgn-based solution is too slow for a task, a bpf-based tracing
> > > tool can be used, which can easily keep track of all slab allocations
> > > belonging to a memory cgroup.
> > >
> > > Signed-off-by: Roman Gushchin 
> > > Acked-by: Johannes Weiner 
> > > Reviewed-by: Vlastimil Babka 
> >
> > Hi Roman,
> >
> > I am not against removing the memory.kmem.slabinfo interface but I
> > would like to have an alternative solution more accessible than
> > tools/cgroup/slabinfo.py.
> >
> > In our case, we don't have ssh access and if we need something for
> > debugging, it is much more preferable to provide a file to read to
> > SREs. After the review, that file will be added to a whitelist and
> > then we can directly read that file through automated tools without
> > approval for each request.
> >
> > I am just wondering if a file interface can be provided for whatever
> > tools/cgroup/slabinfo.py is providing.
> >
> > Shakeel
>
> Hello, Shakeel!
>
> I understand your point, but Idk how much we wanna make this code a part
> of the kernel and the cgroup interface.

No need for the cgroup interface. I was thinking of a new interface
like /proc/slabinfo_full which reports the active objects for each
kmem_cache and memcg pair.

> The problem is that reading
> from it will be really slow in comparison to all other cgroup interface
> files. Idk if Google's version of SLAB has a list of all slab pages,
> but if not (as in generic SLUB case), it requires scanning of the whole RAM.

That's a bummer. Does the drgn-based script scan the whole RAM?


Re: [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 11:25 AM Roman Gushchin  wrote:
>
> On Mon, Jun 22, 2020 at 11:09:47AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 22, 2020 at 11:02 AM Roman Gushchin  wrote:
> > >
> > > On Mon, Jun 22, 2020 at 10:12:46AM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > > > >
> > > > > Deprecate memory.kmem.slabinfo.
> > > > >
> > > > > An empty file will be presented if corresponding config options are
> > > > > enabled.
> > > > >
> > > > > The interface is implementation dependent, isn't present in cgroup v2,
> > > > > and is generally useful only for core mm debugging purposes. In other
> > > > > words, it doesn't provide any value for the absolute majority of 
> > > > > users.
> > > > >
> > > > > A drgn-based replacement can be found in tools/cgroup/slabinfo.py .
> > > > > It does support cgroup v1 and v2, mimics memory.kmem.slabinfo output
> > > > > and also allows to get any additional information without a need
> > > > > to recompile the kernel.
> > > > >
> > > > > If a drgn-based solution is too slow for a task, a bpf-based tracing
> > > > > tool can be used, which can easily keep track of all slab allocations
> > > > > belonging to a memory cgroup.
> > > > >
> > > > > Signed-off-by: Roman Gushchin 
> > > > > Acked-by: Johannes Weiner 
> > > > > Reviewed-by: Vlastimil Babka 
> > > >
> > > > Hi Roman,
> > > >
> > > > I am not against removing the memory.kmem.slabinfo interface but I
> > > > would like to have an alternative solution more accessible than
> > > > tools/cgroup/slabinfo.py.
> > > >
> > > > In our case, we don't have ssh access and if we need something for
> > > > debugging, it is much more preferable to provide a file to read to
> > > > SREs. After the review, that file will be added to a whitelist and
> > > > then we can directly read that file through automated tools without
> > > > approval for each request.
> > > >
> > > > I am just wondering if a file interface can be provided for whatever
> > > > tools/cgroup/slabinfo.py is providing.
> > > >
> > > > Shakeel
> > >
> > > Hello, Shakeel!
> > >
> > > I understand your point, but Idk how much we wanna make this code a part
> > > of the kernel and the cgroup interface.
> >
> > No need for the cgroup interface. I was thinking of a new interface
> > like /proc/slabinfo_full which tells active objects for each
> > kmem_cache and memcg pair.
>
> To me it's a perfect example where tools like drgn and bpf shine.
> They are more flexible and do not blow the kernel up with
> the debug-only code.
>
> >
> > > The problem is that reading
> > > from it will be really slow in comparison to all other cgroup interface
> > > files. Idk if Google's version of SLAB has a list of all slab pages,
> > > but if not (as in generic SLUB case), it requires scanning of the whole 
> > > RAM.
> >
> > That's a bummer. Does drgn-based script scan the whole RAM?
>
> To be precise, not over all RAM, but over all struct pages.
> Unfortunately, there is no better option with SLUB, as there is no
> comprehensive list of slab pages available. So the only option is to scan
> over all pages with PageSlab flag set.
>

So, SLUB does not have any field available in the struct page to
support the list of slab pages?

Anyways, that's a separate discussion.

Reviewed-by: Shakeel Butt 


Re: [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache()

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> The memcg_kmem_get_cache() function became really trivial,
> so let's just inline it into the single call point:
> memcg_slab_pre_alloc_hook().
>
> It will make the code less bulky and can also help the compiler
> to generate a better code.
>
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 

Reviewed-by: Shakeel Butt 


Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

2020-06-22 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Instead of having two sets of kmem_caches: one for system-wide and
> non-accounted allocations and the second one shared by all accounted
> allocations, we can use just one.
>
> The idea is simple: space for obj_cgroup metadata can be allocated
> on demand and filled only for accounted allocations.
>
> It allows to remove a bunch of code which is required to handle
> kmem_cache clones for accounted allocations. There is no more need
> to create them, accumulate statistics, propagate attributes, etc.
> It's a quite significant simplification.
>
> Also, because the total number of slab_caches is reduced almost twice
> (not all kmem_caches have a memcg clone), some additional memory
> savings are expected. On my devvm it additionally saves about 3.5%
> of slab memory.
>
> Suggested-by: Johannes Weiner 
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 
> ---
[snip]
>  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
>   struct obj_cgroup *objcg,
> - size_t size, void **p)
> + gfp_t flags, size_t size,
> + void **p)
>  {
> struct page *page;
> unsigned long off;
> size_t i;
>
> +   if (!objcg)
> +   return;
> +
> +   flags &= ~__GFP_ACCOUNT;
> for (i = 0; i < size; i++) {
> if (likely(p[i])) {
> page = virt_to_head_page(p[i]);
> +
> +   if (!page_has_obj_cgroups(page) &&

The page is already linked into the kmem_cache; don't you need
synchronization for memcg_alloc_page_obj_cgroups()? What's the reason
to remove this from charge_slab_page()?

> +   memcg_alloc_page_obj_cgroups(page, s, flags)) {
> +   obj_cgroup_uncharge(objcg, obj_full_size(s));
> +   continue;
> +   }
> +
> off = obj_to_index(s, page, p[i]);
> obj_cgroup_get(objcg);
> page_obj_cgroups(page)[off] = objcg;


Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin  wrote:
>
> On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > >
> > > Instead of having two sets of kmem_caches: one for system-wide and
> > > non-accounted allocations and the second one shared by all accounted
> > > allocations, we can use just one.
> > >
> > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > on demand and filled only for accounted allocations.
> > >
> > > It allows to remove a bunch of code which is required to handle
> > > kmem_cache clones for accounted allocations. There is no more need
> > > to create them, accumulate statistics, propagate attributes, etc.
> > > It's a quite significant simplification.
> > >
> > > Also, because the total number of slab_caches is reduced almost twice
> > > (not all kmem_caches have a memcg clone), some additional memory
> > > savings are expected. On my devvm it additionally saves about 3.5%
> > > of slab memory.
> > >
> > > Suggested-by: Johannes Weiner 
> > > Signed-off-by: Roman Gushchin 
> > > Reviewed-by: Vlastimil Babka 
> > > ---
> > [snip]
> > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > >   struct obj_cgroup *objcg,
> > > - size_t size, void **p)
> > > + gfp_t flags, size_t size,
> > > + void **p)
> > >  {
> > > struct page *page;
> > > unsigned long off;
> > > size_t i;
> > >
> > > +   if (!objcg)
> > > +   return;
> > > +
> > > +   flags &= ~__GFP_ACCOUNT;
> > > for (i = 0; i < size; i++) {
> > > if (likely(p[i])) {
> > > page = virt_to_head_page(p[i]);
> > > +
> > > +   if (!page_has_obj_cgroups(page) &&
> >
> > The page is already linked into the kmem_cache, don't you need
> > synchronization for memcg_alloc_page_obj_cgroups().
>
> Hm, yes, in theory we need it. I guess the reason behind why I've never seen 
> any issues
> here is the SLUB percpu partial list.
>
> So in theory we need something like:
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 0a31600a0f5c..44bf57815816 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -237,7 +237,10 @@ static inline int memcg_alloc_page_obj_cgroups(struct 
> page *page,
> if (!vec)
> return -ENOMEM;
>
> -   page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 
> 0x1UL);
> +   if (cmpxchg(&page->obj_cgroups, 0,
> +   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> +   kfree(vec);
> +
> return 0;
>  }
>
>
> But I wonder if we might put it under #ifdef CONFIG_SLAB?
> Or any other ideas how to make it less expensive?
>
> > What's the reason to remove this from charge_slab_page()?
>
> Because at charge_slab_page() we don't know if we'll ever need
> page->obj_cgroups. Some caches might have only few or even zero
> accounted objects.
>

If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
need page->obj_cgroups. The charge_slab_page() happens between
slab_pre_alloc_hook() and slab_post_alloc_hook(), so we should be able
to tell if page->obj_cgroups is needed.


Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 2:15 PM Roman Gushchin  wrote:
>
> On Mon, Jun 22, 2020 at 02:04:29PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin  wrote:
> > >
> > > On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > > > >
> > > > > Instead of having two sets of kmem_caches: one for system-wide and
> > > > > non-accounted allocations and the second one shared by all accounted
> > > > > allocations, we can use just one.
> > > > >
> > > > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > > > on demand and filled only for accounted allocations.
> > > > >
> > > > > It allows to remove a bunch of code which is required to handle
> > > > > kmem_cache clones for accounted allocations. There is no more need
> > > > > to create them, accumulate statistics, propagate attributes, etc.
> > > > > It's a quite significant simplification.
> > > > >
> > > > > Also, because the total number of slab_caches is reduced almost twice
> > > > > (not all kmem_caches have a memcg clone), some additional memory
> > > > > savings are expected. On my devvm it additionally saves about 3.5%
> > > > > of slab memory.
> > > > >
> > > > > Suggested-by: Johannes Weiner 
> > > > > Signed-off-by: Roman Gushchin 
> > > > > Reviewed-by: Vlastimil Babka 
> > > > > ---
> > > > [snip]
> > > > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> > > > >   struct obj_cgroup 
> > > > > *objcg,
> > > > > - size_t size, void **p)
> > > > > + gfp_t flags, size_t 
> > > > > size,
> > > > > + void **p)
> > > > >  {
> > > > > struct page *page;
> > > > > unsigned long off;
> > > > > size_t i;
> > > > >
> > > > > +   if (!objcg)
> > > > > +   return;
> > > > > +
> > > > > +   flags &= ~__GFP_ACCOUNT;
> > > > > for (i = 0; i < size; i++) {
> > > > > if (likely(p[i])) {
> > > > > page = virt_to_head_page(p[i]);
> > > > > +
> > > > > +   if (!page_has_obj_cgroups(page) &&
> > > >
> > > > The page is already linked into the kmem_cache, don't you need
> > > > synchronization for memcg_alloc_page_obj_cgroups().
> > >
> > > Hm, yes, in theory we need it. I guess the reason behind why I've never 
> > > seen any issues
> > > here is the SLUB percpu partial list.
> > >
> > > So in theory we need something like:
> > >
> > > diff --git a/mm/slab.h b/mm/slab.h
> > > index 0a31600a0f5c..44bf57815816 100644
> > > --- a/mm/slab.h
> > > +++ b/mm/slab.h
> > > @@ -237,7 +237,10 @@ static inline int 
> > > memcg_alloc_page_obj_cgroups(struct page *page,
> > > if (!vec)
> > > return -ENOMEM;
> > >
> > > -   page->obj_cgroups = (struct obj_cgroup **) ((unsigned long)vec | 
> > > 0x1UL);
> > > +   if (cmpxchg(&page->obj_cgroups, 0,
> > > +   (struct obj_cgroup **) ((unsigned long)vec | 0x1UL)))
> > > +   kfree(vec);
> > > +
> > > return 0;
> > >  }
> > >
> > >
> > > But I wonder if we might put it under #ifdef CONFIG_SLAB?
> > > Or any other ideas how to make it less expensive?
> > >
> > > > What's the reason to remove this from charge_slab_page()?
> > >
> > > Because at charge_slab_page() we don't know if we'll ever need
> > > page->obj_cgroups. Some caches might have only few or even zero
> > > accounted objects.
> > >
> >
> > If slab_pre_alloc_hook() returns a non-NULL objcg then we definitely
> > need page->obj_cgroups.  The charge_slab_page() happens between
> > slab_pre_alloc_hook() & slab_post_alloc_hook(), so, we should be able
> > to tell if page->obj_cgroups is needed.
>
> Yes, but the opposite is not always true: we can reuse the existing page
> without allocated page->obj_cgroups. In this case charge_slab_page() is
> not involved at all.
>

Hmm yeah, you are right. I missed that.

>
> Or do you mean that we can minimize the amount of required synchronization
> by allocating some obj_cgroups vectors from charge_slab_page()?

One optimization would be to always pre-allocate page->obj_cgroups for
kmem_caches with SLAB_ACCOUNT.
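
Something like the following could do it (a rough sketch on top of this
series, untested; memcg_alloc_page_obj_cgroups() is the helper added
earlier in the series, the wrapper name is made up). It would be called
from the slab page allocation/charge path:

	/*
	 * Sketch: eagerly allocate the obj_cgroups vector for caches
	 * created with SLAB_ACCOUNT, since almost every object in such
	 * a cache ends up accounted anyway.  Other caches keep the lazy
	 * allocation + cmpxchg path discussed above.
	 */
	static inline void maybe_prealloc_obj_cgroups(struct page *page,
						      struct kmem_cache *s,
						      gfp_t gfp)
	{
		if (memcg_kmem_enabled() && (s->flags & SLAB_ACCOUNT))
			memcg_alloc_page_obj_cgroups(page, s, gfp);
	}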


Re: [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 2:58 PM Roman Gushchin  wrote:
>
> On Mon, Jun 22, 2020 at 02:28:54PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 22, 2020 at 2:15 PM Roman Gushchin  wrote:
> > >
> > > On Mon, Jun 22, 2020 at 02:04:29PM -0700, Shakeel Butt wrote:
> > > > On Mon, Jun 22, 2020 at 1:37 PM Roman Gushchin  wrote:
> > > > >
> > > > > On Mon, Jun 22, 2020 at 12:21:28PM -0700, Shakeel Butt wrote:
> > > > > > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > > > > > >
> > > > > > > Instead of having two sets of kmem_caches: one for system-wide and
> > > > > > > non-accounted allocations and the second one shared by all 
> > > > > > > accounted
> > > > > > > allocations, we can use just one.
> > > > > > >
> > > > > > > The idea is simple: space for obj_cgroup metadata can be allocated
> > > > > > > on demand and filled only for accounted allocations.
> > > > > > >
> > > > > > > It allows us to remove a bunch of code which is required to handle
> > > > > > > kmem_cache clones for accounted allocations. There is no more need
> > > > > > > to create them, accumulate statistics, propagate attributes, etc.
> > > > > > > It's quite a significant simplification.
> > > > > > >
> > > > > > > Also, because the total number of slab_caches is almost halved
> > > > > > > (not all kmem_caches have a memcg clone), some additional memory
> > > > > > > savings are expected. On my devvm it additionally saves about 3.5%
> > > > > > > of slab memory.
> > > > > > >
> > > > > > > Suggested-by: Johannes Weiner 
> > > > > > > Signed-off-by: Roman Gushchin 
> > > > > > > Reviewed-by: Vlastimil Babka 
> > > > > > > ---
> > > > > > [snip]
> > > > > > >  static inline void memcg_slab_post_alloc_hook(struct kmem_cache 
> > > > > > > *s,
> > > > > > >   struct obj_cgroup 
> > > > > > > *objcg,
> > > > > > > - size_t size, void 
> > > > > > > **p)
> > > > > > > + gfp_t flags, size_t 
> > > > > > > size,
> > > > > > > + void **p)
> > > > > > >  {
> > > > > > > struct page *page;
> > > > > > > unsigned long off;
> > > > > > > size_t i;
> > > > > > >
> > > > > > > +   if (!objcg)
> > > > > > > +   return;
> > > > > > > +
> > > > > > > +   flags &= ~__GFP_ACCOUNT;
> > > > > > > for (i = 0; i < size; i++) {
> > > > > > > if (likely(p[i])) {
> > > > > > > page = virt_to_head_page(p[i]);
> > > > > > > +
> > > > > > > +   if (!page_has_obj_cgroups(page) &&
> > > > > >
> > > > > > The page is already linked into the kmem_cache, don't you need
> > > > > > synchronization for memcg_alloc_page_obj_cgroups().
> > > > >
> > > > > Hm, yes, in theory we need it. I guess the reason why I've never
> > > > > seen any issues here is the SLUB percpu partial list.
> > > > >
> > > > > So in theory we need something like:
> > > > >
> > > > > diff --git a/mm/slab.h b/mm/slab.h
> > > > > index 0a31600a0f5c..44bf57815816 100644
> > > > > --- a/mm/slab.h
> > > > > +++ b/mm/slab.h
> > > > > @@ -237,7 +237,10 @@ static inline int 
> > > > > memcg_alloc_page_obj_cgroups(struct page *page,
> > > > > if (!vec)
> > > > > return -ENOMEM;
> > > > >
> > > > > -   page->obj_cgroups = (struct obj_cgroup **) ((unsigned 
> > > > > long)vec | 0x1UL);
> > > > > +   if (cmpxchg(&page->obj_cgroups

Re: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

2020-06-22 Thread Shakeel Butt
On Mon, Jun 22, 2020 at 5:22 PM  wrote:
>
> From: "Paul E. McKenney" 
>
> A large process running on a heavily loaded system can encounter the
> following RCU CPU stall warning:
>
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu:     3-: (20998 ticks this GP) idle=4ea/1/0x4002 
> softirq=556558/556558 fqs=5190
>       (t=21013 jiffies g=1005461 q=132576)
>   NMI backtrace for cpu 3
>   CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 
> 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
>   Hardware name: Wiwynn   HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
>   Call Trace:
>
>dump_stack+0x46/0x60
>nmi_cpu_backtrace.cold.3+0x13/0x50
>? lapic_can_unplug_cpu.cold.27+0x34/0x34
>nmi_trigger_cpumask_backtrace+0xba/0xca
>rcu_dump_cpu_stacks+0x99/0xc7
>rcu_sched_clock_irq.cold.87+0x1aa/0x397
>? tick_sched_do_timer+0x60/0x60
>update_process_times+0x28/0x60
>tick_sched_timer+0x37/0x70
>__hrtimer_run_queues+0xfe/0x270
>hrtimer_interrupt+0xf4/0x210
>smp_apic_timer_interrupt+0x5e/0x120
>apic_timer_interrupt+0xf/0x20
>
>   RIP: 0010:kmem_cache_free+0x223/0x300
>   Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 
> 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 
> 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
>   RSP: 0018:c9000e8e3da8 EFLAGS: 0206 ORIG_RAX: ff13
>   RAX: 0002 RBX: 88861b9de960 RCX: 0030
>   RDX: fffe41e8 RSI: 60777fe3a100 RDI: 0001be18
>   RBP: ea00186e7780 R08:  R09: 
>   R10: 88861b9dea28 R11: 7ffde000 R12: 81230a1f
>   R13: 54684dc0 R14: 0206 R15: 547dbc00
>? remove_vma+0x4f/0x60
>remove_vma+0x4f/0x60
>exit_mmap+0xd6/0x160
>mmput+0x4a/0x110
>do_exit+0x278/0xae0
>? syscall_trace_enter+0x1d3/0x2b0
>? handle_mm_fault+0xaa/0x1c0
>do_group_exit+0x3a/0xa0
>__x64_sys_exit_group+0x14/0x20
>do_syscall_64+0x42/0x100
>entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
> for a very long time given a large process.  This commit therefore adds
> a cond_resched() to this loop, providing RCU any needed quiescent states.
>
> Cc: Andrew Morton 
> Cc: 
> Signed-off-by: Paul E. McKenney 

We have had exactly the same change in our internal kernel since 2018. We
mostly observed the need_resched warnings on processes mapping
hugetlbfs.

Reviewed-by: Shakeel Butt 

> ---
>  mm/mmap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 59a4682..972f839 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> +   cond_resched();
> }
> vm_unacct_memory(nr_accounted);
>  }
> --
> 2.9.5
>


Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API

2020-06-19 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Obj_cgroup API provides an ability to account sub-page sized kernel
> objects, which potentially outlive the original memory cgroup.
>
> The top-level API consists of the following functions:
>   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
>   void obj_cgroup_get(struct obj_cgroup *objcg);
>   void obj_cgroup_put(struct obj_cgroup *objcg);
>
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
>
>   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
>   struct obj_cgroup *get_obj_cgroup_from_current(void);
>
> Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> reference counter. It substitutes a memory cgroup in places where
> it's necessary to charge a custom amount of bytes instead of pages.
>
> All charged memory rounded down to pages is charged to the
> corresponding memory cgroup using __memcg_kmem_charge().
>
> It implements reparenting: on memcg offlining it's getting reattached
> to the parent memory cgroup. Each online memory cgroup has an
> associated active object cgroup to handle new allocations and the list
> of all attached object cgroups. On offlining of a cgroup this list is
> reparented and for each object cgroup in the list the memcg pointer is
> swapped to the parent memory cgroup. It prevents long-living objects
> from pinning the original memory cgroup in memory.
>
> The implementation is based on byte-sized per-cpu stocks. A sub-page
> sized leftover is stored in an atomic field, which is a part of
> obj_cgroup object. So on cgroup offlining the leftover is automatically
> reparented.
>
> memcg->objcg is rcu protected.
> objcg->memcg is a raw pointer, which is always pointing at a memory
> cgroup, but can be atomically swapped to the parent memory cgroup. So
> the caller

What type of caller? The allocator?

> must ensure the lifetime of the cgroup, e.g. grab
> rcu_read_lock or css_set_lock.
>
> Suggested-by: Johannes Weiner 
> Signed-off-by: Roman Gushchin 
> ---
>  include/linux/memcontrol.h |  51 +++
>  mm/memcontrol.c| 288 -
>  2 files changed, 338 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 93dbc7f9d8b8..c69e66fe4f12 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>  #include 
>
>  struct mem_cgroup;
> +struct obj_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> @@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
> struct wb_completion done;  /* tracks in-flight foreign 
> writebacks */
>  };
>
> +/*
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
> + */
> +struct obj_cgroup {
> +   struct percpu_ref refcnt;
> +   struct mem_cgroup *memcg;
> +   atomic_t nr_charged_bytes;

So, we still charge the memory page counter in pages but keep the
remaining sub-page slack charge in nr_charged_bytes, right?

> +   union {
> +   struct list_head list;
> +   struct rcu_head rcu;
> +   };
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -301,6 +318,8 @@ struct mem_cgroup {
> int kmemcg_id;
> enum memcg_kmem_state kmem_state;
> struct list_head kmem_caches;
> +   struct obj_cgroup __rcu *objcg;
> +   struct list_head objcg_list;
>  #endif
>
[snip]
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> + struct mem_cgroup *parent)
> +{
> +   struct obj_cgroup *objcg, *iter;
> +
> +   objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> +
> +   spin_lock_irq(&css_set_lock);
> +
> +   /* Move active objcg to the parent's list */
> +   xchg(&objcg->memcg, parent);
> +   css_get(&parent->css);
> +   list_add(&objcg->list, &parent->objcg_list);

So, memcg->objcg_list will only ever contain the objcgs of offlined
descendants. I would recommend renaming objcg_list to clearly
show that. Maybe offlined_objcg_list or descendants_objcg_list or
something else.

> +
> +   /* Move already reparented objcgs to the parent's list */
> +   list_for_each_entry(iter, &memcg->objcg_list, list) {
> +           css_get(&parent->css);
> +           xchg(&iter->memcg, parent);
> +           css_put(&memcg->css);
> +   }
> +   list_splice(&memcg->objcg_list, &parent->objcg_list);
> +
> +   spin_unlock_irq(&css_set_lock);
> +
> +   percpu_ref_kill(&objcg->refcnt);
> +}
> +
>  /*
[snip]
>
> +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> +{
> +   struct obj_cgroup *objcg = NULL;
> +   

Re: [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages

2020-06-19 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Allocate and release memory to store obj_cgroup pointers for each
> non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> to the allocated space.
>
> To distinguish between obj_cgroups and memcg pointers in case
> when it's not obvious which one is used (as in page_cgroup_ino()),
> let's always set the lowest bit in the obj_cgroup case.
>

I think the commit message should talk about the potential overhead
(i.e. an extra pointer for each object) along with the justifications
(i.e. less internal fragmentation and potentially more savings than
the overhead).

> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 
> ---
>  include/linux/mm_types.h |  5 +++-
>  include/linux/slab_def.h |  6 +
>  include/linux/slub_def.h |  5 
>  mm/memcontrol.c  | 17 +++---
>  mm/slab.h| 49 
>  5 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 64ede5f150dc..0277fbab7c93 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -198,7 +198,10 @@ struct page {
> atomic_t _refcount;
>
>  #ifdef CONFIG_MEMCG
> -   struct mem_cgroup *mem_cgroup;
> +   union {
> +   struct mem_cgroup *mem_cgroup;
> +   struct obj_cgroup **obj_cgroups;
> +   };
>  #endif
>
> /*
> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> index abc7de77b988..ccda7b9669a5 100644
> --- a/include/linux/slab_def.h
> +++ b/include/linux/slab_def.h
> @@ -114,4 +114,10 @@ static inline unsigned int obj_to_index(const struct 
> kmem_cache *cache,
> return reciprocal_divide(offset, cache->reciprocal_buffer_size);
>  }
>
> +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> +const struct page *page)
> +{
> +   return cache->num;
> +}
> +
>  #endif /* _LINUX_SLAB_DEF_H */
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 30e91c83d401..f87302dcfe8c 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -198,4 +198,9 @@ static inline unsigned int obj_to_index(const struct 
> kmem_cache *cache,
> return __obj_to_index(cache, page_address(page), obj);
>  }
>
> +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> +const struct page *page)
> +{
> +   return page->objects;
> +}
>  #endif /* _LINUX_SLUB_DEF_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7ff66275966c..2020c7542aa1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -569,10 +569,21 @@ ino_t page_cgroup_ino(struct page *page)
> unsigned long ino = 0;
>
> rcu_read_lock();
> -   if (PageSlab(page) && !PageTail(page))
> +   if (PageSlab(page) && !PageTail(page)) {
> memcg = memcg_from_slab_page(page);
> -   else
> -   memcg = READ_ONCE(page->mem_cgroup);
> +   } else {
> +   memcg = page->mem_cgroup;
> +
> +   /*
> +* The lowest bit set means that memcg isn't a valid
> +* memcg pointer, but a obj_cgroups pointer.
> +* In this case the page is shared and doesn't belong
> +* to any specific memory cgroup.
> +*/
> +   if ((unsigned long) memcg & 0x1UL)
> +   memcg = NULL;
> +   }
> +
> while (memcg && !(memcg->css.flags & CSS_ONLINE))
> memcg = parent_mem_cgroup(memcg);
> if (memcg)
> diff --git a/mm/slab.h b/mm/slab.h
> index 8a574d9361c1..a1633ea15fbf 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -319,6 +319,18 @@ static inline struct kmem_cache *memcg_root_cache(struct 
> kmem_cache *s)
> return s->memcg_params.root_cache;
>  }
>
> +static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
> +{
> +   /*
> +* page->mem_cgroup and page->obj_cgroups are sharing the same
> +* space. To distinguish between them in case we don't know for sure
> +* that the page is a slab page (e.g. page_cgroup_ino()), let's
> +* always set the lowest bit of obj_cgroups.
> +*/
> +   return (struct obj_cgroup **)
> +   ((unsigned long)page->obj_cgroups & ~0x1UL);
> +}
> +
>  /*
>   * Expects a pointer to a slab page. Please note, that PageSlab() check
>   * isn't sufficient, as it returns true also for tail compound slab pages,
> @@ -406,6 +418,26 @@ static __always_inline void memcg_uncharge_slab(struct 
> page *page, int order,
> percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
>  }
>
> +static inline int memcg_alloc_page_obj_cgroups(struct page *page,
> +  struct kmem_cache *s, gfp_t 
> gfp)
> +{
> +   unsigned int objects = 

Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API

2020-06-19 Thread Shakeel Butt
On Fri, Jun 19, 2020 at 2:38 PM Roman Gushchin  wrote:
>
[snip]
> > > memcg->objcg is rcu protected.
> > > objcg->memcg is a raw pointer, which is always pointing at a memory
> > > cgroup, but can be atomically swapped to the parent memory cgroup. So
> > > the caller
> >
> > What type of caller? The allocator?
>
> Basically whoever uses the pointer. Is it better to s/caller/user?
>

Yes 'user' feels better.

> >
[...]
> >
> > The normal stock can go to 32*nr_cpus*PAGE_SIZE. I am wondering if
> > just PAGE_SIZE is too little for the obj stock.
>
> It works on top of the current stock of 32 pages, so it can grab these
> 32 pages without any atomic operations. And it should be easy to increase
> this limit if we see any benefits.
>
> Thank you for looking into the patchset!
>
> Andrew, can you please squash the following fix based on Shakeel's
> suggestions?
> Thanks!
>
> --

For the following squashed into the original patch:

Reviewed-by: Shakeel Butt 

>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7ed3af71a6fb..2499f78cf32d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -326,7 +326,7 @@ struct mem_cgroup {
> int kmemcg_id;
> enum memcg_kmem_state kmem_state;
> struct obj_cgroup __rcu *objcg;
> -   struct list_head objcg_list;
> +   struct list_head objcg_list; /* list of inherited objcgs */
>  #endif
>
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 70cd44b28db1..9f14b91700d9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2843,7 +2843,7 @@ __always_inline struct obj_cgroup 
> *get_obj_cgroup_from_current(void)
> struct obj_cgroup *objcg = NULL;
> struct mem_cgroup *memcg;
>
> -   if (unlikely(!current->mm))
> +   if (unlikely(!current->mm && !current->active_memcg))
> return NULL;
>
> rcu_read_lock();
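
The "!current->active_memcg" part matters for remote charging: a kernel
thread has no mm but can still charge allocations to a memcg. A minimal
illustration (not from the patch; it just assumes the existing
memalloc_use_memcg()/memalloc_unuse_memcg() helpers):

	memalloc_use_memcg(memcg);	/* sets current->active_memcg */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
	memalloc_unuse_memcg();		/* clears it again */

Without the additional check, get_obj_cgroup_from_current() would bail
out early for such a thread and the allocation would escape accounting.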


Re: [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects

2020-06-19 Thread Shakeel Butt
On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
>
> Store the obj_cgroup pointer in the corresponding place of
> page->obj_cgroups for each allocated non-root slab object.
> Make sure that each allocated object holds a reference to obj_cgroup.
>
> The objcg pointer is obtained by dereferencing memcg->objcg
> in memcg_kmem_get_cache() and passed from pre_alloc_hook to
> post_alloc_hook. Then, in case of successful allocation(s), it's
> stored in the page->obj_cgroups vector.
>
> The objcg obtaining part looks a bit bulky now, but it will be simplified
> by the next commits in the series.
>
> Signed-off-by: Roman Gushchin 
> Reviewed-by: Vlastimil Babka 

One nit below otherwise:

Reviewed-by: Shakeel Butt 

> ---
[snip]
> +static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
> + struct obj_cgroup *objcg,
> + size_t size, void **p)
> +{
> +   struct page *page;
> +   unsigned long off;
> +   size_t i;
> +
> +   for (i = 0; i < size; i++) {
> +   if (likely(p[i])) {
> +   page = virt_to_head_page(p[i]);
> +   off = obj_to_index(s, page, p[i]);
> +   obj_cgroup_get(objcg);
> +   page_obj_cgroups(page)[off] = objcg;
> +   }
> +   }
> +   obj_cgroup_put(objcg);

Nit: we get the objcg reference in memcg_kmem_get_cache(); wouldn't it
look cleaner to put that reference in memcg_kmem_put_cache() instead
of here? (Rough sketch after the quoted hunk below.)

> +   memcg_kmem_put_cache(s);
> +}
> +
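
Roughly what I mean (untested sketch; the objcg argument is new, and I'm
assuming the current memcg_kmem_put_cache() only drops the kmem_cache
reference):

	void memcg_kmem_put_cache(struct kmem_cache *s, struct obj_cgroup *objcg)
	{
		if (objcg)
			obj_cgroup_put(objcg);
		if (!is_root_cache(s))
			percpu_ref_put(&s->memcg_params.refcnt);
	}

That way the get/put pairs for the cache and for the objcg stay symmetric.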


Re: [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages

2020-06-19 Thread Shakeel Butt
On Fri, Jun 19, 2020 at 5:25 PM Roman Gushchin  wrote:
>
> On Fri, Jun 19, 2020 at 09:36:16AM -0700, Shakeel Butt wrote:
> > On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin  wrote:
> > >
> > > Allocate and release memory to store obj_cgroup pointers for each
> > > non-root slab page. Reuse page->mem_cgroup pointer to store a pointer
> > > to the allocated space.
> > >
> > > To distinguish between obj_cgroups and memcg pointers in case
> > > when it's not obvious which one is used (as in page_cgroup_ino()),
> > > let's always set the lowest bit in the obj_cgroup case.
> > >
> >
> > I think the commit message should talk about the potential overhead
> > (i.e. an extra pointer for each object) along with the justifications
> > (i.e. less internal fragmentation and potentially more savings than
> > the overhead).
>
> How about adding the following chunk? I don't like forward links in
> commit messages, so maybe putting it into the cover letter?
>
> This commit temporarily increases the memory footprint of the kernel memory
> accounting. To store obj_cgroup pointers we'll need a place for an
> objcg pointer for each allocated object. However, the following patches
> in the series will enable sharing of slab pages between memory cgroups,
> which will dramatically increase the total slab utilization. And the final
> memory footprint will be significantly smaller than before.
>

This looks good to me.
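
Maybe also worth putting rough numbers on it somewhere. Back-of-the-envelope
(my numbers: assuming 8-byte pointers and 4K pages, ignoring slab metadata):

	kmalloc-64:  64 objs/page * 8 bytes = 512 bytes/page (~12.5% extra)
	kmalloc-512:  8 objs/page * 8 bytes =  64 bytes/page  (~1.5% extra)

And that worst case only lasts until the later patches let slab pages be
shared between memcgs.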

> >
> > > Signed-off-by: Roman Gushchin 
> > > Reviewed-by: Vlastimil Babka 
> > > ---
> > >  include/linux/mm_types.h |  5 +++-
> > >  include/linux/slab_def.h |  6 +
> > >  include/linux/slub_def.h |  5 
> > >  mm/memcontrol.c  | 17 +++---
> > >  mm/slab.h| 49 
> > >  5 files changed, 78 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 64ede5f150dc..0277fbab7c93 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -198,7 +198,10 @@ struct page {
> > > atomic_t _refcount;
> > >
> > >  #ifdef CONFIG_MEMCG
> > > -   struct mem_cgroup *mem_cgroup;
> > > +   union {
> > > +   struct mem_cgroup *mem_cgroup;
> > > +   struct obj_cgroup **obj_cgroups;
> > > +   };
> > >  #endif
> > >
> > > /*
> > > diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> > > index abc7de77b988..ccda7b9669a5 100644
> > > --- a/include/linux/slab_def.h
> > > +++ b/include/linux/slab_def.h
> > > @@ -114,4 +114,10 @@ static inline unsigned int obj_to_index(const struct 
> > > kmem_cache *cache,
> > > return reciprocal_divide(offset, cache->reciprocal_buffer_size);
> > >  }
> > >
> > > +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> > > +const struct page *page)
> > > +{
> > > +   return cache->num;
> > > +}
> > > +
> > >  #endif /* _LINUX_SLAB_DEF_H */
> > > diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> > > index 30e91c83d401..f87302dcfe8c 100644
> > > --- a/include/linux/slub_def.h
> > > +++ b/include/linux/slub_def.h
> > > @@ -198,4 +198,9 @@ static inline unsigned int obj_to_index(const struct 
> > > kmem_cache *cache,
> > > return __obj_to_index(cache, page_address(page), obj);
> > >  }
> > >
> > > +static inline int objs_per_slab_page(const struct kmem_cache *cache,
> > > +const struct page *page)
> > > +{
> > > +   return page->objects;
> > > +}
> > >  #endif /* _LINUX_SLUB_DEF_H */
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 7ff66275966c..2020c7542aa1 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -569,10 +569,21 @@ ino_t page_cgroup_ino(struct page *page)
> > > unsigned long ino = 0;
> > >
> > > rcu_read_lock();
> > > -   if (PageSlab(page) && !PageTail(page))
> > > +   if (PageSlab(page) && !PageTail(page)) {
> > > memcg = memcg_from_slab_page(page);
> > > -   else
> > > -   memcg = READ_ONCE(page->mem_cgroup);
> > > +   } else
