Re: [syzbot] [mm?] possible deadlock in __mmap_lock_do_trace_start_locking

2024-06-16 Thread Waiman Long

On 6/16/24 10:05, syzbot wrote:

syzbot has bisected this issue to:

commit 21c38a3bd4ee3fb7337d013a638302fb5e5f9dc2
Author: Jesper Dangaard Brouer 
Date:   Wed May 1 14:04:11 2024 +

 cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1669526198
start commit:   36534d3c5453 tcp: use signed arithmetic in tcp_rtx_probe0_..
git tree:   bpf
final oops: https://syzkaller.appspot.com/x/report.txt?x=1569526198
console output: https://syzkaller.appspot.com/x/log.txt?x=1169526198
kernel config:  https://syzkaller.appspot.com/x/.config?x=333ebe38d43c42e2
dashboard link: https://syzkaller.appspot.com/bug?extid=6ff90931779bcdfc840c
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1585acfa98
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17bdb7ee98

Reported-by: syzbot+6ff90931779bcdfc8...@syzkaller.appspotmail.com
Fixes: 21c38a3bd4ee ("cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection


+static __always_inline
+unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
+                                     struct cgroup *cgrp, const bool fast_path)
+{
+   unsigned long flags;
+   bool contended;
+
+   /*
+    * The _irqsave() is needed because cgroup_rstat_lock is
+    * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+    * this lock with the _irq() suffix only disables interrupts on
+    * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+    * interrupts on both configurations. The _irqsave() ensures
+    * that interrupts are always disabled and later restored.
+    */
+   contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
+   if (contended) {
+           if (fast_path)
+                   trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
+           else
+                   trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+
+           raw_spin_lock_irqsave(cpu_lock, flags);
+   }

I believe the problem may be caused by the fact that 
trace_cgroup_rstat_cpu_lock_contended*() can be called with IRQs enabled. 
I had suggested before that IRQs should be disabled first, before doing 
any trace operation. See

https://lore.kernel.org/linux-mm/203fdb35-f4cf-4754-9709-3c024eeca...@redhat.com/

Doing so may resolve this possible deadlock; a sketch of the suggested 
reordering is below.
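Roughly (and only as an illustration of the suggestion, not the fix that was
actually merged; the tail of the helper is elided), the helper could disable
interrupts before any tracepoint is reached:

static __always_inline
unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
                                     struct cgroup *cgrp, const bool fast_path)
{
        unsigned long flags;
        bool contended;

        /* Disable and save IRQs first so the tracepoints below never run with IRQs on. */
        local_irq_save(flags);
        contended = !raw_spin_trylock(cpu_lock);
        if (contended) {
                if (fast_path)
                        trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
                else
                        trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);

                raw_spin_lock(cpu_lock);
        }
        /* ... remaining tracepoints and the rest of the helper unchanged ... */
        return flags;
}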

Cheers,
Longman




Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-03 Thread Waiman Long



On 5/3/24 17:10, David Laight wrote:

From: Waiman Long

Sent: 03 May 2024 17:00

...

David,

Could you respin the series based on the latest upstream code?

I've just reapplied the patches to 'master' and they all apply
cleanly and diffing the new patches to the old ones gives no differences.
So I think they should still apply.

Were you seeing a specific problem?

I don't remember any suggested changes either.
(Apart from a very local variable I used to keep a patch isolated.)


No, I just want to make sure that your patches will still apply. Anyway, 
it will be easier for the maintainer to merge your remaining patches if 
you can send out a new version even if they are almost the same as the 
old ones.


Thanks,
Longman




Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-03 Thread Waiman Long



On 12/31/23 23:14, Waiman Long wrote:


On 12/31/23 16:55, David Laight wrote:

per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
This requires the cpu number be 64bit.
However the value in osq_lock() comes from a 32bit xchg() and there
isn't a way of telling gcc the high bits are zero (they are) so
there will always be an instruction to clear the high bits.

The cpu number is also offset by one (to make the initialiser 0).
It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).


Converting the cpu number to 32bit unsigned prior to the decrement means
that gcc knows the decrement has set the high bits to zero and doesn't
add a register-register move (or cltq) to zero/sign extend the value.

Not massive but saves two instructions.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 35bb99e96697..37a4fa872989 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
  return cpu_nr + 1;
  }
-static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
  {
-    int cpu_nr = encoded_cpu_val - 1;
-
-    return per_cpu_ptr(&osq_node, cpu_nr);
+    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
  }
    /*


You really like micro-optimization.

Anyway,

Reviewed-by: Waiman Long 


David,

Could you respin the series based on the latest upstream code?

Thanks,
Longman




Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Waiman Long

On 3/20/24 12:20, Steven Rostedt wrote:

From: Steven Rostedt (Google) 

I'm debugging some latency issues on a Chromebook and the preemptirqsoff
tracer hit this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
# --------------------------------------------------------------------
# latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#    -----------------
#    | task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
#    -----------------
#  => started at: rwsem_optimistic_spin
#  => ended at:   rwsem_optimistic_spin
#
#
#                    _------=> CPU#
#                   / _-----=> irqs-off
#                  | / _----=> need-resched
#                  || / _---=> hardirq/softirq
#                  ||| / _--=> preempt-depth
#                  |||| / _-=> migrate-disable
#                  ||||| /     delay
#  cmd     pid     |||||| time  |   caller
#     \   /        ||||||  \    |    /
   <...>-1       1...1.    0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194
   <...>-1       1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194
   <...>-1       1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194
   <...>-1       1.N.1. 7824us : <stack trace>
  => rwsem_optimistic_spin+0x140/0x194
  => rwsem_down_write_slowpath+0xc9/0x3fe
  => copy_process+0xd08/0x1b8a
  => kernel_clone+0x94/0x256
  => __x64_sys_clone+0x7a/0x9a
  => do_syscall_64+0x51/0xa1
  => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that
spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules.

Hence, it's hiding real issues, because it can spin for a very long time
and this is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.

Signed-off-by: Steven Rostedt (Google) 


I have no objection to that. However, there is now the overhead of two 
function calls in each iteration when either CONFIG_IRQSOFF_TRACER or 
CONFIG_PREEMPT_TRACER is on. Is it possible to do it with just one 
function call? IOW, make restart_critical_timings() a real function.


Cheers,
Longman



diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do {   \
  # define start_critical_timings() do { } while (0)
  #endif
  
+/* Used in spins that check need_resched() with preemption disabled */

+static inline void restart_critical_timings(void)
+{
+   stop_critical_timings();
+   start_critical_timings();
+}
+
  #ifdef CONFIG_DEBUG_IRQFLAGS
  extern void warn_bogus_irq_restore(void);
  #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #ifndef CONFIG_PREEMPT_RT

@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
  
+		restart_critical_timings();

if (need_resched() || !owner_on_cpu(owner)) {
state = OWNER_NONSPINNABLE;
break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 * a writer, need_resched() check needs to be done here.
 */
if (owner_state != OWNER_WRITER) {
+   restart_critical_timings();
if (need_resched())
break;
if (rt_task(current) &&






Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2023-12-31 Thread Waiman Long



On 12/31/23 16:55, David Laight wrote:

per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
This requires the cpu number be 64bit.
However the value in osq_lock() comes from a 32bit xchg() and there
isn't a way of telling gcc the high bits are zero (they are) so
there will always be an instruction to clear the high bits.

The cpu number is also offset by one (to make the initialiser 0).
It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).

Converting the cpu number to 32bit unsigned prior to the decrement means
that gcc knows the decrement has set the high bits to zero and doesn't
add a register-register move (or cltq) to zero/sign extend the value.

Not massive but saves two instructions.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 35bb99e96697..37a4fa872989 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
  }
  
-static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)

+static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
  {
-   int cpu_nr = encoded_cpu_val - 1;
-
-   return per_cpu_ptr(&osq_node, cpu_nr);
+   return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
  }
  
  /*


You really like micro-optimization.

Anyway,

Reviewed-by: Waiman Long 




Re: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.

2023-12-31 Thread Waiman Long



On 12/31/23 16:54, David Laight wrote:

When osq_lock() returns false or osq_unlock() returns, static
analysis shows that node->next should always be NULL.
This means that it isn't necessary to explicitly set it to NULL
prior to atomic_xchg(&lock->tail, curr) on entry to osq_lock().

Just in case there is a non-obvious race condition that can leave it
non-NULL, check with WARN_ON_ONCE() and set it to NULL if needed.
Note that without this check the fast path (adding at the list head)
doesn't need to access the per-cpu osq_node at all.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 27324b509f68..35bb99e96697 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  
  bool osq_lock(struct optimistic_spin_queue *lock)

  {
-   struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
-   struct optimistic_spin_node *prev, *next;
+   struct optimistic_spin_node *node, *prev, *next;
int curr = encode_cpu(smp_processor_id());
int prev_cpu;
  
-	node->next = NULL;

+   /*
+* node->next should be NULL on entry.
+* Check just in case there is a race somewhere.
+    * Note that this is probably an unnecessary cache miss in the fast path.
+*/
+   if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
+   raw_cpu_write(osq_node.next, NULL);
  
  	/*

 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;
  
-	node->prev_cpu = prev_cpu;

+   node = this_cpu_ptr(&osq_node);
prev = decode_cpu(prev_cpu);
+   node->prev_cpu = prev_cpu;
node->locked = 0;
  
  	/*

Reviewed-by: Waiman Long 




Re: [PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.

2023-12-31 Thread Waiman Long

On 12/31/23 16:54, David Laight wrote:

node->prev is only used to update 'prev' in the unlikely case
of concurrent unqueues.
This can be replaced by a check for node->prev_cpu changing
and then calling decode_cpu() to get the changed 'prev' pointer.

node->cpu (or more particularly prev->cpu) is only used for the
osq_wait_next() call in the unqueue path.
Normally this is exactly the value that the initial xchg() read
from lock->tail (used to obtain 'prev'), but can get updated
by concurrent unqueues.

Both the 'prev' and 'cpu' members of optimistic_spin_node are
now unused and can be deleted.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 31 +--
  1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index eb8a6dfdb79d..27324b509f68 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,9 +13,8 @@
   */
  
  struct optimistic_spin_node {

-   struct optimistic_spin_node *next, *prev;
+   struct optimistic_spin_node *next;
int locked;/* 1 if lock acquired */
-   int cpu;   /* encoded CPU # + 1 value */
int prev_cpu;  /* encoded CPU # + 1 value */
  };
  
@@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)

struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
-   int old;
+   int prev_cpu;
  
  	node->next = NULL;

-   node->cpu = curr;
  
  	/*

 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * the node fields we just initialised) semantics when updating
 * the lock tail.
 */
-   old = atomic_xchg(&lock->tail, curr);
-   if (old == OSQ_UNLOCKED_VAL)
+   prev_cpu = atomic_xchg(&lock->tail, curr);
+   if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;
  
-	node->prev_cpu = old;

-   prev = decode_cpu(old);
-   node->prev = prev;
+   node->prev_cpu = prev_cpu;
+   prev = decode_cpu(prev_cpu);
node->locked = 0;
  
  	/*

@@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)
  
  		/*

 * Or we race against a concurrent unqueue()'s step-B, in which
-* case its step-C will write us a new @node->prev pointer.
+* case its step-C will write us a new @node->prev_cpu value.
 */
-   prev = READ_ONCE(node->prev);
+   {
+   int new_prev_cpu = READ_ONCE(node->prev_cpu);
+
+   if (new_prev_cpu == prev_cpu)
+   continue;
+   prev_cpu = new_prev_cpu;
+   prev = decode_cpu(prev_cpu);
+   }


Just a minor nit. It is not that common in the kernel to add another 
nesting level just to reduce the scope of the new_prev_cpu auto variable.
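For comparison, a sketch of the flatter layout that is more common (declaring
the temporary together with the other locals at the top of the loop body):

	int new_prev_cpu;	/* declared with the other locals */

	/* ... spinning body of the loop ... */

	new_prev_cpu = READ_ONCE(node->prev_cpu);
	if (new_prev_cpu == prev_cpu)
		continue;
	prev_cpu = new_prev_cpu;
	prev = decode_cpu(prev_cpu);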


Anyway,

Reviewed-by: Waiman Long 




Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.

2023-12-31 Thread Waiman Long



On 12/31/23 16:52, David Laight wrote:

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.

Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 16 +++-
  1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index e0bc74d85a76..eb8a6dfdb79d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
  
  struct optimistic_spin_node {

struct optimistic_spin_node *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
+   int locked;/* 1 if lock acquired */
+   int cpu;   /* encoded CPU # + 1 value */
+   int prev_cpu;  /* encoded CPU # + 1 value */
  };
  
  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
  }
  
-static inline int node_cpu(struct optimistic_spin_node *node)

-{
-   return node->cpu - 1;
-}
-
  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
  {
int cpu_nr = encoded_cpu_val - 1;
@@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
  
-	node->locked = 0;

+   node->prev_cpu = old;
prev = decode_cpu(old);
node->prev = prev;
+   node->locked = 0;
  
  	/*

 * osq_lock()   unqueue
@@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * polling, be careful.
 */
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1)))
return true;
  
  	/* unqueue */

@@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * it will wait in Step-A.
 */
  
+	WRITE_ONCE(next->prev_cpu, prev->cpu);

WRITE_ONCE(next->prev, prev);
    WRITE_ONCE(prev->next, next);
  


Reviewed-by: Waiman Long 





Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.

2023-12-31 Thread Waiman Long



On 12/31/23 16:51, David Laight wrote:

Since node->locked cannot be set before the assignment to prev->next
it is safe to clear it in the slow path.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..e0bc74d85a76 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
int curr = encode_cpu(smp_processor_id());
int old;
  
-	node->locked = 0;

node->next = NULL;
node->cpu = curr;
  
@@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)

if (old == OSQ_UNLOCKED_VAL)
return true;
  
+	node->locked = 0;

prev = decode_cpu(old);
node->prev = prev;
  


Reviewed-by: Waiman Long 




Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-30 Thread Waiman Long



On 12/30/23 06:35, David Laight wrote:

From: Ingo Molnar

Sent: 30 December 2023 11:09


* Waiman Long  wrote:


On 12/29/23 15:57, David Laight wrote:

this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.

Signed-off-by: David Laight 
---
   kernel/locking/osq_lock.c | 14 +-
   1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
*/
   struct optimistic_spin_node {
-   struct optimistic_spin_node *next, *prev;
+   struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
   };
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
   bool osq_lock(struct optimistic_spin_queue *lock)
   {
-   struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+   struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
    0x0029 <+25>:    mov    %rcx,%rdx
    0x002c <+28>:    add %gs:0x0(%rip),%rdx    # 0x34


Which looks pretty optimized to me. An older compiler may generate more
complex code. However, I do have some doubt as to the benefit of this patch
at the expense of making the code a bit more complex.

My changed code is one instruction shorter!
   18:   65 48 8b 15 00 00 00mov%gs:0x0(%rip),%rdx# 20 

   1f:   00
 1c: R_X86_64_PC32   
.data..percpu..shared_aligned-0x4
However it might have one less cache line miss.


GCC-11 is plenty of a look-back window in terms of compiler efficiency:
latest enterprise distros use GCC-11 or newer, while recent desktop
distros use GCC-13. Anything older won't matter, because no major
distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.

I'm seeing this_cpu_ptr() converted to per_cpu_ptr(, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
   raw_cpu_read(per_cpu_data_base) + offset_to(xxx)

So the old code generates (after the prologue):
   10:   49 89 fdmov%rdi,%r13
   13:   49 c7 c4 00 00 00 00mov$0x0,%r12
 16: R_X86_64_32S.data..percpu..shared_aligned
   1a:   e8 00 00 00 00  callq  1f 
 1b: R_X86_64_PC32   debug_smp_processor_id-0x4
   1f:   89 c0   mov%eax,%eax
   21:   48 8b 1c c5 00 00 00mov0x0(,%rax,8),%rbx
   28:   00
 25: R_X86_64_32S__per_cpu_offset
   29:   e8 00 00 00 00  callq  2e 
 2a: R_X86_64_PC32   debug_smp_processor_id-0x4
   2e:   4c 01 e3add%r12,%rbx
   31:   83 c0 01add$0x1,%eax
   34:   c7 43 10 00 00 00 00movl   $0x0,0x10(%rbx)
   3b:   48 c7 03 00 00 00 00movq   $0x0,(%rbx)
   42:   89 43 14mov%eax,0x14(%rbx)
   45:   41 87 45 00 xchg   %eax,0x0(%r13)

I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.


I have looked up the definition of this_cpu_ptr() and got the following 
results:


this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
 * Compared to the generic __my_cpu_offset version, the following
 * saves one instruction and avoids clobbering a temp register.
 */
#define arch_raw_cpu_ptr(ptr)   \
({  \
    unsigned long tcp_ptr__;    \
    asm ("add " __percpu_arg(1) ", %0"  \
 : "=r" (tcp_ptr__) \
 : "m" (this_cpu_off), "0" (ptr));  \
    (typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
})

The presence of debug_smp_processor_id in your compiled code is likely 
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.


#ifdef CONFIG_DEBUG_PREEMPT
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config and so I only get 2 
instructions for this_cpu_ptr(). We are not going to optimize the code 
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.


Cheers,
Longman




Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

2023-12-30 Thread Waiman Long



On 12/30/23 17:39, David Laight wrote:

From: Linus Torvalds

Sent: 30 December 2023 19:41

On Fri, 29 Dec 2023 at 12:52, David Laight  wrote:

David Laight (5):
   Move the definition of optimistic_spin_node into osf_lock.c
   Clarify osq_wait_next()

I took these two as preparatory independent patches, with that
osq_wait_next() clarification split into two.

I also did the renaming that Waiman asked for.

Ok, I'll try to remove them from any respin.
(Or at least put them first!)


The first 2 patches have been included in Linus' master branch. So you 
should just remove them from a respin.


Cheers,
Longman





Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

2023-12-30 Thread Waiman Long

On 12/29/23 22:13, Waiman Long wrote:


On 12/29/23 15:58, David Laight wrote:

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
   cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
    struct optimistic_spin_node {
  struct optimistic_spin_node *self, *next, *prev;
-    int locked; /* 1 if lock acquired */
-    int cpu; /* encoded CPU # + 1 value */
+    int locked;    /* 1 if lock acquired */
+    int cpu;   /* encoded CPU # + 1 value */
+    int prev_cpu;  /* actual CPU # for vpcu_is_preempted() */
  };
    static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, 
osq_node);

@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
  return cpu_nr + 1;
  }
  -static inline int node_cpu(struct optimistic_spin_node *node)
-{
-    return node->cpu - 1;
-}
-
  static inline struct optimistic_spin_node *decode_cpu(int 
encoded_cpu_val)

  {
  int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
  if (old == OSQ_UNLOCKED_VAL)
  return true;
  +    node->prev_cpu = old - 1;
  prev = decode_cpu(old);
  node->prev = prev;
  node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
   * polling, be careful.
   */
  if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
-  vcpu_is_preempted(node_cpu(node->prev))))
+  vcpu_is_preempted(node->prev_cpu)))


On second thought, I believe it is more correct to use READ_ONCE() to 
access "node->prev_cpu" as this field is subject to change by a 
WRITE_ONCE().
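A sketch of what that check could then look like:

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(READ_ONCE(node->prev_cpu))))
		return true;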


Cheers,
Longman





Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

2023-12-29 Thread Waiman Long



On 12/29/23 17:11, David Laight wrote:

osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight 
---

Re-send without the 'RE:' on the subject line.
  kernel/locking/osq_lock.c | 13 ++---
  1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *prev)
  {
struct optimistic_spin_node *next = NULL;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = node->cpu;
int old;
  
  	/*

@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
  {
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
-   int curr = encode_cpu(smp_processor_id());
int old;
  
-	node->locked = 0;

-   node->next = NULL;


I have some concern about not clearing node->next at the beginning. I 
know that next is supposed to be cleared at the end. I do have some 
worry that there may exist a race condition that leaves next not cleared 
before it is used again. So you either have to prove that such a race 
does not exist, or keep initializing it to NULL at the beginning as is 
done today.


Cheers,
Longman


-   node->cpu = curr;
+   if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+   node->cpu = encode_cpu(smp_processor_id());
  
  	/*

 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * the node fields we just initialised) semantics when updating
 * the lock tail.
 */
-   old = atomic_xchg(&lock->tail, curr);
+   old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
  
  	prev = decode_cpu(old);

node->prev = prev;
+   node->locked = 0;
  
  	/*

 * osq_lock()   unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
  void osq_unlock(struct optimistic_spin_queue *lock)
  {
struct optimistic_spin_node *node, *next;
-   int curr = encode_cpu(smp_processor_id());
+   int curr = raw_cpu_read(osq_node.cpu);
  
  	/*

 * Fast path for the uncontended case.





Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

2023-12-29 Thread Waiman Long



On 12/29/23 15:58, David Laight wrote:

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
   cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
  
  struct optimistic_spin_node {

struct optimistic_spin_node *self, *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
+   int locked;/* 1 if lock acquired */
+   int cpu;   /* encoded CPU # + 1 value */
+   int prev_cpu;  /* actual CPU # for vpcu_is_preempted() */
  };
  
  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
  }
  
-static inline int node_cpu(struct optimistic_spin_node *node)

-{
-   return node->cpu - 1;
-}
-
  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
  {
int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
  
+	node->prev_cpu = old - 1;

prev = decode_cpu(old);
node->prev = prev;
node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * polling, be careful.
 */
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(node->prev_cpu)))
return true;
  
  	/* unqueue */

@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * it will wait in Step-A.
 */
  
+	WRITE_ONCE(next->prev_cpu, prev->cpu - 1);

WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);

Reviewed-by: Waiman Long 

  





Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-29 Thread Waiman Long

On 12/29/23 15:57, David Laight wrote:

this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 14 +-
  1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
   */
  
  struct optimistic_spin_node {

-   struct optimistic_spin_node *next, *prev;
+   struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
  };
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
  
  bool osq_lock(struct optimistic_spin_queue *lock)

  {
-   struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+   struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);


My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
   0x0029 <+25>:    mov    %rcx,%rdx
   0x002c <+28>:    add %gs:0x0(%rip),%rdx    # 0x34 



Which looks pretty optimized to me. An older compiler may generate 
more complex code. However, I do have some doubt as to the benefit of 
this patch at the expense of making the code a bit more complex.


Cheers,
Longman




Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

2023-12-29 Thread Waiman Long



On 12/29/23 15:56, David Laight wrote:

osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Also directly return NULL or 'next' instead of breaking the loop.

Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight 
---
  kernel/locking/osq_lock.c | 23 ++-
  1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
  static inline struct optimistic_spin_node *
  osq_wait_next(struct optimistic_spin_queue *lock,
  struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ int old)


Make the last argument name more descriptive, like "old_cpu" as the 
"int" type does not provide enough context to allow people to guess what 
"old" may be.


Cheers,
Longman





Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

2023-12-29 Thread Waiman Long

On 12/29/23 15:53, David Laight wrote:

struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight 
---
  include/linux/osq_lock.h  | 5 -
  kernel/locking/osq_lock.c | 7 +++
  2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
   * An MCS like lock especially tailored for optimistic spinning for sleeping
   * lock implementations (mutex, rwsem, etc).
   */
-struct optimistic_spin_node {
-   struct optimistic_spin_node *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
-};
  
  struct optimistic_spin_queue {

/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
   * called from interrupt context and we have preemption disabled while
   * spinning.
   */
+
+struct optimistic_spin_node {
+   struct optimistic_spin_node *next, *prev;
+   int locked; /* 1 if lock acquired */
+   int cpu; /* encoded CPU # + 1 value */
+};
+
  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
  
  /*


Please correct the patch title "osf_lock.c" => "osq_lock.c".

After the fix, you can add

Acked-by: Waiman Long 




Re: [PATCH v2 00/12] device-core: Enable device_lock() lockdep validation

2022-04-13 Thread Waiman Long

On 4/13/22 02:01, Dan Williams wrote:

Changes since v1 [1]:
- Improve the clarity of the cover letter and changelogs of the
   major patches (Patch2 and Patch12) (Pierre, Kevin, and Dave)
- Fix device_lock_interruptible() false negative deadlock detection
   (Kevin)
- Fix off-by-one error in the device_set_lock_class() enable case (Kevin)
- Spelling fixes in Patch2 changelog (Pierre)
- Compilation fixes when both CONFIG_CXL_BUS=n and
   CONFIG_LIBNVDIMM=n. (0day robot)

[1]: 
https://lore.kernel.org/all/164610292916.2682974.12924748003366352335.st...@dwillia2-desk3.amr.corp.intel.com/

---

The device_lock() is why the lockdep_set_novalidate_class() API exists.
The lock is taken in too many disparate contexts, and lockdep by design
assumes that all device_lock() acquisitions are identical. The lack of
lockdep coverage leads to deadlock scenarios landing upstream. To
mitigate that problem the lockdep_mutex was added [2].

The lockdep_mutex lets a subsystem mirror device_lock() acquisitions
without lockdep_set_novalidate_class() to gain some limited lockdep
coverage. The mirroring approach is limited to taking the device_lock()
after-the-fact in a subsystem's 'struct bus_type' operations and fails
to cover device_lock() acquisition in the driver-core. It also can only
track the needs of one subsystem at a time so, for example the kernel
needs to be recompiled between CONFIG_PROVE_NVDIMM_LOCKING and
CONFIG_PROVE_CXL_LOCKING depending on which subsystem is being
regression tested. Obviously that also means that intra-subsystem
locking dependencies can not be validated.


Instead of using a fake lockdep_mutex, maybe you can just use a unique 
lockdep key for each subsystem and call lockdep_set_class() in 
device_initialize() if such a key is present, or 
lockdep_set_novalidate_class() otherwise. The unique key can be passed 
either as a parameter to device_initialize() or as part of the device 
structure. It is certainly less cumbersome than having a fake 
lockdep_mutex array in the device structure.
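As a rough illustration only (the subsystem hook and key names below are
hypothetical, not an existing API), the idea could look something like this:

#include <linux/device.h>
#include <linux/lockdep.h>

/* One static lockdep key per subsystem, e.g. for the CXL bus. */
static struct lock_class_key cxl_device_key;

static void cxl_device_init(struct device *dev)
{
	device_initialize(dev);
	/*
	 * Re-key the real device_lock() mutex so lockdep can validate it,
	 * instead of mirroring acquisitions with a separate lockdep_mutex
	 * or falling back to lockdep_set_novalidate_class().
	 */
	lockdep_set_class(&dev->mutex, &cxl_device_key);
}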


Cheers,
Longman




[PATCH-next v5 2/4] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-20 Thread Waiman Long
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reduce it, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change. Caching the vmstat data in the per-cpu stock eliminates two
writes to non-hot cachelines for memcg specific as well as memcg-lruvecs
specific vmstat data by a write to a hot local stock cacheline.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 20% (634400 out of 3243830)
of the time when mod_objcg_state() is called leads to an actual call
to __mod_objcg_state() after initial boot. When doing parallel kernel
build, the figure was about 17% (24329265 out of 142512465). So caching
the vmstat data reduces the number of calls to __mod_objcg_state()
by more than 80%.

Signed-off-by: Waiman Long 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 86 +++--
 1 file changed, 83 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cd7187a017c..292b4783b1a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -782,8 +782,9 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item 
idx, int val)
rcu_read_unlock();
 }
 
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
-enum node_stat_item idx, int nr)
+static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
+struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)
 {
struct mem_cgroup *memcg;
struct lruvec *lruvec;
@@ -791,7 +792,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
pglist_data *pgdat,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
+   __mod_memcg_lruvec_state(lruvec, idx, nr);
rcu_read_unlock();
 }
 
@@ -2059,7 +2060,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int nr_slab_reclaimable_b;
+   int nr_slab_unreclaimable_b;
 #endif
 
struct work_struct work;
@@ -3008,6 +3012,63 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
obj_cgroup_put(objcg);
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)
+{
+   struct memcg_stock_pcp *stock;
+   unsigned long flags;
+   int *bytes;
+
+   local_irq_save(flags);
+   stock = this_cpu_ptr(&memcg_stock);
+
+   /*
+* Save vmstat data in stock and skip vmstat array update unless
+* accumulating over a page of vmstat data or when pgdat or idx
+* changes.
+*/
+   if (stock->cached_objcg != objcg) {
+   drain_obj_stock(stock);
+   obj_cgroup_get(objcg);
+   stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
+   ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
+   stock->cached_objcg = objcg;
+   stock->cached_pgdat = pgdat;
+   } else if (stock->cached_pgdat != pgdat) {
+   /* Flush the existing cached vmstat data */
+   if (stock->nr_slab_reclaimable_b) {
+   mod_objcg_mlstate(objcg, pgdat, NR_SLAB_RECLAIMABLE_B,
+ stock->nr_slab_reclaimable_b);
+   stock->nr_slab_reclaimable_b = 0;
+   }
+   if (stock->nr_slab_unreclaimable_b) {
+   mod_objcg_mlstate(objcg, pgdat, NR_SLAB_UNRECLAIMABLE_B,
+ stock->nr_slab_unreclaimable_b);
+   stock->nr_slab_unreclaimable_b = 0;
+   }
+   stock->cached_pgdat = pgdat;
+   }
+
+   bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? &stock->nr_slab_reclaimable_b
+  : &stock->nr_slab_unreclaimable_b;
+   if (!*bytes) {
+   *bytes = nr;
+   nr = 0;
+   } else {
+   *bytes += nr;
+   if (abs(*bytes) > PAGE_SIZE) {
+   nr = *bytes;
+   *bytes = 0;
+   } else {
+   nr = 0;
+   }
+   }
+   if (nr)
+   mod_objcg_mlstate(objcg,

[PATCH-next v5 3/4] mm/memcg: Improve refill_obj_stock() performance

2021-04-20 Thread Waiman Long
There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
be used again which leads to another call to drain_obj_stock() and
obj_cgroup_get() as well as atomically retrieve the available byte from
obj_cgroup. That is costly. Instead, we should just uncharge the excess
pages, reduce the stock bytes and be done with it. The drain_obj_stock()
function should only be called when obj_cgroup changes.

Secondly, when charging an object of size not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock()
to uncharge 1 page. To avoid the additional uncharge in this case,
a new overfill flag is added to refill_obj_stock() which will be set
when called from obj_cgroup_charge().

A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core
96-thread x86-64 system with 96 testing threads were run.  Before this
patch, the total number of kilo kmalloc+kfree operations done for a 4k
large object by all the testing threads per second were 4,304 kops/s
(cgroup v1) and 8,478 kops/s (cgroup v2). After applying this patch, the
number were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively. This
represents a performance improvement of 1.10X (cgroup v1) and 49.3X
(cgroup v2).

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 292b4783b1a7..2f87d0b05092 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3153,10 +3153,12 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+bool overfill)
 {
struct memcg_stock_pcp *stock;
unsigned long flags;
+   unsigned int nr_pages = 0;
 
local_irq_save(flags);
 
@@ -3165,14 +3167,20 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
drain_obj_stock(stock);
obj_cgroup_get(objcg);
stock->cached_objcg = objcg;
-   stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
+   stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
+   ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
}
stock->nr_bytes += nr_bytes;
 
-   if (stock->nr_bytes > PAGE_SIZE)
-   drain_obj_stock(stock);
+   if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+   nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+   stock->nr_bytes &= (PAGE_SIZE - 1);
+   }
 
local_irq_restore(flags);
+
+   if (nr_pages)
+   obj_cgroup_uncharge_pages(objcg, nr_pages);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3201,14 +3209,14 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t 
gfp, size_t size)
 
ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages);
if (!ret && nr_bytes)
-   refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+   refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
 
return ret;
 }
 
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-   refill_obj_stock(objcg, size);
+   refill_obj_stock(objcg, size, false);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1



[PATCH-next v5 0/4] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-20 Thread Waiman Long
...contended cacheline in the shared page counter &memcg->kmem. So the improvement
is only modest.

Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
eliminating the irq_disable/irq_enable overhead seems to aggravate the
cacheline contention.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/


Waiman Long (4):
  mm/memcg: Move mod_objcg_state() to memcontrol.c
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Improve refill_obj_stock() performance
  mm/memcg: Optimize user context object stock access

 mm/memcontrol.c | 195 +---
 mm/slab.h   |  16 +---
 2 files changed, 171 insertions(+), 40 deletions(-)

-- 
2.18.1



[PATCH-next v5 1/4] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-20 Thread Waiman Long
The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long 
Acked-by: Johannes Weiner 
---
 mm/memcontrol.c | 13 +
 mm/slab.h   | 16 ++--
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..7cd7187a017c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -782,6 +782,19 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item 
idx, int val)
rcu_read_unlock();
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)
+{
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec;
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   mod_memcg_lruvec_state(lruvec, idx, nr);
+   rcu_read_unlock();
+}
+
 /**
  * __count_memcg_events - account VM events in a cgroup
  * @memcg: the memory cgroup
diff --git a/mm/slab.h b/mm/slab.h
index a7268072f017..f1f26f22a94a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -242,6 +242,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache 
*s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 gfp_t gfp, bool new_page);
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+enum node_stat_item idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -286,20 +288,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct 
kmem_cache *s,
return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-  struct pglist_data *pgdat,
-  enum node_stat_item idx, int nr)
-{
-   struct mem_cgroup *memcg;
-   struct lruvec *lruvec;
-
-   rcu_read_lock();
-   memcg = obj_cgroup_memcg(objcg);
-   lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
-   rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
  struct obj_cgroup *objcg,
  gfp_t flags, size_t size,
-- 
2.18.1



[PATCH-next v5 4/4] mm/memcg: Optimize user context object stock access

2021-04-20 Thread Waiman Long
Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured amount of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two sets of object stocks (in the new obj_stock structure)
for task context and interrupt context access respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 94 +++--
 1 file changed, 68 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f87d0b05092..c79b9837cb85 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -782,6 +782,10 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item 
idx, int val)
rcu_read_unlock();
 }
 
+/*
+ * mod_objcg_mlstate() may be called with irq enabled, so
+ * mod_memcg_lruvec_state() should be used.
+ */
 static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
 struct pglist_data *pgdat,
 enum node_stat_item idx, int nr)
@@ -792,7 +796,7 @@ static inline void mod_objcg_mlstate(struct obj_cgroup 
*objcg,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(lruvec, idx, nr);
+   mod_memcg_lruvec_state(lruvec, idx, nr);
rcu_read_unlock();
 }
 
@@ -2054,17 +2058,23 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-   struct mem_cgroup *cached; /* this never be root cgroup */
-   unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
int nr_slab_reclaimable_b;
int nr_slab_unreclaimable_b;
+#else
+   int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+   struct mem_cgroup *cached; /* this never be root cgroup */
+   unsigned int nr_pages;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;
 
struct work_struct work;
unsigned long flags;
@@ -2074,12 +2084,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, 
memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2089,6 +2099,40 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+{
+   struct memcg_stock_pcp *stock;
+
+   if (likely(in_task())) {
+   preempt_disable();
+   stock = this_cpu_ptr(&memcg_stock);
+   return >task_obj;
+   } else {
+   local_irq_save(*pflags);
+   stock = this_cpu_ptr(&memcg_stock);
+   return >irq_obj;
+   }
+}
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (likely(in_task()))
+   preempt_enable();
+   else
+   local_irq_restore(flags);
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2155,7 +2199,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
stock = this_cpu_ptr(&memcg_stock);
-   drain_obj

Re: [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock

2021-04-20 Thread Waiman Long

On 4/19/21 12:55 PM, Johannes Weiner wrote:

On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote:

Currently, the object stock structure caches either reclaimable vmstat
bytes or unreclaimable vmstat bytes in its object stock structure. The
hit rate can be improved if both types of vmstat data can be cached
especially for single-node system.

This patch supports the caching of both types of vmstat data, though
at the expense of a slightly increased complexity in the caching code.
For large objects (>= PAGE_SIZE), the vmstat array update is done directly
without going through the stock caching step.

On a 2-socket Cascade Lake server with instrumentation enabled, the
miss rates are shown in the table below.

   Initial bootup:

   Kernel         __mod_objcg_state    mod_objcg_state    %age
   ------         -----------------    ---------------    ----
   Before patch          634400             3243830       19.6%
   After patch           419810             3182424       13.2%

   Parallel kernel build:

   Kernel         __mod_objcg_state    mod_objcg_state    %age
   ------         -----------------    ---------------    ----
   Before patch        24329265           142512465       17.1%
   After patch         24051721           142445825       16.9%

There was a decrease of miss rate after initial system bootup. However,
the miss rate for parallel kernel build remained about the same probably
because most of the touched kmemcache objects were reclaimable inodes
and dentries.

Signed-off-by: Waiman Long 
---
  mm/memcontrol.c | 79 +++--
  1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c13502eab282..a6dd18f6d8a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2212,8 +2212,8 @@ struct obj_stock {
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
-   int vmstat_idx;
-   int vmstat_bytes;
+   int reclaimable_bytes;  /* NR_SLAB_RECLAIMABLE_B */
+   int unreclaimable_bytes;/* NR_SLAB_UNRECLAIMABLE_B */

How about

int nr_slab_reclaimable_b;
int nr_slab_unreclaimable_b;

so you don't need the comments?


Sure, will make the change.



  #else
int dummy[0];
  #endif
@@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
pglist_data *pgdat,
 enum node_stat_item idx, int nr)
  {
unsigned long flags;
-   struct obj_stock *stock = get_obj_stock();
+   struct obj_stock *stock;
+   int *bytes, *alt_bytes, alt_idx;
+
+   /*
+* Directly update vmstat array for big object.
+*/
+   if (unlikely(abs(nr) >= PAGE_SIZE))
+   goto update_vmstat;

This looks like an optimization independent of the vmstat item split?

It may not be that helpful. I am going to take it out in the next version.



+   stock = get_obj_stock();
+   if (idx == NR_SLAB_RECLAIMABLE_B) {
+   bytes = &stock->reclaimable_bytes;
+   alt_bytes = &stock->unreclaimable_bytes;
+   alt_idx = NR_SLAB_UNRECLAIMABLE_B;
+   } else {
+   bytes = &stock->unreclaimable_bytes;
+   alt_bytes = &stock->reclaimable_bytes;
+   alt_idx = NR_SLAB_RECLAIMABLE_B;
+   }
  
  	/*

-* Save vmstat data in stock and skip vmstat array update unless
-* accumulating over a page of vmstat data or when pgdat or idx
+* Try to save vmstat data in stock and skip vmstat array update
+* unless accumulating over a page of vmstat data or when pgdat
 * changes.
 */
if (stock->cached_objcg != objcg) {
/* Output the current data as is */
-   } else if (!stock->vmstat_bytes) {
-   /* Save the current data */
-   stock->vmstat_bytes = nr;
-   stock->vmstat_idx = idx;
-   stock->cached_pgdat = pgdat;
-   nr = 0;
-   } else if ((stock->cached_pgdat != pgdat) ||
-  (stock->vmstat_idx != idx)) {
-   /* Output the cached data & save the current data */
-   swap(nr, stock->vmstat_bytes);
-   swap(idx, stock->vmstat_idx);
+   } else if (stock->cached_pgdat != pgdat) {
+   /* Save the current data and output cached data, if any */
+   swap(nr, *bytes);
swap(pgdat, stock->cached_pgdat);
+   if (*alt_bytes) {
+   __mod_objcg_state(objcg, pgdat, alt_idx,
+ *alt_bytes);
+   *alt_bytes = 0;
+   }

As per the other email, I really don't think optimizing the pgdat
switch (in a percpu cache) is worth this level of complexity.


I am going to simplify it in the next version.





} else {
-   stock->vmstat_bytes += nr

Re: [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-19 Thread Waiman Long

On 4/19/21 12:38 PM, Johannes Weiner wrote:

On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote:

Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To reduce it further, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change. Caching the vmstat data in the per-cpu stock replaces two
writes to non-hot cachelines (for the memcg-specific and memcg-lruvec-
specific vmstat data) with a write to a hot local stock cacheline.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 20% (634400 out of 3243830)
of the calls to mod_objcg_state() led to an actual call
to __mod_objcg_state() after initial boot. When doing a parallel kernel
build, the figure was about 17% (24329265 out of 142512465). So caching
the vmstat data reduces the number of calls to __mod_objcg_state()
by more than 80%.

Signed-off-by: Waiman Long 
Reviewed-by: Shakeel Butt 
---
  mm/memcontrol.c | 64 ++---
  1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc9032f28f2e..693453f95d99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
  
  #ifdef CONFIG_MEMCG_KMEM

struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int vmstat_idx;
+   int vmstat_bytes;
  #endif
  
  	struct work_struct work;

@@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
css_put(&memcg->css);
  }
  
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,

-enum node_stat_item idx, int nr)
+static inline void __mod_objcg_state(struct obj_cgroup *objcg,
+struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)

This naming is dangerous, as the __mod_foo naming scheme we use
everywhere else suggests it's the same function as mod_foo() just with
preemption/irqs disabled.

I will change its name to, say, mod_objcg_mlstate() to indicate that it 
is something different. Actually, it is hard to come up with a good name 
which is not too long.




@@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
pglist_data *pgdat,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
+   __mod_memcg_lruvec_state(lruvec, idx, nr);
rcu_read_unlock();
  }
  
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,

+enum node_stat_item idx, int nr)
+{
+   struct memcg_stock_pcp *stock;
+   unsigned long flags;
+
+   local_irq_save(flags);
+   stock = this_cpu_ptr(&memcg_stock);
+
+   /*
+* Save vmstat data in stock and skip vmstat array update unless
+* accumulating over a page of vmstat data or when pgdat or idx
+* changes.
+*/
+   if (stock->cached_objcg != objcg) {
+   /* Output the current data as is */

When you get here with the wrong objcg and hit the cold path, it's
usually immediately followed by an uncharge -> refill_obj_stock() that
will then flush and reset cached_objcg.

Instead of doing two cold paths, why not flush the old objcg right
away and set the new so that refill_obj_stock() can use the fast path?


That is a good idea. Will do that.

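A rough sketch (not the actual follow-up patch) of the suggested objcg
switch at the top of mod_objcg_state(), reusing the drain_obj_stock()
and obj_cgroup_get() helpers already present in this series:

	if (stock->cached_objcg != objcg) {
		/* Flush the old objcg's cached data and install the new one */
		drain_obj_stock(stock);
		obj_cgroup_get(objcg);
		stock->cached_objcg = objcg;
		stock->cached_pgdat = pgdat;
		stock->vmstat_idx = idx;
		stock->vmstat_bytes = 0;
	}

A subsequent refill_obj_stock() or uncharge for the same objcg would then
hit the cached fast path instead of a second cold path.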




+   } else if (!stock->vmstat_bytes) {
+   /* Save the current data */
+   stock->vmstat_bytes = nr;
+   stock->vmstat_idx = idx;
+   stock->cached_pgdat = pgdat;
+   nr = 0;
+   } else if ((stock->cached_pgdat != pgdat) ||
+  (stock->vmstat_idx != idx)) {
+   /* Output the cached data & save the current data */
+   swap(nr, stock->vmstat_bytes);
+   swap(idx, stock->vmstat_idx);
+   swap(pgdat, stock->cached_pgdat);

Is this optimization worth doing?

You later split vmstat_bytes and idx doesn't change anymore.


I am going to merge patch 2 and patch 4 to avoid the confusion.




How often does the pgdat change? This is a per-cpu cache after all,
and the numa node a given cpu allocates from tends to not change that
often. Even with interleaving mode, which I think is pretty rare, the
interleaving happens at t

Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Waiman Long

On 4/19/21 5:11 PM, Johannes Weiner wrote:



BTW, have you ever thought of moving the cgroup-v1 specific functions out
into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c?

I thought of that before, but memcontrol.c is a frequently changed file and
so a bit hard to do.

I haven't looked too deeply at it so far, but I think it would make
sense to try.

There are indeed many of the entry paths from the MM code that are
shared between cgroup1 and cgroup2, with smaller branches here and
there to adjust behavior. Those would throw conflicts, but those we
should probably keep in the main memcontrol.c for readability anyway.

But there is also plenty of code that is exclusively about cgroup1,
and which actually doesn't change much in a long time. Moving that
elsewhere shouldn't create difficult conflicts - maybe a few line
offset warnings or fuzz in the diff context of unrelated changes:

- the soft limit tree and soft limit reclaim

- the threshold and oom event notification stuff

- the charge moving code

- remaining v1 interface files, as well as their helper functions

 From a quick scan, this adds up to ~2,500 lines of old code with no
actual dependencies from the common code or from v2, and which could
be moved out of the way without disrupting ongoing development much.


Right.

Currently memcontrol.c has over 7000 lines of code and keeps growing.
That makes it harder to read, navigate and update. If we can cut out 
2000 lines or more from memcontrol.c, it will make it more manageable.


Cheers,
Longman



Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Waiman Long

On 4/19/21 1:19 PM, Waiman Long wrote:

On 4/19/21 1:13 PM, Johannes Weiner wrote:

On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:

On 4/19/21 11:21 AM, Waiman Long wrote:

On 4/19/21 11:14 AM, Johannes Weiner wrote:

On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:

The mod_objcg_state() function is moved from mm/slab.h to
mm/memcontrol.c
so that further optimization can be done to it in later patches 
without

exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long 
---
   mm/memcontrol.c | 13 +
   mm/slab.h   | 16 ++--
   2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
page *page, int order)
   css_put(&memcg->css);
   }
   +void mod_objcg_state(struct obj_cgroup *objcg, struct
pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
+{
+    struct mem_cgroup *memcg;
+    struct lruvec *lruvec = NULL;
+
+    rcu_read_lock();
+    memcg = obj_cgroup_memcg(objcg);
+    lruvec = mem_cgroup_lruvec(memcg, pgdat);
+    mod_memcg_lruvec_state(lruvec, idx, nr);
+    rcu_read_unlock();
+}

It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().

But no deal breaker if there isn't another revision.

Acked-by: Johannes Weiner 


Yes, there will be another revision by rebasing the patch series on
linux-next. I will move the function then.

OK, it turns out that mod_objcg_state() is only defined if
CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM 
block with

the other obj_stock functions. I think I will keep it there.

The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
even that doesn't make sense because while slob doesn't support slab
object tracking, we can still do all the other stuff we do under
KMEM. I have a patch in the works to remove the symbol and ifdefs.

With that in mind, it's better to group the functions based on what
they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
remove another ifdef later than it is to reorder the functions.

OK, I will make the move then. Thanks for the explanation. 


BTW, have you ever thought of moving the cgroup-v1 specific functions 
out into a separate memcontrol-v1.c file just like 
kernel/cgroup/cgroup-v1.c?


I thought of that before, but memcontrol.c is a frequently changed file 
and so a bit hard to do.


Cheers,
Longman



Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Waiman Long

On 4/19/21 1:13 PM, Johannes Weiner wrote:

On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:

On 4/19/21 11:21 AM, Waiman Long wrote:

On 4/19/21 11:14 AM, Johannes Weiner wrote:

On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:

The mod_objcg_state() function is moved from mm/slab.h to
mm/memcontrol.c
so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long 
---
   mm/memcontrol.c | 13 +
   mm/slab.h   | 16 ++--
   2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
page *page, int order)
   css_put(&memcg->css);
   }
   +void mod_objcg_state(struct obj_cgroup *objcg, struct
pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
+{
+    struct mem_cgroup *memcg;
+    struct lruvec *lruvec = NULL;
+
+    rcu_read_lock();
+    memcg = obj_cgroup_memcg(objcg);
+    lruvec = mem_cgroup_lruvec(memcg, pgdat);
+    mod_memcg_lruvec_state(lruvec, idx, nr);
+    rcu_read_unlock();
+}

It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().

But no deal breaker if there isn't another revision.

Acked-by: Johannes Weiner 


Yes, there will be another revision by rebasing the patch series on
linux-next. I will move the function then.

OK, it turns out that mod_objcg_state() is only defined if
CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with
the other obj_stock functions. I think I will keep it there.

The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
even that doesn't make sense because while slob doesn't support slab
object tracking, we can still do all the other stuff we do under
KMEM. I have a patch in the works to remove the symbol and ifdefs.

With that in mind, it's better to group the functions based on what
they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
remove another ifdef later than it is to reorder the functions.


OK, I will make the move then. Thanks for the explanation.

Cheers,
Longman



Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Waiman Long

On 4/19/21 11:21 AM, Waiman Long wrote:

On 4/19/21 11:14 AM, Johannes Weiner wrote:

On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
The mod_objcg_state() function is moved from mm/slab.h to 
mm/memcontrol.c

so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long 
---
  mm/memcontrol.c | 13 +
  mm/slab.h   | 16 ++--
  2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page 
*page, int order)

  css_put(&memcg->css);
  }
  +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data 
*pgdat,

+ enum node_stat_item idx, int nr)
+{
+    struct mem_cgroup *memcg;
+    struct lruvec *lruvec = NULL;
+
+    rcu_read_lock();
+    memcg = obj_cgroup_memcg(objcg);
+    lruvec = mem_cgroup_lruvec(memcg, pgdat);
+    mod_memcg_lruvec_state(lruvec, idx, nr);
+    rcu_read_unlock();
+}

It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().

But no deal breaker if there isn't another revision.

Acked-by: Johannes Weiner 

Yes, there will be another revision by rebasing the patch series on
linux-next. I will move the function then.


OK, it turns out that mod_objcg_state() is only defined if 
CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block 
with the other obj_stock functions. I think I will keep it there.


Thanks,
Longman



Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance

2021-04-19 Thread Waiman Long

On 4/19/21 2:06 AM, Muchun Song wrote:

On Mon, Apr 19, 2021 at 8:01 AM Waiman Long  wrote:

There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
be used again, which leads to another call to drain_obj_stock() and
obj_cgroup_get(), as well as an atomic retrieval of the available bytes from
obj_cgroup. That is costly. Instead, we should just uncharge the excess
pages, reduce the stock bytes and be done with it. The drain_obj_stock()
function should only be called when obj_cgroup changes.

Secondly, when charging an object of size not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock()
to uncharge 1 page. To avoid the additional uncharge in this case,
a new overfill flag is added to refill_obj_stock() which will be set
when called from obj_cgroup_charge().

Signed-off-by: Waiman Long 
---
  mm/memcontrol.c | 23 +--
  1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6dd18f6d8a8..d13961352eef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 return false;
  }

-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+bool overfill)
  {
 unsigned long flags;
 struct obj_stock *stock = get_obj_stock(&flags);
+   unsigned int nr_pages = 0;

 if (stock->cached_objcg != objcg) { /* reset if necessary */
-   drain_obj_stock(stock);
+   if (stock->cached_objcg)
+   drain_obj_stock(stock);
 obj_cgroup_get(objcg);
 stock->cached_objcg = objcg;
 stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
 }
 stock->nr_bytes += nr_bytes;

-   if (stock->nr_bytes > PAGE_SIZE)
-   drain_obj_stock(stock);
+   if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+   nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+   stock->nr_bytes &= (PAGE_SIZE - 1);
+   }

 put_obj_stock(flags);
+
+   if (nr_pages) {
+   rcu_read_lock();
+   __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+   rcu_read_unlock();
+   }

It is not safe to call __memcg_kmem_uncharge() under rcu lock
and without holding a reference to memcg. More details can refer
to the following link.

https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuc...@bytedance.com/

In the above patchset, we introduce obj_cgroup_uncharge_pages to
uncharge some pages from object cgroup. You can use this safe
API.


Thanks for the comment. Will update my patch with call to 
obj_cgroup_uncharge_pages().
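For reference, a possible shape of the tail of refill_obj_stock() with
that helper (a sketch assuming obj_cgroup_uncharge_pages() from the
series linked above, not the final code):

	put_obj_stock(flags);

	if (nr_pages)
		obj_cgroup_uncharge_pages(objcg, nr_pages);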

Cheers,
Longman



Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Waiman Long

On 4/19/21 11:14 AM, Johannes Weiner wrote:

On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:

The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long 
---
  mm/memcontrol.c | 13 +
  mm/slab.h   | 16 ++--
  2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
css_put(&memcg->css);
  }
  
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,

+enum node_stat_item idx, int nr)
+{
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   mod_memcg_lruvec_state(lruvec, idx, nr);
+   rcu_read_unlock();
+}

It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().

But no deal breaker if there isn't another revision.

Acked-by: Johannes Weiner 

Yes, there will be another revision by rebasing the patch series on
linux-next. I will move the function then.


Cheers,
Longman



Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance

2021-04-19 Thread Waiman Long

On 4/19/21 11:00 AM, Shakeel Butt wrote:

On Sun, Apr 18, 2021 at 11:07 PM Muchun Song  wrote:

On Mon, Apr 19, 2021 at 8:01 AM Waiman Long  wrote:

There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
be used again, which leads to another call to drain_obj_stock() and
obj_cgroup_get(), as well as an atomic retrieval of the available bytes from
obj_cgroup. That is costly. Instead, we should just uncharge the excess
pages, reduce the stock bytes and be done with it. The drain_obj_stock()
function should only be called when obj_cgroup changes.

Secondly, when charging an object of size not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock()
to uncharge 1 page. To avoid the additional uncharge in this case,
a new overfill flag is added to refill_obj_stock() which will be set
when called from obj_cgroup_charge().

Signed-off-by: Waiman Long 
---
  mm/memcontrol.c | 23 +--
  1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6dd18f6d8a8..d13961352eef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 return false;
  }

-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+bool overfill)
  {
 unsigned long flags;
 struct obj_stock *stock = get_obj_stock(&flags);
+   unsigned int nr_pages = 0;

 if (stock->cached_objcg != objcg) { /* reset if necessary */
-   drain_obj_stock(stock);
+   if (stock->cached_objcg)
+   drain_obj_stock(stock);
 obj_cgroup_get(objcg);
 stock->cached_objcg = objcg;
 stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
 }
 stock->nr_bytes += nr_bytes;

-   if (stock->nr_bytes > PAGE_SIZE)
-   drain_obj_stock(stock);
+   if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+   nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+   stock->nr_bytes &= (PAGE_SIZE - 1);
+   }

 put_obj_stock(flags);
+
+   if (nr_pages) {
+   rcu_read_lock();
+   __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+   rcu_read_unlock();
+   }

It is not safe to call __memcg_kmem_uncharge() under rcu lock
and without holding a reference to memcg. More details can refer
to the following link.

https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuc...@bytedance.com/

In the above patchset, we introduce obj_cgroup_uncharge_pages to
uncharge some pages from object cgroup. You can use this safe
API.


I would recommend just rebase the patch series over the latest mm tree.


I see, I will rebase it to the latest mm tree.

Thanks,
Longman



[PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock

2021-04-18 Thread Waiman Long
Currently, the object stock structure caches either reclaimable vmstat
bytes or unreclaimable vmstat bytes in its object stock structure. The
hit rate can be improved if both types of vmstat data can be cached
especially for a single-node system.

This patch supports the caching of both types of vmstat data, though
at the expense of a slightly increased complexity in the caching code.
For large objects (>= PAGE_SIZE), the vmstat array update is done directly
without going through the stock caching step.

On a 2-socket Cascade Lake server with instrumentation enabled, the
miss rates are shown in the table below.

  Initial bootup:

  Kernel         __mod_objcg_state   mod_objcg_state   %age
  ------         -----------------   ---------------   ----
  Before patch   634400              3243830           19.6%
  After patch    419810              3182424           13.2%

  Parallel kernel build:

  Kernel         __mod_objcg_state   mod_objcg_state   %age
  ------         -----------------   ---------------   ----
  Before patch   24329265            142512465         17.1%
  After patch    24051721            142445825         16.9%

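(Here %age is the ratio __mod_objcg_state/mod_objcg_state, e.g.
634400 / 3243830 ≈ 19.6% for the unpatched initial bootup.)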
There was a decrease of miss rate after initial system bootup. However,
the miss rate for parallel kernel build remained about the same probably
because most of the touched kmemcache objects were reclaimable inodes
and dentries.

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 79 +++--
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c13502eab282..a6dd18f6d8a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2212,8 +2212,8 @@ struct obj_stock {
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
-   int vmstat_idx;
-   int vmstat_bytes;
+   int reclaimable_bytes;  /* NR_SLAB_RECLAIMABLE_B */
+   int unreclaimable_bytes;/* NR_SLAB_UNRECLAIMABLE_B */
 #else
int dummy[0];
 #endif
@@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
pglist_data *pgdat,
 enum node_stat_item idx, int nr)
 {
unsigned long flags;
-   struct obj_stock *stock = get_obj_stock(&flags);
+   struct obj_stock *stock;
+   int *bytes, *alt_bytes, alt_idx;
+
+   /*
+* Directly update vmstat array for big object.
+*/
+   if (unlikely(abs(nr) >= PAGE_SIZE))
+   goto update_vmstat;
+
+   stock = get_obj_stock(&flags);
+   if (idx == NR_SLAB_RECLAIMABLE_B) {
+   bytes = &stock->reclaimable_bytes;
+   alt_bytes = &stock->unreclaimable_bytes;
+   alt_idx = NR_SLAB_UNRECLAIMABLE_B;
+   } else {
+   bytes = &stock->unreclaimable_bytes;
+   alt_bytes = &stock->reclaimable_bytes;
+   alt_idx = NR_SLAB_RECLAIMABLE_B;
+   }
 
/*
-* Save vmstat data in stock and skip vmstat array update unless
-* accumulating over a page of vmstat data or when pgdat or idx
+* Try to save vmstat data in stock and skip vmstat array update
+* unless accumulating over a page of vmstat data or when pgdat
 * changes.
 */
if (stock->cached_objcg != objcg) {
/* Output the current data as is */
-   } else if (!stock->vmstat_bytes) {
-   /* Save the current data */
-   stock->vmstat_bytes = nr;
-   stock->vmstat_idx = idx;
-   stock->cached_pgdat = pgdat;
-   nr = 0;
-   } else if ((stock->cached_pgdat != pgdat) ||
-  (stock->vmstat_idx != idx)) {
-   /* Output the cached data & save the current data */
-   swap(nr, stock->vmstat_bytes);
-   swap(idx, stock->vmstat_idx);
+   } else if (stock->cached_pgdat != pgdat) {
+   /* Save the current data and output cached data, if any */
+   swap(nr, *bytes);
swap(pgdat, stock->cached_pgdat);
+   if (*alt_bytes) {
+   __mod_objcg_state(objcg, pgdat, alt_idx,
+ *alt_bytes);
+   *alt_bytes = 0;
+   }
} else {
-   stock->vmstat_bytes += nr;
-   if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
-   nr = stock->vmstat_bytes;
-   stock->vmstat_bytes = 0;
+   *bytes += nr;
+   if (abs(*bytes) > PAGE_SIZE) {
+   nr = *bytes;
+   *bytes = 0;
} else {
nr = 0;
}
}
-   if (nr)
-   __mod_objcg_state(objcg, pgdat, idx, nr);
-
put_obj_stock(flags);
+   if (!nr)
+   return;
+update_vmstat:
+   __mod_objcg_state(objcg, pgdat, idx, nr);
 

[PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance

2021-04-18 Thread Waiman Long
There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
be used again, which leads to another call to drain_obj_stock() and
obj_cgroup_get(), as well as an atomic retrieval of the available bytes from
obj_cgroup. That is costly. Instead, we should just uncharge the excess
pages, reduce the stock bytes and be done with it. The drain_obj_stock()
function should only be called when obj_cgroup changes.

Secondly, when charging an object of size not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock()
to uncharge 1 page. To avoid the additional uncharge in this case,
a new overfill flag is added to refill_obj_stock() which will be set
when called from obj_cgroup_charge().

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6dd18f6d8a8..d13961352eef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+bool overfill)
 {
unsigned long flags;
struct obj_stock *stock = get_obj_stock(&flags);
+   unsigned int nr_pages = 0;
 
if (stock->cached_objcg != objcg) { /* reset if necessary */
-   drain_obj_stock(stock);
+   if (stock->cached_objcg)
+   drain_obj_stock(stock);
obj_cgroup_get(objcg);
stock->cached_objcg = objcg;
stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
}
stock->nr_bytes += nr_bytes;
 
-   if (stock->nr_bytes > PAGE_SIZE)
-   drain_obj_stock(stock);
+   if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+   nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+   stock->nr_bytes &= (PAGE_SIZE - 1);
+   }
 
put_obj_stock(flags);
+
+   if (nr_pages) {
+   rcu_read_lock();
+   __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+   rcu_read_unlock();
+   }
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t 
gfp, size_t size)
 
ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
if (!ret && nr_bytes)
-   refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+   refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
 
css_put(&memcg->css);
return ret;
@@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t 
gfp, size_t size)
 
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-   refill_obj_stock(objcg, size);
+   refill_obj_stock(objcg, size, false);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1



[PATCH v4 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-18 Thread Waiman Long
 v4:
  - Drop v3 patch 1 as well as the modification to mm/percpu.c, as the percpu
vmstat update isn't frequent enough to be worth caching.
  - Add a new patch 1 to move mod_objcg_state() to memcontrol.c instead.
  - Combine v3 patches 4 & 5 into a single patch (patch 3).
  - Add a new patch 4 to cache both reclaimable & unreclaimable vmstat updates.
  - Add a new patch 5 to improve refill_obj_stock() performance.

 v3:
  - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
in patch 3.
  - Remove redundant current_obj_stock() call in patch 5.

 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that perform a lot of kmemcache allocations and
de-allocations may experience performance regressions, as illustrated
in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on either a small 32-byte
object or a big 4k object at module init time is used for benchmarking. The
test was run on a Cascade Lake server with turbo-boosting disabled to
reduce run-to-run variation.

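The benchmark module itself is not included in the posting. A minimal
sketch of such a module (the cache name, the SLAB_ACCOUNT flag and the
timing details below are assumptions for illustration, not the author's
actual code) might look like:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

#define NR_LOOPS	100000000UL
#define OBJ_SIZE	32		/* or 4096 for the big-object run */

static int __init kmem_bench_init(void)
{
	struct kmem_cache *cachep;
	unsigned long i;
	ktime_t start;
	void *obj;

	/* SLAB_ACCOUNT enables per-object memcg accounting for this cache */
	cachep = kmem_cache_create("kmem_bench", OBJ_SIZE, 0, SLAB_ACCOUNT, NULL);
	if (!cachep)
		return -ENOMEM;

	start = ktime_get();
	for (i = 0; i < NR_LOOPS; i++) {
		obj = kmem_cache_alloc(cachep, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(cachep, obj);
	}
	pr_info("kmem_bench: %lu alloc/free pairs in %lld ns\n", i,
		ktime_to_ns(ktime_sub(ktime_get(), start)));

	kmem_cache_destroy(cachep);
	return 0;
}

static void __exit kmem_bench_exit(void)
{
}

module_init(kmem_bench_init);
module_exit(kmem_bench_exit);
MODULE_LICENSE("GPL");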
With memory accounting disabled, the run time was 2.848s for the small
object and 2.890s for the big object. With memory accounting enabled, the run
times with the application of various patches in the patchset were:

  Applied patches   Run time   Accounting overhead   %age 1   %age 2
  ---------------   --------   -------------------   ------   ------

  Small 32-byte object:
   None             10.570s     7.722s               100.0%   271.1%
   1-2               8.560s     5.712s                74.0%   200.6%
   1-3               6.592s     3.744s                48.5%   131.5%
   1-4               7.154s     4.306s                55.8%   151.2%
   1-5               7.192s     4.344s                56.3%   152.5%

  Large 4k object:
   None             20.612s    17.722s               100.0%   613.2%
   1-2              20.354s    17.464s                98.5%   604.3%
   1-3              19.395s    16.505s                93.1%   571.1%
   1-4              19.094s    16.204s                91.4%   560.7%
   1-5              13.576s    10.686s                60.3%   369.8%

  N.B. %age 1 = overhead/unpatched overhead
   %age 2 = overhead/accounting disable time

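For example, for the small 32-byte object with no patches applied, the
accounting overhead is 10.570s - 2.848s = 7.722s, so %age 1 = 7.722/7.722
= 100.0% and %age 2 = 7.722/2.848 ≈ 271.1%.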
The small object test exercises mainly the object stock charging and
vmstat update code paths. The large object test also exercises the
refill_obj_stock() and  __memcg_kmem_charge()/__memcg_kmem_uncharge()
code paths.

The vmstat data stock caching helps in the small object test,
but not so much on the large object test. Similarly, eliminating
irq_disable/irq_enable helps in the small object test and less so in
the large object test. Caching both reclaimable and non-reclaimable
vmstat data actually regresses performance a bit in this particular
small object test.

The final patch to optimize refill_obj_stock() has negligible impact
on the small object test as this code path isn't being exercised. The
large object test, however, sees a pretty good performance improvement
with this patch.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Move mod_objcg_state() to memcontrol.c
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Optimize user context object stock access
  mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
  mm/memcg: Improve refill_obj_stock() performance

 mm/memcontrol.c | 199 +---
 mm/slab.h   |  16 +---
 2 files changed, 175 insertions(+), 40 deletions(-)

-- 
2.18.1



[PATCH v4 3/5] mm/memcg: Optimize user context object stock access

2021-04-18 Thread Waiman Long
Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two sets of object stocks (in the new obj_stock structure)
for task context and interrupt context access respectively.

The task context object stock can be accessed after disabling preemption,
which is cheap in a non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 94 +++--
 1 file changed, 68 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 693453f95d99..c13502eab282 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2207,17 +2207,23 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-   struct mem_cgroup *cached; /* this never be root cgroup */
-   unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
int vmstat_idx;
int vmstat_bytes;
+#else
+   int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+   struct mem_cgroup *cached; /* this never be root cgroup */
+   unsigned int nr_pages;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;
 
struct work_struct work;
unsigned long flags;
@@ -2227,12 +2233,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, 
memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2242,6 +2248,40 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+{
+   struct memcg_stock_pcp *stock;
+
+   if (likely(in_task())) {
+   preempt_disable();
+   stock = this_cpu_ptr(&memcg_stock);
+   return >task_obj;
+   } else {
+   local_irq_save(*pflags);
+   stock = this_cpu_ptr(&memcg_stock);
+   return >irq_obj;
+   }
+}
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (likely(in_task()))
+   preempt_enable();
+   else
+   local_irq_restore(flags);
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2308,7 +2348,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
stock = this_cpu_ptr(&memcg_stock);
-   drain_obj_stock(stock);
+   drain_obj_stock(&stock->irq_obj);
+   if (in_task())
+   drain_obj_stock(&stock->task_obj);
drain_stock(stock);
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3153,6 +3195,10 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
css_put(&memcg->css);
 }
 
+/*
+ * __mod_objcg_state() may be called with irq enabled, so
+ * mod_memcg_lruvec_state() should be used.
+ */
 static inline void __mod_objcg_state(struct obj_cgroup *objcg,
 struct pglist_data *pgdat,
 enum node_stat_item idx, int nr)
@@ -3163,18 +3209,15 @@ static inline void __mod_objcg_state(struct obj_cgroup 
*objcg,
rcu_read_lock();
memcg = obj_cgroup_memcg(objc

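The archive truncates the rest of this patch here. For illustration
only, the intended caller pattern for the get_obj_stock()/put_obj_stock()
helpers above is roughly the following sketch (not a quote from the
patch):

	unsigned long flags;
	struct obj_stock *stock = get_obj_stock(&flags);

	/* touch the per-cpu cached state, e.g. stock->nr_bytes += nr_bytes */

	put_obj_stock(flags);

In task context this costs only a preempt_disable()/preempt_enable()
pair; only in interrupt context does it fall back to local_irq_save()/
local_irq_restore().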
[PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-18 Thread Waiman Long
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To reduce it further, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change. Caching the vmstat data in the per-cpu stock replaces two
writes to non-hot cachelines (for the memcg-specific and memcg-lruvec-
specific vmstat data) with a write to a hot local stock cacheline.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 20% (634400 out of 3243830)
of the time when mod_objcg_state() is called leads to an actual call
to __mod_objcg_state() after initial boot. When doing parallel kernel
build, the figure was about 17% (24329265 out of 142512465). So caching
the vmstat data reduces the number of calls to __mod_objcg_state()
by more than 80%.

Signed-off-by: Waiman Long 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 64 ++---
 1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc9032f28f2e..693453f95d99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int vmstat_idx;
+   int vmstat_bytes;
 #endif
 
struct work_struct work;
@@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
css_put(&memcg->css);
 }
 
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
-enum node_stat_item idx, int nr)
+static inline void __mod_objcg_state(struct obj_cgroup *objcg,
+struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)
 {
struct mem_cgroup *memcg;
struct lruvec *lruvec = NULL;
@@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
pglist_data *pgdat,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
+   __mod_memcg_lruvec_state(lruvec, idx, nr);
rcu_read_unlock();
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)
+{
+   struct memcg_stock_pcp *stock;
+   unsigned long flags;
+
+   local_irq_save(flags);
+   stock = this_cpu_ptr(&memcg_stock);
+
+   /*
+* Save vmstat data in stock and skip vmstat array update unless
+* accumulating over a page of vmstat data or when pgdat or idx
+* changes.
+*/
+   if (stock->cached_objcg != objcg) {
+   /* Output the current data as is */
+   } else if (!stock->vmstat_bytes) {
+   /* Save the current data */
+   stock->vmstat_bytes = nr;
+   stock->vmstat_idx = idx;
+   stock->cached_pgdat = pgdat;
+   nr = 0;
+   } else if ((stock->cached_pgdat != pgdat) ||
+  (stock->vmstat_idx != idx)) {
+   /* Output the cached data & save the current data */
+   swap(nr, stock->vmstat_bytes);
+   swap(idx, stock->vmstat_idx);
+   swap(pgdat, stock->cached_pgdat);
+   } else {
+   stock->vmstat_bytes += nr;
+   if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
+   nr = stock->vmstat_bytes;
+   stock->vmstat_bytes = 0;
+   } else {
+   nr = 0;
+   }
+   }
+   if (nr)
+   __mod_objcg_state(objcg, pgdat, idx, nr);
+
+   local_irq_restore(flags);
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
@@ -3213,6 +3260,17 @@ static void drain_obj_stock(struct memcg_stock_pcp 
*stock)
stock->nr_bytes = 0;
}
 
+   /*
+* Flush the vmstat data in current stock
+*/
+   if (stock->vmstat_bytes) {
+   __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+ stock->vmstat_bytes);
+   stock->cached_pgdat = NULL;
+   stock->vmstat_bytes = 0;
+   stock->vmstat_idx = 0;
+   }
+
obj_cgroup_put(old);
stock->cached_objcg = NULL;
 }
-- 
2.18.1



[PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-18 Thread Waiman Long
The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 13 +
 mm/slab.h   | 16 ++--
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
css_put(&memcg->css);
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+enum node_stat_item idx, int nr)
+{
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   mod_memcg_lruvec_state(lruvec, idx, nr);
+   rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..ae8b85875426 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache 
*s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 gfp_t gfp, bool new_page);
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+enum node_stat_item idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct 
kmem_cache *s,
return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-  struct pglist_data *pgdat,
-  enum node_stat_item idx, int nr)
-{
-   struct mem_cgroup *memcg;
-   struct lruvec *lruvec;
-
-   rcu_read_lock();
-   memcg = obj_cgroup_memcg(objcg);
-   lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
-   rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
  struct obj_cgroup *objcg,
  gfp_t flags, size_t size,
-- 
2.18.1



[PATCH v5] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-15 Thread Waiman Long
The handling of a sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other means).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupts disabled for the whole duration of the calls to
print_*_stats() and print_rq(), which could last for quite some time
if the information dump happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actual serial console implementation of the
system. For instance,

[ 7809.796262] Kernel panic - not syncing: Hard LOCKUP
[ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G  
I  - -  - 4.18.0-301.el8.x86_64 #1
[ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[ 7809.796265] Call Trace:
[ 7809.796265]  
[ 7809.796266]  dump_stack+0x5c/0x80
[ 7809.796266]  panic+0xe7/0x2a9
[ 7809.796267]  nmi_panic.cold.9+0xc/0xc
[ 7809.796267]  watchdog_overflow_callback.cold.7+0x5c/0x70
[ 7809.796268]  __perf_event_overflow+0x52/0xf0
[ 7809.796268]  handle_pmi_common+0x204/0x2a0
[ 7809.796269]  ? __set_pte_vaddr+0x32/0x50
[ 7809.796269]  ? __native_set_fixmap+0x24/0x30
[ 7809.796270]  ? ghes_copy_tofrom_phys+0xd3/0x1c0
[ 7809.796271]  intel_pmu_handle_irq+0xbf/0x160
[ 7809.796271]  perf_event_nmi_handler+0x2d/0x50
[ 7809.796272]  nmi_handle+0x63/0x110
[ 7809.796272]  default_do_nmi+0x49/0x100
[ 7809.796273]  do_nmi+0x17e/0x1e0
[ 7809.796273]  end_repeat_nmi+0x16/0x6f
[ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 
30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 
c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
[ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002
[ 7809.796277] RAX: 0101 RBX: 0246 RCX: 
[ 7809.796278] RDX:  RSI:  RDI: 936b66d0
[ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f
[ 7809.796279] R10:  R11: aa54cd887cc0 R12: 907fd0a29ec0
[ 7809.796280] R13:  R14: 926ab7c0 R15: 
[ 7809.796280]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  
[ 7809.796282]  _raw_spin_lock_irqsave+0x32/0x40
[ 7809.796283]  print_cpu+0x261/0x7c0
[ 7809.796283]  sysrq_sched_debug_show+0x34/0x50
[ 7809.796284]  sysrq_handle_showstate+0xc/0x20
[ 7809.796284]  __handle_sysrq.cold.11+0x48/0xfb
[ 7809.796285]  write_sysrq_trigger+0x2b/0x30
[ 7809.796285]  proc_reg_write+0x39/0x60
[ 7809.796286]  vfs_write+0xa5/0x1a0
[ 7809.796286]  ksys_write+0x4f/0xb0
[ 7809.796287]  do_syscall_64+0x5b/0x1a0
[ 7809.796287]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7809.796288] RIP: 0033:0x7fabe4ceb648

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't
need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be problematic if
multiple instances are running. Allocating a stack buffer of PATH_MAX
bytes is not feasible because of the limited size of the kernel stack.

The solution implemented in this patch is to allow only one caller at a
time to use the full-sized group_path[], while other simultaneous callers
will have to use shorter stack buffers with the possibility of path
name truncation. A "..." suffix will be printed if truncation may have
happened.  The cgroup path name is provided for informational purposes
only, so occasional path name truncation should not be a big problem.

Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 kernel/sched/debug.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..9c8b3ed2199a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,16 +468,37 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
+static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
 
-static char *task_group_path(struct task_group *tg)
+static void task_group_path(struct 

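The archive cuts the patch off here. For illustration only, the
trylock-with-fallback scheme described in the changelog could be
sketched as below; the 128-byte buffer, the truncation check and the
macro name are assumptions rather than the posted code:

#define SEQ_printf_task_group_path(m, tg, fmt...)			\
{									\
	if (spin_trylock(&sched_debug_lock)) {				\
		/* Sole user of the full-sized group_path[] buffer */	\
		task_group_path(tg, group_path, sizeof(group_path));	\
		SEQ_printf(m, fmt, group_path);				\
		spin_unlock(&sched_debug_lock);				\
	} else {							\
		/* Contended: fall back to a small stack buffer */	\
		char buf[128];						\
									\
		task_group_path(tg, buf, sizeof(buf));			\
		if (strlen(buf) >= sizeof(buf) - 1)			\
			strcpy(buf + sizeof(buf) - 4, "...");		\
		SEQ_printf(m, fmt, buf);				\
	}								\
}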
Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Waiman Long

On 4/15/21 3:40 PM, Johannes Weiner wrote:

On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:

On 4/15/21 2:10 PM, Johannes Weiner wrote:

On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:

On 4/15/21 12:30 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function call goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
refill_obj_stock(objcg, size);
}
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx)

The optimization makes sense.

But please don't combine independent operations like this into a
single function. It makes for an unclear parameter list, it's a pain
in the behind to change the constituent operations later on, and it
has a habit of attracting more random bools over time. E.g. what if
the caller already has irqs disabled? What if it KNOWS that irqs are
enabled and it could use local_irq_disable() instead of save?

Just provide an __obj_cgroup_uncharge() that assumes irqs are
disabled, combine with the existing __mod_memcg_lruvec_state(), and
bubble the irq handling up to those callsites which know better.


That will also work. However, the reason I did that was because of patch 5
in the series. I could put the get_obj_stock() and put_obj_stock() code in
slab.h and allow them to be used directly in various places, but hiding it in
one function is easier.

Yeah it's more obvious after getting to patch 5.

But with the irq disabling gone entirely, is there still an incentive
to combine the atomic section at all? Disabling preemption is pretty
cheap, so it wouldn't matter to just do it twice.

I.e. couldn't the final sequence in slab code simply be

objcg_uncharge()
mod_objcg_state()

again and each function disables preemption (and in the rare case
irqs) as it sees fit?

You lose the irqsoff batching in the cold path, but as you say, hit
rates are pretty good, and it doesn't seem worth complicating the code
for the cold path.


That does make sense, though a little bit of performance may be lost. I will
try that out to see how it works out performance-wise.

Thanks.

Even if we still end up doing it, it's great to have that cost
isolated, so we know how much extra code complexity corresponds to how
much performance gain. It seems the task/irq split could otherwise be
a pretty localized change with no API implications.

I still want to move the mod_objcg_state() function to memcontrol.c though,
as I don't want to put any obj_stock stuff in mm/slab.h.


Cheers,
Longman



Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access

2021-04-15 Thread Waiman Long

On 4/15/21 2:53 PM, Johannes Weiner wrote:

On Thu, Apr 15, 2021 at 02:16:17PM -0400, Waiman Long wrote:

On 4/15/21 1:53 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in a non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupts disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 

This makes sense, and also explains the previous patch a bit
better. But please merge those two.

The reason I broke it into two is so that the patches are individually
easier to review. I prefer to update the commit log of patch 4 to explain
why the obj_stock structure is introduced instead of merging the two.

Well I did not find them easier to review separately.


@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
stock = this_cpu_ptr(&memcg_stock);
-   drain_obj_stock(&stock->obj);
+   drain_obj_stock(&stock->irq_obj);
+   if (in_task())
+   drain_obj_stock(&stock->task_obj);
drain_stock(stock);
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
memcg = obj_cgroup_memcg(objcg);
if (pgdat)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();

This is actually a bug introduced in the earlier patch, isn't it?
Calling __mod_memcg_lruvec_state() without irqs disabled...


Not really. In patch 3, mod_objcg_state() is called only in the stock update
context where interrupts had already been disabled. But now that is no
longer the case, which is why I need to update mod_objcg_state() to make sure
irq is disabled before updating the vmstat data array.

Oh, I see it now. Man, that's subtle. We've had several very hard to
track down preemption bugs in those stats, because they manifest as
counter imbalances and you have no idea if there is a leak somewhere.

The convention for these functions is that the __ prefix indicates
that preemption has been suitably disabled. Please always follow this
convention, even if the semantic change is temporary.

I see. I will fix that in the next version.


Btw, is there a reason why the stock caching isn't just part of
mod_objcg_state()? Why does the user need to choose if they want the
caching or not? It's not like we ask for this when charging, either.

Yes, I can revert it to make mod_objcg_state() call
mod_obj_stock_state() internally instead of the other way around.


Cheers,
Longman



Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Waiman Long

On 4/15/21 2:10 PM, Johannes Weiner wrote:

On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:

On 4/15/21 12:30 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function call goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
refill_obj_stock(objcg, size);
   }
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx)

The optimization makes sense.

But please don't combine independent operations like this into a
single function. It makes for an unclear parameter list, it's a pain
in the behind to change the constituent operations later on, and it
has a habit of attracting more random bools over time. E.g. what if
the caller already has irqs disabled? What if it KNOWS that irqs are
enabled and it could use local_irq_disable() instead of save?

Just provide an __obj_cgroup_uncharge() that assumes irqs are
disabled, combine with the existing __mod_memcg_lruvec_state(), and
bubble the irq handling up to those callsites which know better.


That will also work. However, the reason I did that was because of patch 5
in the series. I could put the get_obj_stock() and put_obj_stock() code in
slab.h and allow them to be used directly in various places, but hiding it in
one function is easier.

Yeah it's more obvious after getting to patch 5.

But with the irq disabling gone entirely, is there still an incentive
to combine the atomic section at all? Disabling preemption is pretty
cheap, so it wouldn't matter to just do it twice.

I.e. couldn't the final sequence in slab code simply be

objcg_uncharge()
mod_objcg_state()

again and each function disables preemption (and in the rare case
irqs) as it sees fit?

You lose the irqsoff batching in the cold path, but as you say, hit
rates are pretty good, and it doesn't seem worth complicating the code
for the cold path.
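
For illustration, the free-path sequence being suggested would then be
roughly (a sketch using the API of this series, not the final code):

/*
 * Sketch only: each helper does its own preemption/irq handling
 * internally, so the caller does not need an irqsave section at all.
 */
static void memcg_slab_free_one(struct obj_cgroup *objcg,
				struct pglist_data *pgdat,
				enum node_stat_item idx, size_t size)
{
	obj_cgroup_uncharge(objcg, size);
	mod_objcg_state(objcg, pgdat, idx, -(int)size);
}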

That does make sense, though a little bit of performance may be lost. I 
will try that out to see how it works out performance-wise.


Cheers,
Longman



Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access

2021-04-15 Thread Waiman Long

On 4/15/21 1:53 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupts disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 

This makes sense, and also explains the previous patch a bit
better. But please merge those two.
The reason I broke it into two is so that the patches are individually 
easier to review. I prefer to update the commit log of patch 4 to 
explain why the obj_stock structure is introduced instead of merging the 
two.



@@ -2229,7 +2229,8 @@ struct obj_stock {
  struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
-   struct obj_stock obj;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;
  
  	struct work_struct work;

unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
  }
  #endif
  
+/*

+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
  static inline struct obj_stock *current_obj_stock(void)
  {
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

-	return &stock->obj;

+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)   \
+({ \
+   struct memcg_stock_pcp *stock;  \
+   struct obj_stock *obj_stock;\
+   \
+   if (in_task()) {\
+   preempt_disable();  \
+   (flags) = -1L;  \
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+   } else {\
+   local_irq_save(flags);  \
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+   }   \
+   obj_stock;  \
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (flags == -1L)
+   preempt_enable();
+   else
+   local_irq_restore(flags);
  }

Please make them both functions and use 'unsigned long *flags'.

Sure, I can do that.


Also I'm not sure doing in_task() twice would actually be more
expensive than the == -1 special case, and easier to understand.

I can make that change too. Either way is fine with me.
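
For reference, the in_task()-based variant being discussed is something
like (sketch only):

/*
 * Sketch: key the unlock path off in_task() instead of a flags == -1L
 * sentinel, so the task-context branch never has to encode anything
 * special into flags.
 */
static inline void put_obj_stock(unsigned long flags)
{
	if (in_task())
		preempt_enable();
	else
		local_irq_restore(flags);
}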



@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
  
 	stock = this_cpu_ptr(&memcg_stock);

-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
  
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,

memcg = obj_cgroup_memcg(objcg);
if (pgdat)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();

This is actually a bug introduced in the earlier patch, isn't it?
Calling __mod_memcg_lruvec_state() without irqs disabled...

Not really, in patch 3, 

Re: [PATCH v2] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Waiman Long

On 4/15/21 1:27 PM, Ali Saidi wrote:

While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock.  The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange is completed
successfully which isn’t ordered. This exposes the window between the
acquire and the cmpxchg to an A-B-A problem which allows reads following
the lock acquisition to observe values speculatively before the write
lock is truly acquired.

We've seen a problem in epoll where the reader does a xchg while
holding the read lock, but the writer can see a value change out from under it.

Writer   | Reader 2

ep_scan_ready_list() |
|- write_lock_irq()  |
 |- queued_write_lock_slowpath()  |
   |- atomic_cond_read_acquire()  |
  | read_lock_irqsave(&ep->lock, flags);
--> (observes value before unlock)|  chain_epi_lockless()
| |epi->next = xchg(&ep->ovflist, epi);
| | read_unlock_irqrestore(&ep->lock, flags);
| |
| atomic_cmpxchg_relaxed()|
|-- READ_ONCE(ep->ovflist);   |

A core can order the read of the ovflist ahead of the
atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire semantics
addresses this issue at which point the atomic_cond_read can be switched
to use relaxed semantics.

Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning 
in qrwlock")
Signed-off-by: Ali Saidi 
Cc: sta...@vger.kernel.org
Acked-by: Will Deacon 
Tested-by: Steve Capper 
Reviewed-by: Steve Capper 

---
  kernel/locking/qrwlock.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 4786dd271b45..10770f6ac4d9 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
  
  	/* When no more readers or writers, set the locked flag */

do {
-		atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
-	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
+		atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+	} while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
					_QW_LOCKED) != _QW_WAITING);
 unlock:
 	arch_spin_unlock(&lock->wait_lock);


Acked-by: Waiman Long 



Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-15 Thread Waiman Long

On 4/15/21 1:10 PM, Matthew Wilcox wrote:

On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:

With memory accounting disabled, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

   Applied patches   Run time   Accounting overhead   Overhead %age
   ---      ---   -
None  10.800s 7.952s  100.0%
 1-2   9.140s 6.292s   79.1%
 1-3   7.641s 4.793s   60.3%
 1-5   6.801s 3.953s   49.7%

I think this is a misleading way to report the overhead.  I would have said:

10.800s 7.952s  279.2%
 9.140s 6.292s  220.9%
 7.641s 4.793s  168.3%
 6.801s 3.953s  138.8%

What I want to emphasize is the reduction in the accounting overhead 
part of the execution time. Your percentages use the accounting-disabled 
run time as the denominator. I think both are valid; I will make that 
clearer in my version of the patch.
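
Spelling out one row to make the two conventions concrete (patches 1-5
applied): the accounting overhead is 6.801s - 2.848s = 3.953s. Divided by
the unpatched overhead of 7.952s, that gives 3.953 / 7.952 = about 49.7%
(the table in the cover letter); divided by the accounting-disabled run
time, it gives 3.953 / 2.848 = about 138.8% (the figures quoted above).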


Thanks,
Longman



Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-15 Thread Waiman Long

On 4/15/21 12:50 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:

Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reducing it, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Right, but mod_objcg_state() is itself already percpu-cached. What's
the benefit of avoiding calls to it with another percpu cache?

There are actually two sets of vmstat data that have to be updated. One is 
associated with the memcg and the other one is for each lruvec within the 
cgroup. By caching them in obj_stock, we replace two writes to two colder 
cachelines with one write to a hot cacheline. If you look at patch 5, I 
break obj_stock into two - one for task context and one for irq context. 
Disabling interrupts is no longer needed in task context, but that is not 
possible when writing to the actual vmstat data arrays.


Cheers,
Longman



Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-15 Thread Waiman Long

On 4/15/21 12:40 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:

The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

The new __mod_memcg_lruvec_state() function will be used in the next
patch as a replacement of mod_memcg_state() in mm/percpu.c for the
consolidation of the memory uncharge and vmstat update functions in
the kmem_cache_free() path.

This requires users who want both to pass a pgdat that can be derived
from the lruvec. This is error prone, and we just acked a patch that
removes this very thing from mem_cgroup_page_lruvec().

With the suggestion for patch 2, this shouldn't be necessary anymore,
though. And sort of underlines my point around that combined function
creating akwward code above and below it.

The reason for passing in the pgdat is the caching of vmstat data. The 
lruvec may be gone if the corresponding memory cgroup is removed, but the 
pgdat should stay put. That is why I put the pgdat in the obj_stock for 
caching. I could also store the node id instead of the pgdat.
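
For example, caching the node id instead would look roughly like this
(sketch only, hypothetical field and helper names):

/*
 * Sketch: cache the node id in the obj_stock and convert it back to a
 * pgdat with NODE_DATA() only when the cached vmstat delta is flushed.
 */
struct obj_stock_alt {
	struct obj_cgroup *cached_objcg;
	int cached_nid;			/* instead of struct pglist_data * */
	int vmstat_idx;
	int vmstat_bytes;
};

static void flush_cached_vmstat(struct obj_stock_alt *stock)
{
	if (stock->vmstat_bytes) {
		mod_objcg_state(stock->cached_objcg,
				NODE_DATA(stock->cached_nid),
				stock->vmstat_idx, stock->vmstat_bytes);
		stock->vmstat_bytes = 0;
	}
}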


Cheers,
Longman



Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Waiman Long

On 4/15/21 12:45 PM, Will Deacon wrote:



With that in mind, it would probably be a good idea to eyeball the qspinlock
slowpath as well, as that uses both atomic_cond_read_acquire() and
atomic_try_cmpxchg_relaxed().

It seems plausible that the same thing could occur here in qspinlock:
   if ((val & _Q_TAIL_MASK) == tail) {
   if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
   goto release; /* No contention */
   }

Just been thinking about this, but I don't see an issue here because
everybody is queuing the same way (i.e. we don't have a mechanism to jump
the queue like we do for qrwlock) and the tail portion of the lock word
isn't susceptible to ABA. That is, once we're at the head of the queue
and we've seen the lock become unlocked via atomic_cond_read_acquire(),
then we know we hold it.

So qspinlock looks fine to me, but I'd obviously value anybody else's
opinion on that.


I agree with your assessment of qspinlock. I think qspinlock is fine.

Cheers,
Longman



Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Waiman Long

On 4/15/21 12:30 PM, Johannes Weiner wrote:

On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:

In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
refill_obj_stock(objcg, size);
  }
  
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,

+  struct pglist_data *pgdat, int idx)

The optimization makes sense.

But please don't combine independent operations like this into a
single function. It makes for an unclear parameter list, it's a pain
in the behind to change the constituent operations later on, and it
has a habit of attracting more random bools over time. E.g. what if
the caller already has irqs disabled? What if it KNOWS that irqs are
enabled and it could use local_irq_disable() instead of save?

Just provide an __obj_cgroup_uncharge() that assumes irqs are
disabled, combine with the existing __mod_memcg_lruvec_state(), and
bubble the irq handling up to those callsites which know better.
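
Concretely, the split being suggested would look roughly like this
(sketch; __obj_cgroup_uncharge() is the hypothetical irqs-off primitive
being proposed, not code from this series):

/*
 * Sketch of the suggested shape: the caller owns the irq handling and
 * only calls primitives that assume irqs are already disabled.
 */
static void memcg_slab_free_hook_one(struct obj_cgroup *objcg,
				     struct pglist_data *pgdat,
				     enum node_stat_item idx, size_t size)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;
	unsigned long flags;

	local_irq_save(flags);
	__obj_cgroup_uncharge(objcg, size);	/* assumes irqs disabled */

	rcu_read_lock();
	memcg = obj_cgroup_memcg(objcg);
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
	rcu_read_unlock();

	local_irq_restore(flags);
}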

That will also work. However, the reason I did it that way was patch 
5 in the series. I could put the get_obj_stock() and put_obj_stock() 
code in slab.h and allow them to be used directly in various places, 
but hiding them in one function is easier.


Anyway, I can change the patch if you think that is the right way.

Cheers,
Longman



Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Waiman Long

On 4/15/21 10:25 AM, Ali Saidi wrote:

While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock.  The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange is completed
successfully which isn’t ordered. The other atomic operations from this
point are release-ordered and thus reads after the lock acquisition can
be completed before the lock is truly acquired which violates the
guarantees the lock should be making.

Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning 
in qrwlock")
Signed-off-by: Ali Saidi 
Cc: sta...@vger.kernel.org
---
  kernel/locking/qrwlock.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 4786dd271b45..10770f6ac4d9 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
  
  	/* When no more readers or writers, set the locked flag */

do {
-		atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
-	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
+		atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+	} while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
					_QW_LOCKED) != _QW_WAITING);
 unlock:
 	arch_spin_unlock(&lock->wait_lock);


I think the original code isn't wrong. The read_acquire provides the 
acquire barrier for cmpxchg. Because of conditional dependency, the 
wait_lock unlock won't happen until the cmpxchg succeeds. Without doing 
a read_acquire, there may be a higher likelihood that the cmpxchg may fail.


Anyway, I will let Will or Peter chime in on this as I am not as 
proficient as them on this topic.


Cheers,
Longman




Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-15 Thread Waiman Long

On 4/14/21 11:26 PM, Masayoshi Mizuma wrote:


Hi Longman,

Thank you for your patches.
I reran the benchmark with your patches; it seems that the reduction
is small... The total durations of the sendto() and recvfrom() system calls
during the benchmark are as follows.

- sendto
   - v5.8 vanilla:  2576.056 msec (100%)
   - v5.12-rc7 vanilla: 2988.911 msec (116%)
   - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)

- recvfrom
   - v5.8 vanilla:  2113.156 msec (100%)
   - v5.12-rc7 vanilla: 2305.810 msec (109%)
   - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)

kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times during
the benchmark. I ran a loop in a kernel module as following. The duration
is reduced by your patches actually.

   ---
   dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
   for (i = 0; i < 140; i++) {
p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
kmem_cache_free(dummy_cache, p);
   }
   ---

- v5.12-rc7 vanilla: 110 msec (100%)
- v5.12-rc7 with your patches (1-5):  85 msec (77%)

It seems that the reduction is small for the benchmark though...
Anyway, I can see your patches reduce the overhead.
Please feel free to add:

Tested-by: Masayoshi Mizuma 

Thanks!
Masa


Thanks for the testing.

I was focusing on your kernel module benchmark in testing my patch. I 
will try out your pgbench benchmark to see if there can be other tuning 
that can be done.


BTW, how many NUMA nodes does your test machine have? I did my testing with a 
2-socket system. The vmstat caching part may be less effective on 
systems with more NUMA nodes. I will try to find a larger 4-socket 
system for testing.


Cheers,
Longman



Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-14 Thread Waiman Long

On 4/14/21 1:26 PM, Andi Kleen wrote:

The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
result, I think a module parameter will be no different from a kernel
command line parameter in this regard.

You can still change it in /sys at runtime, even if it's in the vmlinux.


I see, thank for the clarification.

Cheers,
Longman



Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-14 Thread Waiman Long

On 4/13/21 5:01 PM, Alex Kogan wrote:

Hi, Andi.

Thanks for your comments!


On Apr 13, 2021, at 2:03 AM, Andi Kleen  wrote:

Alex Kogan  writes:

+   numa_spinlock_threshold=[NUMA, PV_OPS]
+   Set the time threshold in milliseconds for the
+   number of intra-node lock hand-offs before the
+   NUMA-aware spinlock is forced to be passed to
+   a thread on another NUMA node.  Valid values
+   are in the [1..100] range. Smaller values result
+   in a more fair, but less performant spinlock,
+   and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely
at some point of spinning you can afford a ktime_get? But ok.

We are reading time when we are at the head of the (main) queue, but
don’t have the lock yet. Not sure about the latency of ktime_get(), but
anything reasonably fast but not necessarily precise should work.
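
As a sketch of what a cheap, coarse check could look like (illustration
only, not the actual CNA patch; numa_spinlock_threshold_ms here is a
hypothetical knob):

#include <linux/jiffies.h>

static unsigned int numa_spinlock_threshold_ms = 10;

/* Deadline computed once when this CPU reaches the head of the queue. */
static inline unsigned long cna_handoff_deadline(void)
{
	return jiffies + msecs_to_jiffies(numa_spinlock_threshold_ms);
}

/* Reading jiffies is just a memory load updated by the timer tick. */
static inline bool cna_force_cross_node_handoff(unsigned long deadline)
{
	return time_after(jiffies, deadline);
}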


Could you turn that into a moduleparm which can be changed at runtime?
Would be strange to have to reboot just to play with this parameter

Yes, good suggestion, thanks.


This would also make the code a lot shorter I guess.

So you don’t think we need the command-line parameter, just the module_param?


The CNA code, if enabled, will be in vmlinux, not in a kernel module. As 
a result, I think a module parameter will be no different from a kernel 
command line parameter in this regard.


Cheers,
Longman




[PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct

2021-04-13 Thread Waiman Long
The object stock data stored in struct memcg_stock_pcp are independent
of the other page-based data stored there. Separate them out into
their own struct to highlight that independence.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-   struct mem_cgroup *cached; /* this never be root cgroup */
-   unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
int vmstat_idx;
int vmstat_bytes;
+#else
+   int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+   struct mem_cgroup *cached; /* this never be root cgroup */
+   unsigned int nr_pages;
+   struct obj_stock obj;
 
struct work_struct work;
unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, 
memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-   struct memcg_stock_pcp *stock;
+   struct obj_stock *stock;
unsigned long flags;
bool ret = false;
 
local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+   stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 {
struct mem_cgroup *memcg;
 
-   if (stock->cached_objcg) {
-   memcg = obj_cgroup_memcg(stock->cached_objcg);
+   if (stock->obj.cached_objcg) {
+   memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
return true;
}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-   struct memcg_stock_pcp *stock;
+   struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
drain_obj_stock(stock);
obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+   struct obj_stock *stock = current_obj_stock();
 
if (stock->cached_objcg != objcg) {
/* Output the current data as is */
-- 
2.18.1



[PATCH v3 5/5] mm/memcg: Optimize user context object stock access

2021-04-13 Thread Waiman Long
Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupts disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 74 +++--
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..8875e896e52b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
-   struct obj_stock obj;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;
 
struct work_struct work;
unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)   \
+({ \
+   struct memcg_stock_pcp *stock;  \
+   struct obj_stock *obj_stock;\
+   \
+   if (in_task()) {\
+   preempt_disable();  \
+   (flags) = -1L;  \
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+	} else {					\
+		local_irq_save(flags);			\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+   }   \
+   obj_stock;  \
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (flags == -1L)
+   preempt_enable();
+   else
+   local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
memcg = obj_cgroup_memcg(objcg);
if (pgdat)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();
 }
 
@@ -3193,15 +3233,14 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
unsigned long flags;
bool ret = false;
 
-   local_irq_save(flags);
+   stock = get_obj_stock(flags);
 
-   stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
}
 
-   local_irq_restore(flags);
+   put_obj_stock(flags);
 
return ret;
 }
@@ -3254,8 +3293,13 @@ static bool o

[PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-13 Thread Waiman Long
The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

The new __mod_memcg_lruvec_state() function will be used in the next
patch as a replacement of mod_memcg_state() in mm/percpu.c for the
consolidation of the memory uncharge and vmstat update functions in
the kmem_cache_free() path.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
---
 include/linux/memcontrol.h | 12 +++-
 mm/memcontrol.c| 19 +--
 mm/slab.h  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct 
lruvec *lruvec,
return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum 
node_stat_item idx,
local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+ struct lruvec *lruvec,
  enum node_stat_item idx, int val)
 {
unsigned long flags;
 
local_irq_save(flags);
-   __mod_memcg_lruvec_state(lruvec, idx, val);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, val);
local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long 
lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+   struct lruvec *lruvec,
enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ enum node_stat_item idx, int val)
 {
struct mem_cgroup_per_node *pn;
-   struct mem_cgroup *memcg;
long x, threshold = MEMCG_CHARGE_BATCH;
 
+   /* Update lruvec */
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-   memcg = pn->memcg;
+
+   if (!memcg)
+   memcg = pn->memcg;
 
/* Update memcg */
__mod_memcg_state(memcg, idx, val);
 
-   /* Update lruvec */
+   if (!lruvec)
+   return;
+
__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum 
node_stat_item idx,
 
/* Update memcg and lruvec */
if (!mem_cgroup_disabled())
-   __mod_memcg_lruvec_state(lruvec, idx, val);
+   __mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();
 }
 
-- 
2.18.1



[PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-13 Thread Waiman Long
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reduce it, this patch
caches the vmstat data in the memcg_stock_pcp structure as well, until
it accumulates a page size worth of updates or other cached data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 78 +++--
 mm/slab.h   | 26 +++--
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int vmstat_idx;
+   int vmstat_bytes;
 #endif
 
struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+  struct pglist_data *pgdat,
+  enum node_stat_item idx, int nr)
+{
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   if (pgdat)
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp 
*stock)
stock->nr_bytes = 0;
}
 
+   if (stock->vmstat_bytes) {
+   mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+   stock->vmstat_bytes);
+   stock->vmstat_bytes = 0;
+   stock->vmstat_idx = 0;
+   stock->cached_pgdat = NULL;
+   }
+
obj_cgroup_put(old);
stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+   if (stock->cached_objcg != objcg) {
+   /* Output the current data as is */
+   } else if (!stock->vmstat_bytes) {
+   /* Save the current data */
+   stock->vmstat_bytes = nr;
+   stock->vmstat_idx = idx;
+   stock->cached_pgdat = pgdat;
+   nr = 0;
+   } else if ((stock->cached_pgdat != pgdat) ||
+  (stock->vmstat_idx != idx)) {
+   /* Output the cached data & save the current data */
+   swap(nr, stock->vmstat_bytes);
+   swap(idx, stock->vmstat_idx);
+   swap(pgdat, stock->cached_pgdat);
+   } else {
+   stock->vmstat_bytes += nr;
+   if (abs(nr) > PAGE_SIZE) {
+   nr = stock->vmstat_bytes;
+   stock->vmstat_bytes = 0;
+   } else {
+   nr = 0;
+   }
+   }
+   if (nr)
+   mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+int idx, int nr)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __mod_obj_stock_state(objcg, pgdat, idx, nr);
+   local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup 
*objcg, size_t size,
   struct pglist_data *pgdat, int idx)
 {
unsigned long flags;
-   struct mem_cgroup 

[PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-13 Thread Waiman Long
In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c| 31 +++
 mm/percpu.c|  9 ++---
 mm/slab.h  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
-   unsigned long flags;
-
-   local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 
if (stock->nr_bytes > PAGE_SIZE)
drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+   unsigned long flags;
 
+   local_irq_save(flags);
+   __refill_obj_stock(objcg, nr_bytes);
local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx)
+{
+   unsigned long flags;
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   local_irq_save(flags);
+   __refill_obj_stock(objcg, size);
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   if (pgdat)
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+   rcu_read_unlock();
+   local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk 
*chunk, int off, size_t size)
objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-   obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-   rcu_read_lock();
-   mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-   -(size * num_possible_cpus()));
-   rcu_read_unlock();
-
+   obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+ NULL, MEMCG_PERCPU_B);
obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache 
*s_orig,
continue;
 
objcgs[off] = NULL;
-   obj_cgroup_uncharge(objcg, obj_full_size(s));
-   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-   -obj_full_size(s));
+   obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+ page_pgdat(page),
+ cache_vmstat_idx(s));
obj_cgroup_put(objcg);
}
 }
-- 
2.18.1



[PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-13 Thread Waiman Long
 v3:
  - Add missing "inline" qualifier to the alternate mod_obj_stock_state()
in patch 3.
  - Remove redundant current_obj_stock() call in patch 5.

 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that require a lot of kmemcache allocations and de-allocations
may experience performance regressions, as illustrated in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module
init time is used for benchmarking. The test was run on a Cascade Lake
server with turbo-boosting disabled to reduce run-to-run variation.

With memory accounting disabled, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---      ---   -
   None  10.800s 7.952s  100.0%
1-2   9.140s 6.292s   79.1%
1-3   7.641s 4.793s   60.3%
1-5   6.801s 3.953s   49.7%

Note that this is the best case scenario where most updates happen only
to the percpu stocks. Real workloads will likely have a certain amount
of updates to the memcg charges and vmstats. So the performance benefit
will be less.

It was found that a big part of the memory accounting overhead
was caused by the local_irq_save()/local_irq_restore() sequences in
updating the local stock charge bytes and vmstat arrays, at least on x86
systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible. In fact, it eliminates them in the common
case. Another part of this patchset caches the vmstat data updates in
the local stock as well, which also helps.
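
The common-case saving can be illustrated by contrasting the two access
patterns (sketch only, not the patch code):

/* Before: every percpu stock access pays an irq save/restore pair. */
static void stock_update_irq(void)
{
	unsigned long flags;

	local_irq_save(flags);
	/* ... update the percpu obj_stock ... */
	local_irq_restore(flags);
}

/* After (task context only): disabling preemption is sufficient. */
static void stock_update_task(void)
{
	preempt_disable();
	/* ... update the task-context percpu obj_stock ... */
	preempt_enable();
}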

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c| 199 -
 mm/percpu.c|   9 +-
 mm/slab.h  |  32 +++---
 4 files changed, 196 insertions(+), 58 deletions(-)

-- 
2.18.1



Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access

2021-04-12 Thread Waiman Long

On 4/12/21 7:10 PM, Shakeel Butt wrote:

On Mon, Apr 12, 2021 at 3:55 PM Waiman Long  wrote:

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupts disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
---
  mm/memcontrol.c | 73 +++--
  1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..29f2df76644a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
  struct memcg_stock_pcp {
 struct mem_cgroup *cached; /* this never be root cgroup */
 unsigned int nr_pages;
-   struct obj_stock obj;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;

 struct work_struct work;
 unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
  }
  #endif

+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
  static inline struct obj_stock *current_obj_stock(void)
  {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)   \
+({ \
+   struct memcg_stock_pcp *stock;  \
+   struct obj_stock *obj_stock;\
+   \
+   if (in_task()) {\
+   preempt_disable();  \
+   (flags) = -1L;  \
+		stock = this_cpu_ptr(&memcg_stock);	\

The above line was missing in the previous version.


+		obj_stock = &stock->task_obj;		\
+   } else {\
+   local_irq_save(flags);  \
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+   }   \
+   obj_stock;  \
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (flags == -1L)
+   preempt_enable();
+   else
+   local_irq_restore(flags);
  }

  /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
 local_irq_save(flags);

 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
 memcg = obj_cgroup_memcg(objcg);
 if (pgdat)
 lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 rcu_read_unlock();
  }

@@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 unsigned long flags;
 bool ret = false;

-   local_irq_save(flags);
+   stock = get_obj_stock(flags);

 stock = current_obj_stock();

The above is redundant.


Right. I should check the patch carefully. Will remove it.

Thanks,
Longman




[PATCH v2 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-12 Thread Waiman Long
 v2:
  - Fix bug found by test robot in patch 5.
  - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

Workloads that require a lot of kmemcache allocations and de-allocations
may experience performance regressions, as illustrated in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module
init time is used for benchmarking. The test was run on a Cascade Lake
server with turbo-boosting disabled to reduce run-to-run variation.

With memory accounting disabled, the run time was 2.848s. With memory
accounting enabled, the run times with the application of various
patches in the patchset were:

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---      ---   -
   None  10.800s 7.952s  100.0%
1-2   9.140s 6.292s   79.1%
1-3   7.641s 4.793s   60.3%
1-5   6.801s 3.953s   49.7%

Note that this is the best case scenario where most updates happen only
to the percpu stocks. Real workloads will likely have a certain amount
of updates to the memcg charges and vmstats. So the performance benefit
will be less.

It was found that a big part of the memory accounting overhead
was caused by the local_irq_save()/local_irq_restore() sequences in
updating the local stock charge bytes and vmstat arrays, at least on x86
systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible. In fact, it eliminates them in the common
case. Another part of this patchset caches the vmstat data updates in
the local stock as well, which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c| 200 -
 mm/percpu.c|   9 +-
 mm/slab.h  |  32 +++---
 4 files changed, 197 insertions(+), 58 deletions(-)

-- 
2.18.1



[PATCH v2 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-12 Thread Waiman Long
The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

The new __mod_memcg_lruvec_state() function will be used in the next
patch as a replacement of mod_memcg_state() in mm/percpu.c for the
consolidation of the memory uncharge and vmstat update functions in
the kmem_cache_free() path.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
---
 include/linux/memcontrol.h | 12 +++-
 mm/memcontrol.c| 19 +--
 mm/slab.h  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct 
lruvec *lruvec,
return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum 
node_stat_item idx,
local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+ struct lruvec *lruvec,
  enum node_stat_item idx, int val)
 {
unsigned long flags;
 
local_irq_save(flags);
-   __mod_memcg_lruvec_state(lruvec, idx, val);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, val);
local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long 
lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+   struct lruvec *lruvec,
enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ enum node_stat_item idx, int val)
 {
struct mem_cgroup_per_node *pn;
-   struct mem_cgroup *memcg;
long x, threshold = MEMCG_CHARGE_BATCH;
 
+   /* Update lruvec */
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-   memcg = pn->memcg;
+
+   if (!memcg)
+   memcg = pn->memcg;
 
/* Update memcg */
__mod_memcg_state(memcg, idx, val);
 
-   /* Update lruvec */
+   if (!lruvec)
+   return;
+
__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum 
node_stat_item idx,
 
/* Update memcg and lruvec */
if (!mem_cgroup_disabled())
-   __mod_memcg_lruvec_state(lruvec, idx, val);
+   __mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();
 }
 
-- 
2.18.1



[PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct

2021-04-12 Thread Waiman Long
The object stock data stored in struct memcg_stock_pcp are independent
of the other page-based data stored there. Separate them out into
their own struct to highlight that independence.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
---
 mm/memcontrol.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-   struct mem_cgroup *cached; /* this never be root cgroup */
-   unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
int vmstat_idx;
int vmstat_bytes;
+#else
+   int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+   struct mem_cgroup *cached; /* this never be root cgroup */
+   unsigned int nr_pages;
+   struct obj_stock obj;
 
struct work_struct work;
unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, 
memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+   return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
-   drain_obj_stock(stock);
+   drain_obj_stock(&stock->obj);
	drain_stock(stock);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-   struct memcg_stock_pcp *stock;
+   struct obj_stock *stock;
unsigned long flags;
bool ret = false;
 
local_irq_save(flags);
 
-   stock = this_cpu_ptr(&memcg_stock);
+   stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 {
struct mem_cgroup *memcg;
 
-   if (stock->cached_objcg) {
-   memcg = obj_cgroup_memcg(stock->cached_objcg);
+   if (stock->obj.cached_objcg) {
+   memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
return true;
}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-   struct memcg_stock_pcp *stock;
+   struct obj_stock *stock = current_obj_stock();
 
-   stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
drain_obj_stock(stock);
obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
  struct pglist_data *pgdat, int idx, int nr)
 {
-   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+   struct obj_stock *stock = current_obj_stock();
 
if (stock->cached_objcg != objcg) {
/* Output the current data as is */
-- 
2.18.1



[PATCH v2 5/5] mm/memcg: Optimize user context object stock access

2021-04-12 Thread Waiman Long
Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption,
which is cheap in a non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupts. User context code can
access the interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupts disabled.

The downside of this change is that there is more data stored in the local
object stocks that is not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
Acked-by: Roman Gushchin 
---
 mm/memcontrol.c | 73 +++--
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..29f2df76644a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
-   struct obj_stock obj;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;
 
struct work_struct work;
unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-   return &stock->obj;
+   return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)   \
+({ \
+   struct memcg_stock_pcp *stock;  \
+   struct obj_stock *obj_stock;\
+   \
+   if (in_task()) {\
+   preempt_disable();  \
+   (flags) = -1L;  \
+   stock = this_cpu_ptr(&memcg_stock); \
+   obj_stock = &stock->task_obj;   \
+   } else {\
+   local_irq_save(flags);  \
+   stock = this_cpu_ptr(&memcg_stock); \
+   obj_stock = &stock->irq_obj;\
+   }   \
+   obj_stock;  \
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (flags == -1L)
+   preempt_enable();
+   else
+   local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
-   drain_obj_stock(&stock->obj);
+   drain_obj_stock(&stock->irq_obj);
+   if (in_task())
+   drain_obj_stock(&stock->task_obj);
	drain_stock(stock);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
memcg = obj_cgroup_memcg(objcg);
if (pgdat)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();
 }
 
@@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
unsigned long flags;
bool ret = false;
 
-   local_irq_save(flags);
+   stock = get_obj_stock(flags);
 
stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
ret = true;
}
 
-   local_irq_restore(flags);
+   put_obj_stock(flags);
 
return ret;
 }
@@ -3254,8 +329

[PATCH v2 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-12 Thread Waiman Long
In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c| 31 +++
 mm/percpu.c|  9 ++---
 mm/slab.h  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
-   unsigned long flags;
-
-   local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 
if (stock->nr_bytes > PAGE_SIZE)
drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+   unsigned long flags;
 
+   local_irq_save(flags);
+   __refill_obj_stock(objcg, nr_bytes);
local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx)
+{
+   unsigned long flags;
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   local_irq_save(flags);
+   __refill_obj_stock(objcg, size);
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   if (pgdat)
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+   rcu_read_unlock();
+   local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk 
*chunk, int off, size_t size)
objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-   obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-   rcu_read_lock();
-   mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-   -(size * num_possible_cpus()));
-   rcu_read_unlock();
-
+   obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+ NULL, MEMCG_PERCPU_B);
obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache 
*s_orig,
continue;
 
objcgs[off] = NULL;
-   obj_cgroup_uncharge(objcg, obj_full_size(s));
-   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-   -obj_full_size(s));
+   obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+ page_pgdat(page),
+ cache_vmstat_idx(s));
obj_cgroup_put(objcg);
}
 }
-- 
2.18.1



[PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-12 Thread Waiman Long
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To reduce it further, this patch
caches the vmstat data in the memcg_stock_pcp structure as well,
until it accumulates a page size worth of updates or until other cached
data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 78 +++--
 mm/slab.h   | 26 +++--
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int vmstat_idx;
+   int vmstat_bytes;
 #endif
 
struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+  struct pglist_data *pgdat,
+  enum node_stat_item idx, int nr)
+{
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   if (pgdat)
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp 
*stock)
stock->nr_bytes = 0;
}
 
+   if (stock->vmstat_bytes) {
+   mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+   stock->vmstat_bytes);
+   stock->vmstat_bytes = 0;
+   stock->vmstat_idx = 0;
+   stock->cached_pgdat = NULL;
+   }
+
obj_cgroup_put(old);
stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat, int idx, int nr)
+{
+   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+   if (stock->cached_objcg != objcg) {
+   /* Output the current data as is */
+   } else if (!stock->vmstat_bytes) {
+   /* Save the current data */
+   stock->vmstat_bytes = nr;
+   stock->vmstat_idx = idx;
+   stock->cached_pgdat = pgdat;
+   nr = 0;
+   } else if ((stock->cached_pgdat != pgdat) ||
+  (stock->vmstat_idx != idx)) {
+   /* Output the cached data & save the current data */
+   swap(nr, stock->vmstat_bytes);
+   swap(idx, stock->vmstat_idx);
+   swap(pgdat, stock->cached_pgdat);
+   } else {
+   stock->vmstat_bytes += nr;
+   if (abs(nr) > PAGE_SIZE) {
+   nr = stock->vmstat_bytes;
+   stock->vmstat_bytes = 0;
+   } else {
+   nr = 0;
+   }
+   }
+   if (nr)
+   mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+int idx, int nr)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __mod_obj_stock_state(objcg, pgdat, idx, nr);
+   local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup 
*objcg, size_t size,
   struct pglist_data *pgdat, int idx)
 {
unsigned long flags;
-   struct mem_cgroup *memcg;
-   struct lruvec *lruvec = N

Re: [PATCH 5/5] mm/memcg: Optimize user context object stock access

2021-04-12 Thread Waiman Long

On 4/12/21 2:55 PM, Roman Gushchin wrote:

On Fri, Apr 09, 2021 at 07:18:42PM -0400, Waiman Long wrote:

Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured amount of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption
which is cheap in non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupt. User context code can
access interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupted disabled.

The downside of this change is that there are more data stored in local
object stocks and not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

I agree, the extra memory space is not a significant concern.
I'd be more worried about the code complexity, but the result looks
nice to me!

Acked-by: Roman Gushchin 

Btw, it seems that the mm tree ran a bit off, so I had to apply this series
on top of Linus's tree to review. Please, rebase.


This patchset is based on the code in Linus' tree. I had applied the 
patchset to linux-next to see if there were any conflicts. Two of the 
patches had minor fuzz around the edges but no actual merge conflicts 
for now.


Cheers,
Longman




Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-12 Thread Waiman Long

On 4/12/21 3:05 PM, Roman Gushchin wrote:

On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1].

With a simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
init. The execution time to load the kernel module with and without
memory accounting were:

   with accounting = 6.798s
   w/o  accounting = 1.758s

That is an increase of 5.04s (287%). With this patchset applied, the
execution time became 4.254s. So the memory accounting overhead is now
2.496s which is a 50% reduction.

Btw, there were two recent independent report about benchmark results
regression caused by the introduction of the per-object accounting:
1) Xing reported a hackbench regression:
https://lkml.org/lkml/2021/1/13/1277
2) Masayoshi reported a pgbench regression:
https://www.spinics.net/lists/linux-mm/msg252540.html

I wonder if you can run them (or at least one) and attach the result
to the series? It would be very helpful.


Actually, it was a bug report filed by Masayoshi-san that triggered me 
to work on reducing the memory accounting overhead. He is also on the cc 
line and so is aware of that. I will cc Xing in my v2 patch.


Cheers,
Longman




Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-12 Thread Waiman Long

On 4/12/21 2:22 PM, Roman Gushchin wrote:

On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:

Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reducing it, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change.

The idea makes total sense to me and also gives a hope to remove
byte-sized vmstats in the long-term.


On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long 
---
  mm/memcontrol.c | 78 +++--
  mm/slab.h   | 26 +++--
  2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
  
  #ifdef CONFIG_MEMCG_KMEM

struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int vmstat_idx;
+   int vmstat_bytes;
  #endif

Because vmstat_idx can realistically take only 3 values (slab_reclaimable,
slab_unreclaimable and percpu), I wonder if it's better to have
vmstat_bytes[3] and save a bit more on the reduced number of flushes?
It must be an often case when a complex (reclaimable) kernel object has
non-reclaimable parts (e.g. kmallocs) or percpu counters.
If the difference will be too small, maybe the current form is better.


I have thought about that too. However, that will make the code more 
complex. So I decided to cache just one for now. We can certainly play 
around with caching more in a later patch.


Cheers,
Longman




Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-12 Thread Waiman Long

On 4/12/21 2:04 PM, Roman Gushchin wrote:

On Fri, Apr 09, 2021 at 07:18:38PM -0400, Waiman Long wrote:

The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

This patch seems to be correct, but it's a bit hard to understand why
it's required without looking into the rest of the series. Can you, please,
add a couple of words about it? E.g. we need it to handle stats which do not
exist on the lruvec level...

Otherwise,
Acked-by: Roman Gushchin 


Good point. Will update the commit log to indicate the change is needed 
for the subsequent patch.


Cheers,
Longman




Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-12 Thread Waiman Long

On 4/12/21 1:47 PM, Roman Gushchin wrote:

On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:

On 4/9/21 9:51 PM, Roman Gushchin wrote:

On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1].

With a simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
init. The execution time to load the kernel module with and without
memory accounting were:

with accounting = 6.798s
w/o  accounting = 1.758s

That is an increase of 5.04s (287%). With this patchset applied, the
execution time became 4.254s. So the memory accounting overhead is now
2.496s which is a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config included lockdep or not?

The test kernel is based on a production kernel config and so lockdep isn't
enabled.

2) do you have a (rough) estimation how much each change contributes
 to the overall reduction?

I should have a better breakdown of the effect of individual patches. I
rerun the benchmarking module with turbo-boosting disabled to reduce
run-to-run variation. The execution times were:

Before patch: time = 10.800s (with memory accounting), 2.848s (w/o
accounting), overhead = 7.952s
After patch 2: time = 9.140s, overhead = 6.292s
After patch 3: time = 7.641s, overhead = 4.793s
After patch 5: time = 6.801s, overhead = 3.953s

Thank you! If there will be v2, I'd include this information into commit logs.


Yes, I am planning to send out v2 with this information in the 
cover letter. I am just waiting a bit to see if there is more feedback.


-Longman




Patches 1 & 4 are preparatory patches that should not affect performance.

So the memory accounting overhead was reduced by about half.


BTW, the benchmark that I used is kind of a best-case behavior, as all 
updates are to the percpu stocks. Real workloads will likely have a 
certain amount of updates to the memcg charges and vmstats. So the 
performance benefit will be less.


Cheers,
Longman




Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-12 Thread Waiman Long

On 4/9/21 9:51 PM, Roman Gushchin wrote:

On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1].

With a simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
init. The execution time to load the kernel module with and without
memory accounting were:

   with accounting = 6.798s
   w/o  accounting = 1.758s

That is an increase of 5.04s (287%). With this patchset applied, the
execution time became 4.254s. So the memory accounting overhead is now
2.496s which is a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config included lockdep or not?
The test kernel is based on a production kernel config and so lockdep 
isn't enabled.

2) do you have a (rough) estimation how much each change contributes
to the overall reduction?


I should have a better breakdown of the effect of the individual patches. I 
reran the benchmarking module with turbo boost disabled to reduce 
run-to-run variation. The execution times were:


Before patch: time = 10.800s (with memory accounting), 2.848s (w/o 
accounting), overhead = 7.952s

After patch 2: time = 9.140s, overhead = 6.292s
After patch 3: time = 7.641s, overhead = 4.793s
After patch 5: time = 6.801s, overhead = 3.953s

Patches 1 & 4 are preparatory patches that should not affect performance.

So the memory accounting overhead was reduced by about half.

Cheers,
Longman



[PATCH 5/5] mm/memcg: Optimize user context object stock access

2021-04-09 Thread Waiman Long
Most kmem_cache_alloc() calls are from user context. With instrumentation
enabled, the measured number of kmem_cache_alloc() calls from non-task
context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content
from object stock is slow.  To optimize for user context access, there
are now two object stocks for task context and interrupt context access
respectively.

The task context object stock can be accessed after disabling preemption,
which is cheap in a non-preempt kernel. The interrupt context object stock
can only be accessed after disabling interrupts. User context code can
access the interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg
and lruvec stat updates are done with interrupts disabled.

The downside of this change is that there is more data stored in the local
object stocks that is not reflected in the charge counter and the vmstat
arrays.  However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 71 +++--
 1 file changed, 57 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..00c9074e42e5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
-   struct obj_stock obj;
+   struct obj_stock task_obj;
+   struct obj_stock irq_obj;
 
struct work_struct work;
unsigned long flags;
@@ -2254,11 +2255,46 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-   return &stock->obj;
+   return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)   \
+({ \
+   struct memcg_stock_pcp *stock;  \
+   struct obj_stock *obj_stock;\
+   \
+   if (in_task()) {\
+   preempt_disable();  \
+   (flags) = -1L;  \
+   obj_stock = &stock->task_obj;   \
+   } else {\
+   local_irq_save(flags);  \
+   obj_stock = &stock->irq_obj;\
+   }   \
+   obj_stock;  \
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+   if (flags == -1L)
+   preempt_enable();
+   else
+   local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2363,9 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
-   drain_obj_stock(&stock->obj);
+   drain_obj_stock(&stock->irq_obj);
+   if (in_task())
+   drain_obj_stock(&stock->task_obj);
	drain_stock(stock);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3221,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
memcg = obj_cgroup_memcg(objcg);
if (pgdat)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();
 }
 
@@ -3193,7 +3231,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
unsigned long flags;
bool ret = false;
 
-   local_irq_save(flags);
+   stock = get_obj_stock(flags);
 
stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3239,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
ret = true;
}
 
-   local_irq_restore(flags);
+   put_obj_stock(flags);
 
return ret;
 }
@@ -3254,8 +3292,13 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 {
struct mem_cgroup *memcg;
 
-   if (stock->obj.cached_objcg) {
-   memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
+  

[PATCH 4/5] mm/memcg: Separate out object stock data into its own struct

2021-04-09 Thread Waiman Long
The object stock data stored in struct memcg_stock_pcp are independent
of the other page-based data stored there. Separate them out into
their own struct to highlight their independence.

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-   struct mem_cgroup *cached; /* this never be root cgroup */
-   unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
int vmstat_idx;
int vmstat_bytes;
+#else
+   int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+   struct mem_cgroup *cached; /* this never be root cgroup */
+   unsigned int nr_pages;
+   struct obj_stock obj;
 
struct work_struct work;
unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, 
memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+   return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
-   drain_obj_stock(stock);
+   drain_obj_stock(&stock->obj);
	drain_stock(stock);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup 
*objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-   struct memcg_stock_pcp *stock;
+   struct obj_stock *stock;
unsigned long flags;
bool ret = false;
 
local_irq_save(flags);
 
-   stock = this_cpu_ptr(&memcg_stock);
+   stock = current_obj_stock();
if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 {
struct mem_cgroup *memcg;
 
-   if (stock->cached_objcg) {
-   memcg = obj_cgroup_memcg(stock->cached_objcg);
+   if (stock->obj.cached_objcg) {
+   memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
return true;
}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-   struct memcg_stock_pcp *stock;
+   struct obj_stock *stock = current_obj_stock();
 
-   stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
drain_obj_stock(stock);
obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
  struct pglist_data *pgdat, int idx, int nr)
 {
-   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+   struct obj_stock *stock = current_obj_stock();
 
if (stock->cached_objcg != objcg) {
/* Output the current data as is */
-- 
2.18.1



[PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-09 Thread Waiman Long
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To reduce it further, this patch
caches the vmstat data in the memcg_stock_pcp structure as well,
until it accumulates a page size worth of updates or until other cached
data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long 
---
 mm/memcontrol.c | 78 +++--
 mm/slab.h   | 26 +++--
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+   struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+   int vmstat_idx;
+   int vmstat_bytes;
 #endif
 
struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
order)
	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+  struct pglist_data *pgdat,
+  enum node_stat_item idx, int nr)
+{
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   if (pgdat)
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+   rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp 
*stock)
stock->nr_bytes = 0;
}
 
+   if (stock->vmstat_bytes) {
+   mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+   stock->vmstat_bytes);
+   stock->vmstat_bytes = 0;
+   stock->vmstat_idx = 0;
+   stock->cached_pgdat = NULL;
+   }
+
obj_cgroup_put(old);
stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat, int idx, int nr)
+{
+   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+   if (stock->cached_objcg != objcg) {
+   /* Output the current data as is */
+   } else if (!stock->vmstat_bytes) {
+   /* Save the current data */
+   stock->vmstat_bytes = nr;
+   stock->vmstat_idx = idx;
+   stock->cached_pgdat = pgdat;
+   nr = 0;
+   } else if ((stock->cached_pgdat != pgdat) ||
+  (stock->vmstat_idx != idx)) {
+   /* Output the cached data & save the current data */
+   swap(nr, stock->vmstat_bytes);
+   swap(idx, stock->vmstat_idx);
+   swap(pgdat, stock->cached_pgdat);
+   } else {
+   stock->vmstat_bytes += nr;
+   if (abs(nr) > PAGE_SIZE) {
+   nr = stock->vmstat_bytes;
+   stock->vmstat_bytes = 0;
+   } else {
+   nr = 0;
+   }
+   }
+   if (nr)
+   mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+int idx, int nr)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __mod_obj_stock_state(objcg, pgdat, idx, nr);
+   local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup 
*objcg, size_t size,
   struct pglist_data *pgdat, int idx)
 {
unsigned long flags;
-   struct mem_cgroup *memcg;
-   struct lruvec *lruvec = N

[PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-09 Thread Waiman Long
In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
is followed by mod_objcg_state()/mod_memcg_state(). Each of these
function calls goes through a separate irq_save/irq_restore cycle. That
is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long 
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c| 31 +++
 mm/percpu.c|  9 ++---
 mm/slab.h  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct 
memcg_stock_pcp *stock,
return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
struct memcg_stock_pcp *stock;
-   unsigned long flags;
-
-   local_irq_save(flags);
 
	stock = this_cpu_ptr(&memcg_stock);
if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
unsigned int nr_bytes)
 
if (stock->nr_bytes > PAGE_SIZE)
drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+   unsigned long flags;
 
+   local_irq_save(flags);
+   __refill_obj_stock(objcg, nr_bytes);
local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+  struct pglist_data *pgdat, int idx)
+{
+   unsigned long flags;
+   struct mem_cgroup *memcg;
+   struct lruvec *lruvec = NULL;
+
+   local_irq_save(flags);
+   __refill_obj_stock(objcg, size);
+
+   rcu_read_lock();
+   memcg = obj_cgroup_memcg(objcg);
+   if (pgdat)
+   lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+   rcu_read_unlock();
+   local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..24ecd51c688c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk 
*chunk, int off, size_t size)
objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-   obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-   rcu_read_lock();
-   mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-   -(size * num_possible_cpus()));
-   rcu_read_unlock();
-
+   obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+ NULL, MEMCG_PERCPU_B);
obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache 
*s_orig,
continue;
 
objcgs[off] = NULL;
-   obj_cgroup_uncharge(objcg, obj_full_size(s));
-   mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-   -obj_full_size(s));
+   obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+ page_pgdat(page),
+ cache_vmstat_idx(s));
obj_cgroup_put(objcg);
}
 }
-- 
2.18.1



[PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-09 Thread Waiman Long
The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
available. So both of them are now passed to mod_memcg_lruvec_state()
and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
updated to allow either of the two parameters to be set to null. This
makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
is null.

Signed-off-by: Waiman Long 
---
 include/linux/memcontrol.h | 12 +++-
 mm/memcontrol.c| 19 +--
 mm/slab.h  |  2 +-
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..95f12996e66c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct 
lruvec *lruvec,
return x;
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val);
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ enum node_stat_item idx, int val);
 void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
 
 static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
@@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum 
node_stat_item idx,
local_irq_restore(flags);
 }
 
-static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+ struct lruvec *lruvec,
  enum node_stat_item idx, int val)
 {
unsigned long flags;
 
local_irq_save(flags);
-   __mod_memcg_lruvec_state(lruvec, idx, val);
+   __mod_memcg_lruvec_state(memcg, lruvec, idx, val);
local_irq_restore(flags);
 }
 
@@ -1369,7 +1370,8 @@ static inline unsigned long 
lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
+   struct lruvec *lruvec,
enum node_stat_item idx, int val)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..d66e1e38f8ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
return mem_cgroup_nodeinfo(parent, nid);
 }
 
-void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
- int val)
+/*
+ * Either one of memcg or lruvec can be NULL, but not both.
+ */
+void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec,
+ enum node_stat_item idx, int val)
 {
struct mem_cgroup_per_node *pn;
-   struct mem_cgroup *memcg;
long x, threshold = MEMCG_CHARGE_BATCH;
 
+   /* Update lruvec */
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-   memcg = pn->memcg;
+
+   if (!memcg)
+   memcg = pn->memcg;
 
/* Update memcg */
__mod_memcg_state(memcg, idx, val);
 
-   /* Update lruvec */
+   if (!lruvec)
+   return;
+
__this_cpu_add(pn->lruvec_stat_local->count[idx], val);
 
if (vmstat_item_in_bytes(idx))
@@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum 
node_stat_item idx,
 
/* Update memcg and lruvec */
if (!mem_cgroup_disabled())
-   __mod_memcg_lruvec_state(lruvec, idx, val);
+   __mod_memcg_lruvec_state(NULL, lruvec, idx, val);
 }
 
 void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..bc6c7545e487 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
-   mod_memcg_lruvec_state(lruvec, idx, nr);
+   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
rcu_read_unlock();
 }
 
-- 
2.18.1



[PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-09 Thread Waiman Long
With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1].

With a simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at
module init, the execution times to load the kernel module with and
without memory accounting were:

  with accounting = 6.798s
  w/o  accounting = 1.758s

That is an increase of 5.04s (287%). With this patchset applied, the
execution time became 4.254s. So the memory accounting overhead is now
2.496s which is a 50% reduction.
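
For illustration only, a minimal sketch of this kind of benchmark module
is shown below. The cache name, the SLAB_ACCOUNT flag and the timing
printout are assumptions made for the sketch; it is not the actual test
module used to produce the numbers above. Dropping SLAB_ACCOUNT gives the
"w/o accounting" baseline.

/* bench_kmemcache.c - illustrative sketch of the benchmark module */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

static int __init bench_init(void)
{
	struct kmem_cache *cache;
	ktime_t start;
	void *obj;
	long i;

	/*
	 * SLAB_ACCOUNT turns on the per-object memcg accounting being
	 * measured; load the module from within a non-root memory cgroup
	 * so the allocations are actually charged.
	 */
	cache = kmem_cache_create("bench64", 64, 0, SLAB_ACCOUNT, NULL);
	if (!cache)
		return -ENOMEM;

	start = ktime_get();
	for (i = 0; i < 100000000L; i++) {
		obj = kmem_cache_alloc(cache, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(cache, obj);
	}
	pr_info("bench64: %lld ms for %ld alloc/free pairs\n",
		ktime_ms_delta(ktime_get(), start), i);

	kmem_cache_destroy(cache);
	return 0;
}

static void __exit bench_exit(void) { }

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");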

It was found that a major part of the memory accounting overhead
is caused by the local_irq_save()/local_irq_restore() sequences in
updating the local stock charge bytes and vmstat array, at least on x86
systems. There are two such sequences in kmem_cache_alloc() and two
in kmem_cache_free(). This patchset tries to reduce the use of such
sequences as much as possible. In fact, it eliminates them in the common
case. Another part of this patchset caches the vmstat data updates in
the local stock as well, which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h |  14 ++-
 mm/memcontrol.c| 198 -
 mm/percpu.c|   9 +-
 mm/slab.h  |  32 +++---
 4 files changed, 195 insertions(+), 58 deletions(-)

-- 
2.18.1



Re: [OpenRISC] [PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-04-08 Thread Waiman Long

On 4/6/21 7:52 PM, Stafford Horne wrote:


For OpenRISC I did ack the patch to convert to
CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y.  But I think you are right, the
generic code in xchg_tail and the xchg16 emulation code in produced by OpenRISC
using xchg32 would produce very similar code.  I have not compared instructions,
but it does seem like duplicate functionality.

Why doesn't RISC-V add the xchg16 emulation code similar to OpenRISC?  For
OpenRISC we added xchg16 and xchg8 emulation code to enable qspinlocks.  So
one thought is with CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y, can we remove our
xchg16/xchg8 emulation code?


For the record, the latest qspinlock code doesn't use xchg8 anymore. It 
still needs xchg16, though.
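
For illustration, below is a rough user-space sketch of how a 16-bit xchg
can be emulated on top of a 32-bit compare-and-swap, which is the kind of
emulation being discussed here. It assumes a little-endian lane layout and
GCC's __atomic builtins; it is not the actual OpenRISC or qspinlock code.

#include <stdint.h>

/* Emulate xchg16 with a 32-bit CAS loop on the containing aligned word. */
static uint16_t xchg16_emulated(volatile uint16_t *p, uint16_t newval)
{
	volatile uint32_t *word =
		(volatile uint32_t *)((uintptr_t)p & ~(uintptr_t)3);
	unsigned int shift = ((uintptr_t)p & 2) * 8;	/* 0 or 16 on little endian */
	uint32_t mask = 0xffffu << shift;
	uint32_t old = *word, val;

	do {
		val = (old & ~mask) | ((uint32_t)newval << shift);
		/* on failure, 'old' is refreshed with the current word value */
	} while (!__atomic_compare_exchange_n(word, &old, val, 0,
					      __ATOMIC_ACQ_REL, __ATOMIC_RELAXED));
	return (old & mask) >> shift;
}

int main(void)
{
	uint16_t v[2] = { 1, 2 };
	uint16_t prev = xchg16_emulated(&v[1], 42);

	return !(prev == 2 && v[0] == 1 && v[1] == 42);
}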


Cheers,
Longman



Re: [PATCH v4] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-06 Thread Waiman Long

On 4/6/21 5:15 AM, Peter Zijlstra wrote:

On Mon, Apr 05, 2021 at 07:42:03PM -0400, Waiman Long wrote:

The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other mean).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.
[ 7809.796281]  
[ 7809.796282]  _raw_spin_lock_irqsave+0x32/0x40
[ 7809.796283]  print_cpu+0x261/0x7c0
[ 7809.796283]  sysrq_sched_debug_show+0x34/0x50
[ 7809.796284]  sysrq_handle_showstate+0xc/0x20
[ 7809.796284]  __handle_sysrq.cold.11+0x48/0xfb
[ 7809.796285]  write_sysrq_trigger+0x2b/0x30
[ 7809.796285]  proc_reg_write+0x39/0x60
[ 7809.796286]  vfs_write+0xa5/0x1a0
[ 7809.796286]  ksys_write+0x4f/0xb0
[ 7809.796287]  do_syscall_64+0x5b/0x1a0
[ 7809.796287]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7809.796288] RIP: 0033:0x7fabe4ceb648

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't
need serialization from sched_debug_lock.
The print_cpu() function has two callers - sched_debug_show() and
sysrq_sched_debug_show().

So what idiot is doing sysrq and that proc file at the same time? Why is
it a problem now?
We got a customer bug report about a watchdog panic because a process somehow 
stayed within the sched_debug_lock for too long. I don't know exactly what 
the customer was doing, but we can reproduce the panic just by having two 
parallel "echo t > /proc/sysrq-trigger" commands. This happens on 
certain selected systems. I was using some Dell systems for my testing. 
Of course, it is not sensible to have more than one sysrq happening at 
the same time. However, a parallel thread accessing /proc/sched_debug is 
certainly possible. That invokes a similar code path.
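
For reference, a rough C equivalent of those two parallel sysrq-t writes is
sketched below. It assumes root privileges and an enabled kernel.sysrq; it
dumps every task's state to the console, so it should only be run on a
disposable test system.

#include <fcntl.h>
#include <unistd.h>

static void sysrq_t(void)
{
	int fd = open("/proc/sysrq-trigger", O_WRONLY);

	if (fd >= 0) {
		if (write(fd, "t", 1) != 1)
			; /* sysrq may be disabled; nothing to do */
		close(fd);
	}
}

int main(void)
{
	if (fork() == 0) {	/* child: first concurrent writer */
		sysrq_t();
		_exit(0);
	}
	sysrq_t();		/* parent: second concurrent writer */
	return 0;
}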



@@ -470,16 +468,49 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
  #endif
  
  #ifdef CONFIG_CGROUP_SCHED

+static DEFINE_SPINLOCK(sched_debug_lock);
  static char group_path[PATH_MAX];
+static enum {
+   TOKEN_NONE,
+   TOKEN_ACQUIRED,
+   TOKEN_NA/* Not applicable */
+} console_token = TOKEN_ACQUIRED;
+/*
+ * All the print_cpu() callers from sched_debug_show() will be allowed
+ * to contend for sched_debug_lock and use group_path[] as their SEQ_printf()
+ * calls will be much faster. However only one print_cpu() caller from
+ * sysrq_sched_debug_show() which outputs to the console will be allowed
+ * to use group_path[]. Another parallel console writer will have to use
+ * a shorter stack buffer instead. Since the console output will be garbled
+ * anyway, truncation of some cgroup paths shouldn't be a big issue.
+ */
+#define SEQ_printf_task_group_path(m, tg, fmt...)  \
+{  \
+   unsigned long flags;\
+   int token = m ? TOKEN_NA\
+ : xchg_acquire(&console_token, TOKEN_NONE);   \
+   \
+   if (token == TOKEN_NONE) {  \
+   char buf[128];  \
+   task_group_path(tg, buf, sizeof(buf));  \
+   SEQ_printf(m, fmt, buf);\
+   } else {\
+   spin_lock_irqsave(&sched_debug_lock, flags);\
+   task_group_path(tg, group_path, sizeof(group_path));\
+   SEQ_printf(m, fmt, group_path); \
+   spin_unlock_irqrestore(&sched_debug_lock, flags);   \
+   if (token == TOKEN_ACQUIRED)\
+   smp_store_release(&console_token, token);   \
+   }   \
  }

This is disgusting... you have an open-coded test-and-set lock like
thing *AND* a spinlock, what gives?


What's wrong with something simple like this?

---
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 4b49cc2af5c4..2ac2977f3b96 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
   */
  #include "sched.h"
  
-static DEFINE_SPINLOCK(sched_debug_lock);

-
  /*
   * This allows printing both to /proc/sched_debug and
   * to the console
@@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
  #endif
  
  #ifdef CONFIG_CGROUP_SCHED

+static DEFINE_SPINLOCK(group_path_lock);
  static char group_path[PATH_MAX];
  
  static char *task_group_path(struct task_group *tg)

@@ -481,6 +480,22 @@ static char *task_group_path(struct task_group *tg)
  
  	return grou

Re: [PATCH v4] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-05 Thread Waiman Long

On 4/5/21 8:18 PM, Steven Rostedt wrote:

On Mon,  5 Apr 2021 19:42:03 -0400
Waiman Long  wrote:


+/*
+ * All the print_cpu() callers from sched_debug_show() will be allowed
+ * to contend for sched_debug_lock and use group_path[] as their SEQ_printf()
+ * calls will be much faster. However only one print_cpu() caller from
+ * sysrq_sched_debug_show() which outputs to the console will be allowed
+ * to use group_path[]. Another parallel console writer will have to use
+ * a shorter stack buffer instead. Since the console output will be garbled
+ * anyway, truncation of some cgroup paths shouldn't be a big issue.
+ */
+#define SEQ_printf_task_group_path(m, tg, fmt...)  \
+{  \
+   unsigned long flags;\
+   int token = m ? TOKEN_NA\
+ : xchg_acquire(&console_token, TOKEN_NONE);   \
+   \
+   if (token == TOKEN_NONE) {  \
+   char buf[128];  \
+   task_group_path(tg, buf, sizeof(buf));  \
+   SEQ_printf(m, fmt, buf);\
+   } else {\
+   spin_lock_irqsave(&sched_debug_lock, flags);\
+   task_group_path(tg, group_path, sizeof(group_path));\
+   SEQ_printf(m, fmt, group_path); \
+   spin_unlock_irqrestore(&sched_debug_lock, flags);   \
+   if (token == TOKEN_ACQUIRED)\
+   smp_store_release(&console_token, token);   \
+   }   \
  }
  #endif

And you said my suggestion was complex!

I'll let others review this.

This patch is actually inspired by your suggestion, though it is 
structured differently from your approach. I really want to thank you 
for your valuable feedback.


I realized that printing to a sequence file wasn't really a problem; 
only printing to the console can be problematic. That is why I decided to 
allow unlimited use of group_path[] for those users while only one console 
writer is allowed to use it. As for calling touch_nmi_watchdog(), I am 
still thinking about where the right place to do it will be, but it can be 
done with a separate patch, if needed.


Cheers,
Longman



[PATCH v4] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-05 Thread Waiman Long
The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other means).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.

There should be no more than one instance of sysrq key processing via
a terminal, but multiple instances of /proc/sysrq-trigger is possible.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupt disabled for the whole duration of the calls to
print_*_stats() and print_rq() which could last for quite some time
if the information dump happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actual serial console implementation of the
system. For instance,

[ 7809.796262] Kernel panic - not syncing: Hard LOCKUP
[ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G  
I  - -  - 4.18.0-301.el8.x86_64 #1
[ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[ 7809.796265] Call Trace:
[ 7809.796265]  
[ 7809.796266]  dump_stack+0x5c/0x80
[ 7809.796266]  panic+0xe7/0x2a9
[ 7809.796267]  nmi_panic.cold.9+0xc/0xc
[ 7809.796267]  watchdog_overflow_callback.cold.7+0x5c/0x70
[ 7809.796268]  __perf_event_overflow+0x52/0xf0
[ 7809.796268]  handle_pmi_common+0x204/0x2a0
[ 7809.796269]  ? __set_pte_vaddr+0x32/0x50
[ 7809.796269]  ? __native_set_fixmap+0x24/0x30
[ 7809.796270]  ? ghes_copy_tofrom_phys+0xd3/0x1c0
[ 7809.796271]  intel_pmu_handle_irq+0xbf/0x160
[ 7809.796271]  perf_event_nmi_handler+0x2d/0x50
[ 7809.796272]  nmi_handle+0x63/0x110
[ 7809.796272]  default_do_nmi+0x49/0x100
[ 7809.796273]  do_nmi+0x17e/0x1e0
[ 7809.796273]  end_repeat_nmi+0x16/0x6f
[ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 
30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 
c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
[ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002
[ 7809.796277] RAX: 0101 RBX: 0246 RCX: 
[ 7809.796278] RDX:  RSI:  RDI: 936b66d0
[ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f
[ 7809.796279] R10:  R11: aa54cd887cc0 R12: 907fd0a29ec0
[ 7809.796280] R13:  R14: 926ab7c0 R15: 
[ 7809.796280]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  
[ 7809.796282]  _raw_spin_lock_irqsave+0x32/0x40
[ 7809.796283]  print_cpu+0x261/0x7c0
[ 7809.796283]  sysrq_sched_debug_show+0x34/0x50
[ 7809.796284]  sysrq_handle_showstate+0xc/0x20
[ 7809.796284]  __handle_sysrq.cold.11+0x48/0xfb
[ 7809.796285]  write_sysrq_trigger+0x2b/0x30
[ 7809.796285]  proc_reg_write+0x39/0x60
[ 7809.796286]  vfs_write+0xa5/0x1a0
[ 7809.796286]  ksys_write+0x4f/0xb0
[ 7809.796287]  do_syscall_64+0x5b/0x1a0
[ 7809.796287]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7809.796288] RIP: 0033:0x7fabe4ceb648

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk calls don't
need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be problematic if
multiple instances are running. Allocating a stack buffer of PATH_MAX
bytes is not feasible because of the limited size of the kernel stack.

The print_cpu() function has two callers - sched_debug_show() and
sysrq_sched_debug_show(). The solution implemented in this patch is to
allow all sched_debug_show() callers to contend for sched_debug_lock and
use the full size group_path[] as their SEQ_printf() calls will be much
faster. However only one sysrq_sched_debug_show() caller that outputs to
the slow console will be allowed to use group_path[]. Another parallel
console writer will have to use a shorter stack buffer instead. Since
the console output will be garbled anyway, truncation of some cgroup
paths shouldn't be a big issue.

Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Signed-off-by: Waiman Long 
---
 kernel/sched/debug.c | 54 +---
 1 file changed, 41 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..5d021b247998 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,16 +468,49 @@ static void print_cfs_group_stats(st

Re: [PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-04 Thread Waiman Long

On 4/4/21 9:27 PM, Waiman Long wrote:

On 4/4/21 12:02 PM, Steven Rostedt wrote:

On Fri, 2 Apr 2021 23:09:09 -0400
Waiman Long  wrote:


The main problem with sched_debug_lock is that under certain
circumstances, a lock waiter may wait a long time to acquire the lock
(in seconds). We can't insert touch_nmi_watchdog() while the cpu is
waiting for the spinlock.

The problem I have with the patch is that it seems to be a hack (as it
doesn't fix the issue in all cases). Since sched_debug_lock is
"special", perhaps we can add wrappers to take it, and instead of doing
the spin_lock_irqsave(), do a trylock loop. Add lockdep annotation to
tell lockdep that this is not a try lock (so that it can still detect
deadlocks).

Then have the strategically placed touch_nmi_watchdog() also increment
a counter. Then in that trylock loop, if it sees the counter get
incremented, it knows that forward progress is being made by the lock
holder, and it too can call touch_nmi_watchdog().


Thanks for the suggestion, but it also sounds complicated.

I think we can fix this lockup problem if we are willing to lose some 
information in case of contention. As you have suggested, a trylock 
will be used to acquire sched_debug_lock. If it succeeds, all is good. 
Otherwise, a shorter stack buffer will be used for cgroup path. The 
path may be truncated in this case. If we detect that the full length 
of the buffer is used, we assume truncation and add, e.g. "...", to 
indicate there is more to the actual path.


Do you think this is an acceptable compromise? 


Actually, I don't really need to disable interrupts in this situation 
as a deadlock can't happen.


Cheers,
Longman
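
A rough sketch of the trylock-loop-with-progress-counter idea quoted in the
message above (not code from this thread; the _sketch names are made up, and
the lockdep annotation mentioned in the suggestion is omitted for brevity):

#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/nmi.h>

static DEFINE_SPINLOCK(sched_debug_lock_sketch);
static atomic_t debug_progress_sketch;

/* The lock holder calls this from its strategically placed spots. */
static void sched_debug_touch_sketch(void)
{
	atomic_inc(&debug_progress_sketch);
	touch_nmi_watchdog();
}

/* Waiters spin with trylock and touch the watchdog on observed progress. */
static void sched_debug_lock_sketch_acquire(void)
{
	int seen = atomic_read(&debug_progress_sketch);

	while (!spin_trylock(&sched_debug_lock_sketch)) {
		int now = atomic_read(&debug_progress_sketch);

		if (now != seen) {	/* holder is making forward progress */
			touch_nmi_watchdog();
			seen = now;
		}
		cpu_relax();
	}
}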



Re: [PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-04 Thread Waiman Long

On 4/4/21 12:02 PM, Steven Rostedt wrote:

On Fri, 2 Apr 2021 23:09:09 -0400
Waiman Long  wrote:


The main problem with sched_debug_lock is that under certain
circumstances, a lock waiter may wait a long time to acquire the lock
(in seconds). We can't insert touch_nmi_watchdog() while the cpu is
waiting for the spinlock.

The problem I have with the patch is that it seems to be a hack (as it
doesn't fix the issue in all cases). Since sched_debug_lock is
"special", perhaps we can add wrappers to take it, and instead of doing
the spin_lock_irqsave(), do a trylock loop. Add lockdep annotation to
tell lockdep that this is not a try lock (so that it can still detect
deadlocks).

Then have the strategically placed touch_nmi_watchdog() also increment
a counter. Then in that trylock loop, if it sees the counter get
incremented, it knows that forward progress is being made by the lock
holder, and it too can call touch_nmi_watchdog().


Thanks for the suggestion, but it also sounds complicated.

I think we can fix this lockup problem if we are willing to lose some 
information in case of contention. As you have suggested, a trylock will 
be used to acquire sched_debug_lock. If it succeeds, all is good. 
Otherwise, a shorter stack buffer will be used for cgroup path. The path 
may be truncated in this case. If we detect that the full length of the 
buffer is used, we assume truncation and add, e.g. "...", to indicate 
there is more to the actual path.


Do you think this is an acceptable compromise?

Cheers,
Longman
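
A minimal sketch of the trylock-or-truncate fallback described in the message
above, written as if it lived in kernel/sched/debug.c (SEQ_printf and struct
task_group come from there); the _sketch names are illustrative only, not the
posted patch:

static DEFINE_SPINLOCK(sched_debug_lock_sketch);
static char group_path_sketch[PATH_MAX];

static void seq_printf_tg_path_sketch(struct seq_file *m, struct task_group *tg)
{
	char small[128];

	if (spin_trylock(&sched_debug_lock_sketch)) {
		/* Uncontended: use the full-size global buffer. */
		cgroup_path(tg->css.cgroup, group_path_sketch, PATH_MAX);
		SEQ_printf(m, "%s", group_path_sketch);
		spin_unlock(&sched_debug_lock_sketch);
		return;
	}

	/* Contended: short on-stack buffer, flag a likely truncation. */
	if (cgroup_path(tg->css.cgroup, small, sizeof(small)) >= (int)sizeof(small))
		strscpy(small + sizeof(small) - 4, "...", 4);
	SEQ_printf(m, "%s", small);
}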



Re: [PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-02 Thread Waiman Long

On 4/2/21 4:40 PM, Steven Rostedt wrote:

On Thu,  1 Apr 2021 14:10:30 -0400
Waiman Long  wrote:


The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other means).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.

There should be no more than one instance of sysrq key processing via
a terminal, but multiple instances of /proc/sysrq-trigger is possible.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupt disabled for the whole duration of the calls to
print_*_stats() and print_rq() which could last for quite some time
if the information dump happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actual serial console implementation of the
system. For instance,

Wouldn't placing strategically located "touch_nmi_watchdog()"s around fix
this?

-- Steve

The main problem with sched_debug_lock is that under certain 
circumstances, a lock waiter may wait a long time to acquire the lock 
(in seconds). We can't insert touch_nmi_watchdog() while the cpu is 
waiting for the spinlock.


Cheers,
Longman



[PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-04-01 Thread Waiman Long
The handling of sysrq key can be activated by echoing the key to
/proc/sysrq-trigger or via the magic key sequence typed into a terminal
that is connected to the system in some way (serial, USB or other means).
In the former case, the handling is done in a user context. In the
latter case, it is likely to be in an interrupt context.

There should be no more than one instance of sysrq key processing via
a terminal, but multiple instances of /proc/sysrq-trigger is possible.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is
taken with interrupt disabled for the whole duration of the calls to
print_*_stats() and print_rq() which could last for quite some time
if the information dump happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic
depending on the actual serial console implementation of the
system. For instance,

[ 7809.796262] Kernel panic - not syncing: Hard LOCKUP
[ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G  
I  - -  - 4.18.0-301.el8.x86_64 #1
[ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[ 7809.796265] Call Trace:
[ 7809.796265]  
[ 7809.796266]  dump_stack+0x5c/0x80
[ 7809.796266]  panic+0xe7/0x2a9
[ 7809.796267]  nmi_panic.cold.9+0xc/0xc
[ 7809.796267]  watchdog_overflow_callback.cold.7+0x5c/0x70
[ 7809.796268]  __perf_event_overflow+0x52/0xf0
[ 7809.796268]  handle_pmi_common+0x204/0x2a0
[ 7809.796269]  ? __set_pte_vaddr+0x32/0x50
[ 7809.796269]  ? __native_set_fixmap+0x24/0x30
[ 7809.796270]  ? ghes_copy_tofrom_phys+0xd3/0x1c0
[ 7809.796271]  intel_pmu_handle_irq+0xbf/0x160
[ 7809.796271]  perf_event_nmi_handler+0x2d/0x50
[ 7809.796272]  nmi_handle+0x63/0x110
[ 7809.796272]  default_do_nmi+0x49/0x100
[ 7809.796273]  do_nmi+0x17e/0x1e0
[ 7809.796273]  end_repeat_nmi+0x16/0x6f
[ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 
30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 
c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
[ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002
[ 7809.796277] RAX: 0101 RBX: 0246 RCX: 
[ 7809.796278] RDX:  RSI:  RDI: 936b66d0
[ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f
[ 7809.796279] R10:  R11: aa54cd887cc0 R12: 907fd0a29ec0
[ 7809.796280] R13:  R14: 926ab7c0 R15: 
[ 7809.796280]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  
[ 7809.796282]  _raw_spin_lock_irqsave+0x32/0x40
[ 7809.796283]  print_cpu+0x261/0x7c0
[ 7809.796283]  sysrq_sched_debug_show+0x34/0x50
[ 7809.796284]  sysrq_handle_showstate+0xc/0x20
[ 7809.796284]  __handle_sysrq.cold.11+0x48/0xfb
[ 7809.796285]  write_sysrq_trigger+0x2b/0x30
[ 7809.796285]  proc_reg_write+0x39/0x60
[ 7809.796286]  vfs_write+0xa5/0x1a0
[ 7809.796286]  ksys_write+0x4f/0xb0
[ 7809.796287]  do_syscall_64+0x5b/0x1a0
[ 7809.796287]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7809.796288] RIP: 0033:0x7fabe4ceb648

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk calls don't
need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be problematic if
multiple instances are running. Allocating a stack buffer of PATH_MAX
bytes is not feasible. So a compromise solution is used where a small
stack buffer is allocated for pathname. If the actual pathname is
short enough, it is copied to the stack buffer with sched_debug_lock
release afterward before printk.  Otherwise, the global group_path[]
buffer will be used with sched_debug_lock held until after printk().

This patch does not completely solve the problem, but it will greatly
reduce the chance of its occurrence. To really fix it, we probably need
to wait until the printk rework is done so that printk() will take much
less time to execute than before.

Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Signed-off-by: Waiman Long 
---
 kernel/sched/debug.c | 33 +
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..191ccb111cb4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
 #endif
 
 #ifdef CONFIG_CGROUP_SCH

Re: [PATCH v2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-03-30 Thread Waiman Long

On 3/30/21 6:42 AM, Daniel Thompson wrote:

On Mon, Mar 29, 2021 at 03:32:35PM -0400, Waiman Long wrote:

The handling of sysrq keys should normally be done in a user context
except when MAGIC_SYSRQ_SERIAL is set and the magic sequence is typed
in a serial console.

This seems to be a poor summary of the typical calling context for
handle_sysrq() except in the trivial case of using
/proc/sysrq-trigger.

For example on my system then the backtrace when I do sysrq-h on a USB
keyboard shows us running from a softirq handler and with interrupts
locked. Note also that the interrupt lock is present even on systems that
handle keyboard input from a kthread due to the interrupt lock in
report_input_key().
I will reword this part of the patch. I don't have a deep understanding 
of how the different ways of keyboard input work, and thanks for showing me 
that there are other ways of getting keyboard input.



Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken
with interrupt disabled for the whole duration of the calls to print_*_stats()
and print_rq() which could last for quite some time if the information dump
happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic, like




The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk() calls
don't need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be
problematic. Allocating a stack buffer of PATH_MAX bytes is not
feasible. So a compromise solution is used where a small stack buffer
is allocated for pathname. If the actual pathname is short enough, it
is copied to the stack buffer with sched_debug_lock release afterward
before printk().  Otherwise, the global group_path[] buffer will be
used with sched_debug_lock held until after printk().

Does this actually fix the problem in any circumstance except when the
sysrq is triggered using /proc/sysrq-trigger?


I have a reproducer that generates a hard lockup panic when there are 
multiple instances of sysrq-t via /proc/sysrq-trigger. This is probably 
less of a problem on the console as I don't think we can do multiple 
simultaneous sysrq-t there. Anyway, my goal is to limit the amount of 
time that irq is disabled. Doing a printk can take a while depending on 
whether there is contention in the underlying locks or resources. Even 
if I limit the critical sections to just those printk() calls that output 
the cgroup path, I can still cause the panic.


Cheers,
Longman

The approach used by this patch should minimize the chance of a panic 
happening. However, if there are many tasks with very long cgroup paths, 
I suppose that a panic may still happen under some extreme conditions. So 
I won't say this will completely fix the problem until the printk() 
rework that makes printk work more like printk_deferred() is merged.
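
For reference, a hedged sketch of the existing printk_deferred() behaviour that
the sentence above alludes to; the lock and function names below are made up
for illustration and this is not a proposal from the thread:

#include <linux/printk.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(example_lock_sketch);

static void warn_under_lock_sketch(int irq)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&example_lock_sketch, flags);
	/* ... update state that must stay consistent ... */
	/* Message is logged now; console output is deferred to a safe context. */
	printk_deferred(KERN_WARNING "irq %d: example warning while lock held\n", irq);
	raw_spin_unlock_irqrestore(&example_lock_sketch, flags);
}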





Re: [PATCH v4 3/4] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-30 Thread Waiman Long

On 3/29/21 11:13 PM, Guo Ren wrote:

On Mon, Mar 29, 2021 at 8:50 PM Peter Zijlstra  wrote:

On Mon, Mar 29, 2021 at 08:01:41PM +0800, Guo Ren wrote:

u32 a = 0x55aa66bb;
u16 *ptr = &a;

CPU0   CPU1
= =
xchg16(ptr, new) while(1)
 WRITE_ONCE(*(ptr + 1), x);

When we use lr.w/sc.w to implement xchg16, it'll cause a CPU0 deadlock.

Then I think your LL/SC is broken.

That also means you really don't want to build super complex locking
primitives on top, because that live-lock will percolate through.

Do you mean the below implementation has live-lock risk?
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
If there is a continuous stream of incoming spinlock takers, it is 
possible that some cpus may have to wait a long time to set the tail 
right. However, this should only happen with an artificial workload. I doubt 
it will happen with a real workload or with a limited number of cpus.



Step 1 would be to get your architecute fixed such that it can provide
fwd progress guarantees for LL/SC. Otherwise there's absolutely no point
in building complex systems with it.

Quote Waiman's comment [1] on xchg16 optimization:

"This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load."

[1] 
https://lore.kernel.org/kvm/1429901803-29771-6-git-send-email-waiman.l...@hp.com/

So for a non-xchg16 machine:
  - ticket-lock for small numbers of CPUs
  - qspinlock for large numbers of CPUs

Okay, I'll put all of them into the next patch :P

It is true that qspinlock may not offer much advantage when the number 
of cpus is small. It shines for systems with many cpus. You may use 
NR_CPUS to determine if the default should be ticket or qspinlock, with 
a user override. To determine the right NR_CPUS threshold, you may need to 
run on real SMP RISC-V systems to find out.


Cheers,
Longman
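
One purely illustrative way the NR_CPUS-based default above could be expressed
in an arch spinlock header; CONFIG_NR_CPUS is real, but the "16" threshold,
the RISCV_TICKET_SPINLOCKS_SKETCH symbol and asm/ticket_spinlock_sketch.h are
hypothetical placeholders (only asm-generic/qspinlock.h exists):

/* Sketch: pick a default lock implementation from the configured cpu count. */
#if defined(CONFIG_RISCV_TICKET_SPINLOCKS_SKETCH) || CONFIG_NR_CPUS < 16
#include <asm/ticket_spinlock_sketch.h>	/* hypothetical ticket-lock header */
#else
#include <asm-generic/qspinlock.h>	/* existing queued-spinlock header */
#endif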



Re: [PATCH -tip] locking: Move CONFIG_LOCK_EVENT_COUNTS into Kernel hacking section

2021-03-29 Thread Waiman Long

On 3/29/21 10:06 PM, Davidlohr Bueso wrote:

It's a lot more intuitive to have it in the locking section of the kernel
hacking part rather than under "General architecture-dependent options".

Signed-off-by: Davidlohr Bueso 
---
  arch/Kconfig  | 9 -
  lib/Kconfig.debug | 9 +
  2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index ecfd3520b676..d6f9aeaaf9f2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1113,15 +1113,6 @@ config HAVE_ARCH_PREL32_RELOCATIONS
  config ARCH_USE_MEMREMAP_PROT
bool
  
-config LOCK_EVENT_COUNTS

-   bool "Locking event counts collection"
-   depends on DEBUG_FS
-   help
- Enable light-weight counting of various locking related events
- in the system with minimal performance impact. This reduces
- the chance of application behavior change because of timing
- differences. The counts are reported via debugfs.
-
  # Select if the architecture has support for applying RELR relocations.
  config ARCH_HAS_RELR
bool
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2779c29d9981..76639ff5998c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1401,6 +1401,15 @@ config DEBUG_LOCKING_API_SELFTESTS
  The following locking APIs are covered: spinlocks, rwlocks,
  mutexes and rwsems.
  
+config LOCK_EVENT_COUNTS

+   bool "Locking event counts collection"
+   depends on DEBUG_FS
+   help
+ Enable light-weight counting of various locking related events
+ in the system with minimal performance impact. This reduces
+ the chance of application behavior change because of timing
+ differences. The counts are reported via debugfs.
+
  config LOCK_TORTURE_TEST
tristate "torture tests for locking"
depends on DEBUG_KERNEL


It looks good to me.

Acked-by: Waiman Long 



Re: [PATCH v2] x86/apic/vector: Move pr_warn() out of vector_lock

2021-03-29 Thread Waiman Long

On 3/29/21 8:42 AM, Thomas Gleixner wrote:

Waiman,

On Sun, Mar 28 2021 at 20:52, Waiman Long wrote:

It was found that the following circular locking dependency warning
could happen in some systems:

[  218.097878] ==
[  218.097879] WARNING: possible circular locking dependency detected
[  218.097880] 4.18.0-228.el8.x86_64+debug #1 Not tainted

Reports have to be against latest mainline and not against the random
distro frankenkernel of the day. That's nothing new.

Plus I was asking you to provide a full splat to look at so this can be
discussed _upfront_. Oh well...


That was the full splat that I can see except the following trailing data:

[  218.098064] RIP: 0033:0x7ff4ee7620d6
[  218.098066] Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 
41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 
05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44
[  218.098067] RSP: 002b:7ffdda1116a0 EFLAGS: 0293 ORIG_RAX: 
0101
[  218.098069] RAX: ffda RBX: 564733953f70 RCX: 
7ff4ee7620d6
[  218.098070] RDX: 00080902 RSI: 5647339560e0 RDI: 
ff9c
[  218.098071] RBP: 0015 R08:  R09: 
004b
[  218.098072] R10:  R11: 0293 R12: 
00080902
[  218.098074] R13: 5647339560e0 R14: 564733953a90 R15: 
0002

[  218.098460] irq 3: Affinity broken due to vector space exhaustion.



[  218.097914] -> #2 (&irq_desc_lock_class){-.-.}:
[  218.097917]_raw_spin_lock_irqsave+0x48/0x81
[  218.097918]__irq_get_desc_lock+0xcf/0x140
[  218.097919]__dble_irq_nosync+0x6e/0x110

This function does not even exist in mainline and never existed...


[  218.097967]
[  218.097967] Chain exists of:
[  218.097968]   console_owner --> &irq_desc_lock_class --> vector_lock
[  218.097972]
[  218.097973]  Possible unsafe locking scenario:
[  218.097973]
[  218.097974]CPU0CPU1
[  218.097975]
[  218.097975]   lock(vector_lock);
[  218.097977]lock(&irq_desc_lock_class);
[  218.097980]lock(vector_lock);
[  218.097981]   lock(console_owner);
[  218.097983]
[  218.097984]  *** DEADLOCK ***
[  218.097984]
[  218.097985] 6 locks held by systemd/1:
[  218.097986]  #0: 88822b5cc1e8 (&tty->legacy_mutex){+.+.}, at: 
tty_init_dev+0x79/0x440
[  218.097989]  #1: 88832ee00770 (&port->mutex){+.+.}, at: 
tty_port_open+0x85/0x190
[  218.097993]  #2: 88813be85a88 (&desc->request_mutex){+.+.}, at: 
__setup_irq+0x249/0x1e60
[  218.097996]  #3: 88813be858c0 (&irq_desc_lock_class){-.-.}, at: 
__setup_irq+0x2d9/0x1e60
[  218.098000]  #4: 84afca78 (vector_lock){-.-.}, at: 
x86_vector_activate+0xca/0xab0
[  218.098003]  #5: 84c27e20 (console_lock){+.+.}, at: 
vprintk_emit+0x13a/0x450

This is a more fundamental problem than just vector lock and the same
problem exists with any other printk over serial which is nested in the
interrupt activation chain not only on X86.


That is true. This problem is more generic than just that. I am hoping 
that the new printk rewrite may address this problem. I have been 
waiting for a while and that work is still not upstream yet. So what is 
your current timeline for that? If that happens soon, I probably 
don't need this patch. I sent this patch out as I am uncertain about it.




-static int activate_reserved(struct irq_data *irqd)
+static int activate_reserved(struct irq_data *irqd, char *wbuf, size_t wsize)
  {

...


if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd),
irq_data_get_affinity_mask(irqd))) {
-   pr_warn("irq %u: Affinity broken due to vector space 
exhaustion.\n",
-   irqd->irq);
+   snprintf(wbuf, wsize, KERN_WARNING
+"irq %u: Affinity broken due to vector space 
exhaustion.\n",
+irqd->irq);

This is not really any more tasteful than the previous one and it does
not fix the fundamental underlying problem.

But, because I'm curious and printk is a constant source of trouble, I
just added unconditional pr_warns into those functions under vector_lock
on 5.12-rc5.

Still waiting for the lockdep splat to show up while enjoying the
trickle of printks over serial.

If you really think this is an upstream problem then please provide a
corresponding lockdep splat on plain 5.12-rc5 along with a .config and
the scenario which triggers this. Not less, not more.


I will try to reproduce this problem with an upstream kernel.

Thanks,
Longman



[PATCH v2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-03-29 Thread Waiman Long
The handling of sysrq keys should normally be done in a user context
except when MAGIC_SYSRQ_SERIAL is set and the magic sequence is typed
in a serial console.

Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken
with interrupt disabled for the whole duration of the calls to print_*_stats()
and print_rq() which could last for quite some time if the information dump
happens on the serial console.

If the system has many cpus and the sched_debug_lock is somehow busy
(e.g. parallel sysrq-t), the system may hit a hard lockup panic, like

[ 7809.796262] Kernel panic - not syncing: Hard LOCKUP
[ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G  
I  - -  - 4.18.0-301.el8.x86_64 #1
[ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[ 7809.796265] Call Trace:
[ 7809.796265]  
[ 7809.796266]  dump_stack+0x5c/0x80
[ 7809.796266]  panic+0xe7/0x2a9
[ 7809.796267]  nmi_panic.cold.9+0xc/0xc
[ 7809.796267]  watchdog_overflow_callback.cold.7+0x5c/0x70
[ 7809.796268]  __perf_event_overflow+0x52/0xf0
[ 7809.796268]  handle_pmi_common+0x204/0x2a0
[ 7809.796269]  ? __set_pte_vaddr+0x32/0x50
[ 7809.796269]  ? __native_set_fixmap+0x24/0x30
[ 7809.796270]  ? ghes_copy_tofrom_phys+0xd3/0x1c0
[ 7809.796271]  intel_pmu_handle_irq+0xbf/0x160
[ 7809.796271]  perf_event_nmi_handler+0x2d/0x50
[ 7809.796272]  nmi_handle+0x63/0x110
[ 7809.796272]  default_do_nmi+0x49/0x100
[ 7809.796273]  do_nmi+0x17e/0x1e0
[ 7809.796273]  end_repeat_nmi+0x16/0x6f
[ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 
30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 
c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
[ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002
[ 7809.796277] RAX: 0101 RBX: 0246 RCX: 
[ 7809.796278] RDX:  RSI:  RDI: 936b66d0
[ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f
[ 7809.796279] R10:  R11: aa54cd887cc0 R12: 907fd0a29ec0
[ 7809.796280] R13:  R14: 926ab7c0 R15: 
[ 7809.796280]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  
[ 7809.796282]  _raw_spin_lock_irqsave+0x32/0x40
[ 7809.796283]  print_cpu+0x261/0x7c0
[ 7809.796283]  sysrq_sched_debug_show+0x34/0x50
[ 7809.796284]  sysrq_handle_showstate+0xc/0x20
[ 7809.796284]  __handle_sysrq.cold.11+0x48/0xfb
[ 7809.796285]  write_sysrq_trigger+0x2b/0x30
[ 7809.796285]  proc_reg_write+0x39/0x60
[ 7809.796286]  vfs_write+0xa5/0x1a0
[ 7809.796286]  ksys_write+0x4f/0xb0
[ 7809.796287]  do_syscall_64+0x5b/0x1a0
[ 7809.796287]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7809.796288] RIP: 0033:0x7fabe4ceb648

The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk() calls
don't need serialization from sched_debug_lock.

Calling printk() with interrupt disabled can still be
problematic. Allocating a stack buffer of PATH_MAX bytes is not
feasible. So a compromise solution is used where a small stack buffer
is allocated for pathname. If the actual pathname is short enough, it
is copied to the stack buffer with sched_debug_lock release afterward
before printk().  Otherwise, the global group_path[] buffer will be
used with sched_debug_lock held until after printk().

Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Signed-off-by: Waiman Long 
---
 kernel/sched/debug.c | 33 +
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..191ccb111cb4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
+static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
 
 static char *task_group_path(struct task_group *tg)
@@ -481,6 +480,27 @@ static char *task_group_path(struct task_group *tg)
 
return group_path;
 }
+
+#define SEQ_printf_task_group_path(m, tg, fmt...)  \
+{  \
+   unsigned long flags;\
+   char *path, buf[128];  
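
The archived diff breaks off above. As a rough sketch of the copy-then-release
behaviour the changelog describes (written as a helper rather than the posted
macro; SEQ_printf, task_group_path and sched_debug_lock as in
kernel/sched/debug.c):

static void seq_printf_path_sketch(struct seq_file *m, struct task_group *tg)
{
	unsigned long flags;
	char buf[128];
	char *path;

	spin_lock_irqsave(&sched_debug_lock, flags);
	path = task_group_path(tg);
	if (strlen(path) < sizeof(buf)) {
		/* Short path: copy it and drop the lock before printing. */
		strscpy(buf, path, sizeof(buf));
		spin_unlock_irqrestore(&sched_debug_lock, flags);
		SEQ_printf(m, "%s", buf);
	} else {
		/* Long path: keep the lock across the print. */
		SEQ_printf(m, "%s", path);
		spin_unlock_irqrestore(&sched_debug_lock, flags);
	}
}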

Re: [PATCH 1/2] sched/debug: Don't disable IRQ when acquiring sched_debug_lock

2021-03-29 Thread Waiman Long

On 3/29/21 6:23 AM, Daniel Thompson wrote:

On Sat, Mar 27, 2021 at 07:25:28PM -0400, Waiman Long wrote:

The sched_debug_lock was used only in print_cpu().  The
print_cpu() function has two callers - sched_debug_show() and
sysrq_sched_debug_show(). Both of them are invoked by user action
(sched_debug file and sysrq-t). As print_cpu() won't be called from
interrupt context at all, there is no point in disabling IRQ when
acquiring sched_debug_lock.

This looks like it introduces a deadlock risk if sysrq-t triggers from an
interrupt context. Has the behaviour of sysrq changed recently or will
tools like MAGIC_SYSRQ_SERIAL still trigger from interrupt context?


You are right. It looks like if CONFIG_MAGIC_SYSRQ_SERIAL is set, 
it is possible for the sysrq information dump to happen in an interrupt 
context. I missed that in my initial analysis. However, doing the 
time-consuming info dump with interrupts disabled for an extended period of 
time is not a good idea.


Still, with my second patch minimizing the size of the critical 
sections, the chance of causing trouble is reduced except when 
doing it directly from the console.


Thanks,
Longman



[PATCH v2] x86/apic/vector: Move pr_warn() out of vector_lock

2021-03-28 Thread Waiman Long
.097981]   lock(console_owner);
[  218.097983]
[  218.097984]  *** DEADLOCK ***
[  218.097984]
[  218.097985] 6 locks held by systemd/1:
[  218.097986]  #0: 88822b5cc1e8 (&tty->legacy_mutex){+.+.}, at: 
tty_init_dev+0x79/0x440
[  218.097989]  #1: 88832ee00770 (&port->mutex){+.+.}, at: 
tty_port_open+0x85/0x190
[  218.097993]  #2: 88813be85a88 (&desc->request_mutex){+.+.}, at: 
__setup_irq+0x249/0x1e60
[  218.097996]  #3: 88813be858c0 (&irq_desc_lock_class){-.-.}, at: 
__setup_irq+0x2d9/0x1e60
[  218.098000]  #4: 84afca78 (vector_lock){-.-.}, at: 
x86_vector_activate+0xca/0xab0
[  218.098003]  #5: 84c27e20 (console_lock){+.+.}, at: 
vprintk_emit+0x13a/0x450
[  218.098007]
[  218.098007] stack backtrace:
[  218.098009] CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0-228.el8.x86_64+debug 
#1
[  218.098010] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 03/07/2016
[  218.098010] Call Trace:
[  218.098011]  dump_stack+0x9a/0xf0
[  218.098012]  check_noncircular+0x317/0x3c0
[  218.098013]  ? print_circular_bug+0x1e0/0x1e0
[  218.098014]  ? do_profile_hits.isra.4.cold.9+0x2d/0x2d
[  218.098015]  ? sched_clock+0x5/0x10
[  218.098015]  __lock_acquire+0x2894/0x7300
[  218.098016]  ? trace_hardirqs_on+0x10/0x10
[  218.098017]  ? lock_downgrade+0x6f0/0x6f0
[  218.098018]  lock_acquire+0x14f/0x3b0
[  218.098019]  ? console_unlock+0x3fb/0x9f0
[  218.098019]  console_unlock+0x45d/0x9f0
[  218.098020]  ? console_unlock+0x3fb/0x9f0
[  218.098021]  vprintk_emit+0x147/0x450
[  218.098022]  ? apic_update_irq_cfg+0x3f6/0x520
[  218.098023]  printk+0x9f/0xc5
[  218.098023]  ? kmsg_dump_rewind_nolock+0xd9/0xd9
[  218.098024]  x86_vector_activate+0x3f8/0xab0
[  218.098025]  ? find_held_lock+0x3a/0x1c0
[  218.098026]  __irq_domain_activate_irq+0xdb/0x160
[  218.098027]  __irq_domain_activate_irq+0x7d/0x160
[  218.098028]  __irq_domain_activate_irq+0x7d/0x160
[  218.098029]  irq_domain_activate_irq+0x95/0x130
[  218.098029]  ? __init_waitqueue_head+0x3a/0x90
[  218.098030]  __setup_irq+0x861/0x1e60
[  218.098031]  ? kmem_cache_alloc_trace+0x169/0x360
[  218.098032]  ? request_threaded_irq+0xf9/0x2c0
[  218.098033]  request_threaded_irq+0x1e7/0x2c0
[  218.098033]  univ8250_setup_irq+0x6ef/0x8b0
[  218.098034]  serial8250_do_startup+0x7d4/0x1ed0
[  218.098035]  uart_startup.part.7+0x16f/0x4b0
[  218.098036]  tty_port_open+0x112/0x190
[  218.098037]  ? _raw_spin_unlock+0x1f/0x30
[  218.098037]  uart_open+0xe0/0x150
[  218.098038]  tty_open+0x229/0x9a0
[  218.098039]  ? lock_acquire+0x14f/0x3b0
[  218.098040]  ? tty_init_dev+0x440/0x440
[  218.098041]  ? selinux_file_open+0x2cf/0x400
[  218.098041]  chrdev_open+0x1e0/0x4e0
[  218.098042]  ? cdev_put.part.0+0x50/0x50
[  218.098043]  do_dentry_open+0x3d9/0xea0
[  218.098044]  ? cdev_put.part.0+0x50/0x50
[  218.098045]  ? devcgroup_check_permission+0x17c/0x2f0
[  218.098046]  ? __x64_sys_fchdir+0x180/0x180
[  218.098046]  ? security_inode_permission+0x79/0xc0
[  218.098047]  path_openat+0xb73/0x2690
[  218.098048]  ? save_stack+0x81/0xb0
[  218.098049]  ? path_lookupat+0x820/0x820
[  218.098049]  ? sched_clock+0x5/0x10
[  218.098050]  ? sched_clock_cpu+0x18/0x1e0
[  218.098051]  ? do_syscall_64+0xa5/0x4d0
[  218.098052]  ? entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  218.098053]  ? __lock_acquire+0x646/0x7300
[  218.098054]  ? sched_clock_cpu+0x18/0x1e0
[  218.098054]  ? find_held_lock+0x3a/0x1c0
[  218.098055]  do_filp_open+0x17c/0x250
[  218.098056]  ? may_open_dev+0xc0/0xc0
[  218.098057]  ? do_raw_spin_unlock+0x54/0x230
[  218.098058]  ? _raw_spin_unlock+0x1f/0x30
[  218.098058]  ? _raw_spin_unlock_irq+0x24/0x40
[  218.098059]  do_sys_open+0x1d9/0x360
[  218.098060]  ? filp_open+0x50/0x50
[  218.098061]  ? entry_SYSCALL_64_after_hwframe+0x7a/0xdf
[  218.098062]  ? do_syscall_64+0x22/0x4d0
[  218.098062]  do_syscall_64+0xa5/0x4d0
[  218.098063]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  218.098064] RIP: 0033:0x7ff4ee7620d6

This lockdep warning was caused by the printing of the warning message:

[  218.095152] irq 3: Affinity broken due to vector space exhaustion.

To avoid this potential deadlock scenario, this patch moves all the
pr_warn() calls in the vector.c file out of the vector_lock critical
sections.

Signed-off-by: Waiman Long 
---
 arch/x86/kernel/apic/vector.c | 35 +--
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3c9c7492252f..79f1ce3a9539 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -385,7 +385,7 @@ static void x86_vector_deactivate(struct irq_domain *dom, 
struct irq_data *irqd)
raw_spin_unlock_irqrestore(&vector_lock, flags);
 }
 
-static int activate_reserved(struct irq_data *irqd)
+static int activate_reserved(struct irq_data *irqd, char *wbuf, size_t wsize)
 {
struct apic_chip_data *apicd = apic_chip_data(irqd);
int ret;
@@ -410,8 +410,9 @@ static int activate_reserved(struct irq_data *irqd)
 */
   

Re: [PATCH] x86/apic/vector: Move pr_warn() outside of vector_lock

2021-03-28 Thread Waiman Long

On 3/28/21 6:04 PM, Thomas Gleixner wrote:

Waiman,

On Sun, Mar 28 2021 at 15:58, Waiman Long wrote:

It was found that the following circular locking dependency warning
could happen in some systems:

[  218.097878] ==
[  218.097879] WARNING: possible circular locking dependency detected
[  218.097880] 4.18.0-228.el8.x86_64+debug #1 Not tainted
[  218.097881] --
[  218.097882] systemd/1 is trying to acquire lock:
[  218.097883] 84c27920 (console_owner){-.-.}, at: 
console_unlock+0x3fb/0x9f0
[  218.097886]
[  218.097887] but task is already holding lock:
[  218.097888] 84afca78 (vector_lock){-.-.}, at: 
x86_vector_activate+0xca/0xab0
[  218.097891]
[  218.097892] which lock already depends on the new lock.
 :
[  218.097966] other info that might help us debug this:
[  218.097967]
[  218.097967] Chain exists of:
[  218.097968]   console_owner --> &irq_desc_lock_class --> vector_lock
[  218.097972]
[  218.097973]  Possible unsafe locking scenario:
[  218.097973]
[  218.097974]CPU0CPU1
[  218.097975]
[  218.097975]   lock(vector_lock);
[  218.097977]lock(&irq_desc_lock_class);
[  218.097980]lock(vector_lock);
[  218.097981]   lock(console_owner);
[  218.097983]
[  218.097984]  *** DEADLOCK ***

can you please post the full lockdep output?


Will do.



This lockdep warning was caused by the printing of the warning message:

[  218.095152] irq 3: Affinity broken due to vector space exhaustion.

It looks like this warning message is relatively more common than
the other warnings in arch/x86/kernel/apic/vector.c. To avoid this
potential deadlock scenario, this patch moves all the pr_warn() calls
in the vector.c file outside of the vector_lock critical sections.

Definitely not.


-static int activate_reserved(struct irq_data *irqd)
+static int activate_reserved(struct irq_data *irqd, unsigned long flags,
+bool *unlocked)
  {
struct apic_chip_data *apicd = apic_chip_data(irqd);
int ret;
@@ -410,6 +411,8 @@ static int activate_reserved(struct irq_data *irqd)
 */
if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd),
irq_data_get_affinity_mask(irqd))) {
+   raw_spin_unlock_irqrestore(&vector_lock, flags);
+   *unlocked = true;

What?


pr_warn("irq %u: Affinity broken due to vector space 
exhaustion.\n",
irqd->irq);
}
@@ -446,6 +449,7 @@ static int x86_vector_activate(struct irq_domain *dom, 
struct irq_data *irqd,
  {
struct apic_chip_data *apicd = apic_chip_data(irqd);
unsigned long flags;
+   bool unlocked = false;
int ret = 0;
  
  	trace_vector_activate(irqd->irq, apicd->is_managed,

@@ -459,8 +463,9 @@ static int x86_vector_activate(struct irq_domain *dom, 
struct irq_data *irqd,
else if (apicd->is_managed)
ret = activate_managed(irqd);
else if (apicd->has_reserved)
-   ret = activate_reserved(irqd);
-   raw_spin_unlock_irqrestore(&vector_lock, flags);
+   ret = activate_reserved(irqd, flags, &unlocked);
+   if (!unlocked)
+   raw_spin_unlock_irqrestore(&vector_lock, flags);

Even moar what?


return ret;
  }

This turns that code into complete unreadable gunk. No way.


I am sorry that this part of the patch is sloppy. I will revise it to 
make it better.


Cheers,
Longman




[PATCH] x86/apic/vector: Move pr_warn() outside of vector_lock

2021-03-28 Thread Waiman Long
It was found that the following circular locking dependency warning
could happen in some systems:

[  218.097878] ==
[  218.097879] WARNING: possible circular locking dependency detected
[  218.097880] 4.18.0-228.el8.x86_64+debug #1 Not tainted
[  218.097881] --
[  218.097882] systemd/1 is trying to acquire lock:
[  218.097883] 84c27920 (console_owner){-.-.}, at: 
console_unlock+0x3fb/0x9f0
[  218.097886]
[  218.097887] but task is already holding lock:
[  218.097888] 84afca78 (vector_lock){-.-.}, at: 
x86_vector_activate+0xca/0xab0
[  218.097891]
[  218.097892] which lock already depends on the new lock.
:
[  218.097966] other info that might help us debug this:
[  218.097967]
[  218.097967] Chain exists of:
[  218.097968]   console_owner --> &irq_desc_lock_class --> vector_lock
[  218.097972]
[  218.097973]  Possible unsafe locking scenario:
[  218.097973]
[  218.097974]CPU0CPU1
[  218.097975]
[  218.097975]   lock(vector_lock);
[  218.097977]lock(&irq_desc_lock_class);
[  218.097980]lock(vector_lock);
[  218.097981]   lock(console_owner);
[  218.097983]
[  218.097984]  *** DEADLOCK ***

This lockdep warning was caused by the printing of the warning message:

[  218.095152] irq 3: Affinity broken due to vector space exhaustion.

It looks like this warning message is relatively more common than
the other warnings in arch/x86/kernel/apic/vector.c. To avoid this
potential deadlock scenario, this patch moves all the pr_warn() calls
in the vector.c file outside of the vector_lock critical sections.

Signed-off-by: Waiman Long 
---
 arch/x86/kernel/apic/vector.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3c9c7492252f..948ec592ef61 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -385,7 +385,8 @@ static void x86_vector_deactivate(struct irq_domain *dom, 
struct irq_data *irqd)
raw_spin_unlock_irqrestore(&vector_lock, flags);
 }
 
-static int activate_reserved(struct irq_data *irqd)
+static int activate_reserved(struct irq_data *irqd, unsigned long flags,
+bool *unlocked)
 {
struct apic_chip_data *apicd = apic_chip_data(irqd);
int ret;
@@ -410,6 +411,8 @@ static int activate_reserved(struct irq_data *irqd)
 */
if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd),
irq_data_get_affinity_mask(irqd))) {
+   raw_spin_unlock_irqrestore(&vector_lock, flags);
+   *unlocked = true;
pr_warn("irq %u: Affinity broken due to vector space 
exhaustion.\n",
irqd->irq);
}
@@ -446,6 +449,7 @@ static int x86_vector_activate(struct irq_domain *dom, 
struct irq_data *irqd,
 {
struct apic_chip_data *apicd = apic_chip_data(irqd);
unsigned long flags;
+   bool unlocked = false;
int ret = 0;
 
trace_vector_activate(irqd->irq, apicd->is_managed,
@@ -459,8 +463,9 @@ static int x86_vector_activate(struct irq_domain *dom, 
struct irq_data *irqd,
else if (apicd->is_managed)
ret = activate_managed(irqd);
else if (apicd->has_reserved)
-   ret = activate_reserved(irqd);
-   raw_spin_unlock_irqrestore(&vector_lock, flags);
+   ret = activate_reserved(irqd, flags, &unlocked);
+   if (!unlocked)
+   raw_spin_unlock_irqrestore(&vector_lock, flags);
return ret;
 }
 
@@ -989,6 +994,7 @@ void irq_force_complete_move(struct irq_desc *desc)
struct apic_chip_data *apicd;
struct irq_data *irqd;
unsigned int vector;
+   bool warn_move_in_progress = false;
 
/*
 * The function is called for all descriptors regardless of which
@@ -1064,12 +1070,14 @@ void irq_force_complete_move(struct irq_desc *desc)
 * so we have the necessary information when a problem in that
 * area arises.
 */
-   pr_warn("IRQ fixup: irq %d move in progress, old vector %d\n",
-   irqd->irq, vector);
+   warn_move_in_progress = true;
}
free_moved_vector(apicd);
 unlock:
raw_spin_unlock(&vector_lock);
+   if (warn_move_in_progress)
+   pr_warn("IRQ fixup: irq %d move in progress, old vector %d\n",
+   irqd->irq, vector);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -1079,25 +1087,24 @@ void irq_force_complete_move(struct irq_desc *desc)
  */
 int lapic_can_unplug_cpu(void)
 {
-   unsigned int rsvd, avl, tomove, cpu = smp_processor_id();
+   unsigned int rsvd = 0, avl, tomove, cpu = smp_processor_id

[PATCH 1/2] sched/debug: Don't disable IRQ when acquiring sched_debug_lock

2021-03-27 Thread Waiman Long
The sched_debug_lock was used only in print_cpu().  The
print_cpu() function has two callers - sched_debug_show() and
sysrq_sched_debug_show(). Both of them are invoked by user action
(sched_debug file and sysrq-t). As print_cpu() won't be called from
interrupt context at all, there is no point in disabling IRQ when
acquiring sched_debug_lock.

Besides, if the system has many cpus and the sched_debug_lock is somehow
busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic, like

[ 7809.796262] Kernel panic - not syncing: Hard LOCKUP
[ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G  
I  - -  - 4.18.0-301.el8.x86_64 #1
[ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[ 7809.796265] Call Trace:
[ 7809.796265]  
[ 7809.796266]  dump_stack+0x5c/0x80
[ 7809.796266]  panic+0xe7/0x2a9
[ 7809.796267]  nmi_panic.cold.9+0xc/0xc
[ 7809.796267]  watchdog_overflow_callback.cold.7+0x5c/0x70
[ 7809.796268]  __perf_event_overflow+0x52/0xf0
[ 7809.796268]  handle_pmi_common+0x204/0x2a0
[ 7809.796269]  ? __set_pte_vaddr+0x32/0x50
[ 7809.796269]  ? __native_set_fixmap+0x24/0x30
[ 7809.796270]  ? ghes_copy_tofrom_phys+0xd3/0x1c0
[ 7809.796271]  intel_pmu_handle_irq+0xbf/0x160
[ 7809.796271]  perf_event_nmi_handler+0x2d/0x50
[ 7809.796272]  nmi_handle+0x63/0x110
[ 7809.796272]  default_do_nmi+0x49/0x100
[ 7809.796273]  do_nmi+0x17e/0x1e0
[ 7809.796273]  end_repeat_nmi+0x16/0x6f
[ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 
30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 
c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00
[ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002
[ 7809.796277] RAX: 0101 RBX: 0246 RCX: 
[ 7809.796278] RDX:  RSI:  RDI: 936b66d0
[ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f
[ 7809.796279] R10:  R11: aa54cd887cc0 R12: 907fd0a29ec0
[ 7809.796280] R13:  R14: 926ab7c0 R15: 
[ 7809.796280]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  ? native_queued_spin_lock_slowpath+0x5b/0x1d0
[ 7809.796281]  
[ 7809.796282]  _raw_spin_lock_irqsave+0x32/0x40
[ 7809.796283]  print_cpu+0x261/0x7c0
[ 7809.796283]  sysrq_sched_debug_show+0x34/0x50
[ 7809.796284]  sysrq_handle_showstate+0xc/0x20
[ 7809.796284]  __handle_sysrq.cold.11+0x48/0xfb
[ 7809.796285]  write_sysrq_trigger+0x2b/0x30
[ 7809.796285]  proc_reg_write+0x39/0x60
[ 7809.796286]  vfs_write+0xa5/0x1a0
[ 7809.796286]  ksys_write+0x4f/0xb0
[ 7809.796287]  do_syscall_64+0x5b/0x1a0
[ 7809.796287]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7809.796288] RIP: 0033:0x7fabe4ceb648

It is not acceptable to disable IRQ for so long that a hard lockup panic is
triggered. Fix that by not disabling IRQ when taking sched_debug_lock.

Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Signed-off-by: Waiman Long 
---
 kernel/sched/debug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..c4ae8a0853a1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -666,7 +666,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq 
*dl_rq)
 static void print_cpu(struct seq_file *m, int cpu)
 {
struct rq *rq = cpu_rq(cpu);
-   unsigned long flags;
 
 #ifdef CONFIG_X86
{
@@ -717,13 +716,13 @@ do {  
\
}
 #undef P
 
-   spin_lock_irqsave(&sched_debug_lock, flags);
+   spin_lock(&sched_debug_lock);
print_cfs_stats(m, cpu);
print_rt_stats(m, cpu);
print_dl_stats(m, cpu);
 
print_rq(m, rq, cpu);
-   spin_unlock_irqrestore(&sched_debug_lock, flags);
+   spin_unlock(&sched_debug_lock);
SEQ_printf(m, "\n");
 }
 
-- 
2.18.1



[PATCH 2/2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only

2021-03-27 Thread Waiman Long
The purpose of sched_debug_lock is to serialize the use of the global
cgroup_path[] buffer in print_cpu(). The rest of the printk calls
don't need serialization from sched_debug_lock. The printing of sched
debug data to the console can take quite a while; taking sched_debug_lock
at the print_cpu() level will unnecessarily block the progress of other
print_cpu() users. Fix that by holding sched_debug_lock only when using
cgroup_path[] via task_group_path().

Signed-off-by: Waiman Long 
---
 kernel/sched/debug.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c4ae8a0853a1..87168d13714f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
+static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
 
 static char *task_group_path(struct task_group *tg)
@@ -506,7 +505,9 @@ print_task(struct seq_file *m, struct rq *rq, struct 
task_struct *p)
SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
 #endif
 #ifdef CONFIG_CGROUP_SCHED
+   spin_lock(&sched_debug_lock);
SEQ_printf(m, " %s", task_group_path(task_group(p)));
+   spin_unlock(&sched_debug_lock);
 #endif
 
SEQ_printf(m, "\n");
@@ -543,7 +544,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct 
cfs_rq *cfs_rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, "\n");
+   spin_lock(&sched_debug_lock);
SEQ_printf(m, "cfs_rq[%d]:%s\n", cpu, task_group_path(cfs_rq->tg));
+   spin_unlock(&sched_debug_lock);
 #else
SEQ_printf(m, "\n");
SEQ_printf(m, "cfs_rq[%d]:\n", cpu);
@@ -614,7 +617,9 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq 
*rt_rq)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
SEQ_printf(m, "\n");
+   spin_lock(&sched_debug_lock);
SEQ_printf(m, "rt_rq[%d]:%s\n", cpu, task_group_path(rt_rq->tg));
+   spin_unlock(&sched_debug_lock);
 #else
SEQ_printf(m, "\n");
SEQ_printf(m, "rt_rq[%d]:\n", cpu);
@@ -716,13 +721,11 @@ do {  
\
}
 #undef P
 
-   spin_lock(&sched_debug_lock);
print_cfs_stats(m, cpu);
print_rt_stats(m, cpu);
print_dl_stats(m, cpu);
 
print_rq(m, rq, cpu);
-   spin_unlock(&sched_debug_lock);
SEQ_printf(m, "\n");
 }
 
-- 
2.18.1



Re: [PATCH v4 3/4] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

2021-03-27 Thread Waiman Long

On 3/27/21 2:06 PM, guo...@kernel.org wrote:

From: Guo Ren 

Some architectures don't have a sub-word swap atomic instruction;
they only have the full-word one.

The sub-word swap only improves the performance when:
NR_CPUS < 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-15: not used
  * 16-17: tail index
  * 18-31: tail cpu (+1)

Bits 9-15 are wasted in order to use xchg16 in xchg_tail.

Please let the architecture select xchg16/xchg32 to implement
xchg_tail.

Signed-off-by: Guo Ren 
Cc: Peter Zijlstra 
Cc: Will Deacon 
Cc: Ingo Molnar 
Cc: Waiman Long 
Cc: Arnd Bergmann 
Cc: Anup Patel 
---
  kernel/Kconfig.locks   |  3 +++
  kernel/locking/qspinlock.c | 44 +-
  2 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3de8fd11873b..d02f1261f73f 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -239,6 +239,9 @@ config LOCK_SPIN_ON_OWNER
  config ARCH_USE_QUEUED_SPINLOCKS
bool
  
+config ARCH_USE_QUEUED_SPINLOCKS_XCHG32

+   bool
+
  config QUEUED_SPINLOCKS
def_bool y if ARCH_USE_QUEUED_SPINLOCKS
depends on SMP
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..54de0632c6a8 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -163,26 +163,6 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
  }
  
-/*

- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail), which heads an address dependency
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
-   /*
-* We can use relaxed semantics since the caller ensures that the
-* MCS node is properly initialized before updating the tail.
-*/
-   return (u32)xchg_relaxed(&lock->tail,
-tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
-}
-
  #else /* _Q_PENDING_BITS == 8 */
  
  /**

@@ -206,6 +186,30 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
  {
atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
  }
+#endif
+
+#if _Q_PENDING_BITS == 8 && !defined(CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32)
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail), which heads an address dependency
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   /*
+* We can use relaxed semantics since the caller ensures that the
+* MCS node is properly initialized before updating the tail.
+*/
+   return (u32)xchg_relaxed(&lock->tail,
+tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else
  
  /**

   * xchg_tail - Put in the new queue tail code word & retrieve previous one


I don't have any problem adding a 
CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32 config option to control that.


One minor nit:

#endif /* _Q_PENDING_BITS == 8 */

You should probably remove the comment at the trailing end of the 
corresponding "#endif" as it is now wrong.


Cheers,
Longman
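
For readers following the tail-layout description in the quoted changelog, this
is roughly how mainline kernel/locking/qspinlock.c packs the tail word; shown
for context only, relying on the _Q_TAIL_* constants from qspinlock_types.h:

/*
 * When NR_CPUS < 16K, bits 16-17 hold the per-cpu MCS node index and
 * bits 18-31 hold cpu + 1, which is why a 16-bit xchg on the upper half
 * of ->val is enough to publish a new queue tail.
 */
static inline __pure u32 encode_tail(int cpu, int idx)
{
	u32 tail;

	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	tail |= idx << _Q_TAIL_IDX_OFFSET;	/* assume < 4 */

	return tail;
}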



Re: [PATCH] locking/mutex: Remove repeated declaration

2021-03-22 Thread Waiman Long

On 3/22/21 9:09 PM, Shaokun Zhang wrote:

Commit 0cd39f4600ed ("locking/seqlock, headers: Untangle the spaghetti monster")
introduces 'struct ww_acquire_ctx' again, remove the repeated declaration.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Will Deacon 
Cc: Waiman Long 
Cc: Boqun Feng 
Signed-off-by: Shaokun Zhang 
---
  include/linux/mutex.h | 2 --
  1 file changed, 2 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 0cd631a19727..d80c0e22c822 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -20,8 +20,6 @@
  #include 
  #include 
  
-struct ww_acquire_ctx;

-
  /*
   * Simple, straightforward mutexes with strict semantics:
   *


Yes, it is duplicated.

Acked-by: Waiman Long 



Re: [PATCH] lockdep: add a hint for "INFO: trying to register non-static key." message

2021-03-22 Thread Waiman Long

On 3/21/21 2:49 AM, Tetsuo Handa wrote:

Since this message is printed when dynamically allocated spinlocks (e.g.
from kzalloc()) are used without initialization (e.g. spin_lock_init()),
suggest that developers check whether the initialization functions for their
objects are called, before they start wondering what annotation is missing.

Signed-off-by: Tetsuo Handa 
---
  kernel/locking/lockdep.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index c6d0c1dc6253..44c549f5c061 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -931,6 +931,7 @@ static bool assign_lock_key(struct lockdep_map *lock)
debug_locks_off();
pr_err("INFO: trying to register non-static key.\n");
pr_err("the code is fine but needs lockdep annotation.\n");
+   pr_err("maybe you didn't initialize this object before you 
use?\n");
pr_err("turning off the locking correctness validator.\n");
dump_stack();
return false;


The only way this message is written is when the appropriate lock init 
function isn't called for locks in dynamically allocated objects. I 
think you can just say so without the word "maybe". Like "the code is 
fine but needs lockdep annotation by calling appropriate lock 
initialization function.".


Cheers,
Longman
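
A minimal example of the situation described above (the struct and function
names are made up for illustration; the APIs are the usual ones): a spinlock
embedded in dynamically allocated memory must be initialized explicitly, or
lockdep reports "trying to register non-static key".

#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo_sketch {
	spinlock_t lock;
	int value;
};

static struct foo_sketch *foo_alloc_sketch(void)
{
	struct foo_sketch *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;

	spin_lock_init(&f->lock);	/* without this, lockdep complains on first use */
	return f;
}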


