Re: [syzbot] [mm?] possible deadlock in __mmap_lock_do_trace_start_locking
On 6/16/24 10:05, syzbot wrote:
> syzbot has bisected this issue to:
>
> commit 21c38a3bd4ee3fb7337d013a638302fb5e5f9dc2
> Author: Jesper Dangaard Brouer
> Date:   Wed May 1 14:04:11 2024 +
>
>     cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
>
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1669526198
> start commit:   36534d3c5453 tcp: use signed arithmetic in tcp_rtx_probe0_..
> git tree:       bpf
> final oops:     https://syzkaller.appspot.com/x/report.txt?x=1569526198
> console output: https://syzkaller.appspot.com/x/log.txt?x=1169526198
> kernel config:  https://syzkaller.appspot.com/x/.config?x=333ebe38d43c42e2
> dashboard link: https://syzkaller.appspot.com/bug?extid=6ff90931779bcdfc840c
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1585acfa98
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17bdb7ee98
>
> Reported-by: syzbot+6ff90931779bcdfc8...@syzkaller.appspotmail.com
> Fixes: 21c38a3bd4ee ("cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints")
>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

+static __always_inline
+unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
+				     struct cgroup *cgrp, const bool fast_path)
+{
+	unsigned long flags;
+	bool contended;
+
+	/*
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
+	 */
+	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
+	if (contended) {
+		if (fast_path)
+			trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
+		else
+			trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+
+		raw_spin_lock_irqsave(cpu_lock, flags);
+	}

I believe the problem may be caused by the fact that
trace_cgroup_rstat_cpu_lock_contended*() can be called with IRQs enabled. I
had suggested before that IRQs should be disabled first before doing any
trace operation. See
https://lore.kernel.org/linux-mm/203fdb35-f4cf-4754-9709-3c024eeca...@redhat.com/

Doing so may be able to resolve this possible deadlock.

Cheers,
Longman
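The reordering Longman is asking for can be sketched in plain C. This is a minimal userspace model, not the actual kernel fix: `toy_trylock()`/`toy_lock()` stand in for `raw_spin_trylock_irqsave()`/`raw_spin_lock_irqsave()` (both "disable interrupts" on success), and `trace_contended()` stands in for the tracepoints:

```c
#include <stdbool.h>

static int trace_events;        /* stand-in for the contention tracepoint */
static bool irqs_off;           /* stand-in for the saved IRQ state */

/* Toy trylock/lock on a plain int; both "disable interrupts" on success. */
static bool toy_trylock(int *lock)
{
	if (*lock)
		return false;
	*lock = 1;
	irqs_off = true;
	return true;
}

static void toy_lock(int *lock)
{
	*lock = 1;              /* pretend we spun and won */
	irqs_off = true;
}

static void trace_contended(void)
{
	trace_events++;
}

/* Suggested ordering: acquire the lock (which disables IRQs) first, and
 * only then emit the contention tracepoint, so the tracepoint can never
 * run with interrupts enabled. */
static void cpu_lock(int *lock)
{
	bool contended = !toy_trylock(lock);

	if (contended) {
		toy_lock(lock);         /* IRQs are off from here on */
		trace_contended();      /* now safe: IRQs already disabled */
	}
}
```

The contention information is preserved (the tracepoint still fires on the slow path), but it can no longer race with an interrupt-context acquisition of the same lock.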
Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
On 5/3/24 17:10, David Laight wrote:
> From: Waiman Long
> Sent: 03 May 2024 17:00
> ...
>> David,
>>
>> Could you respin the series based on the latest upstream code?
> I've just reapplied the patches to 'master' and they all apply cleanly
> and diffing the new patches to the old ones gives no differences.
> So I think they should still apply.
>
> Were you seeing a specific problem?
> I don't remember any suggested changes either.
> (Apart from a very local variable I used to keep a patch isolated.)

No, I just want to make sure that your patches will still apply. Anyway, it
will be easier for the maintainer to merge your remaining patches if you can
send out a new version, even if they are almost the same as the old ones.

Thanks,
Longman
Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
On 12/31/23 23:14, Waiman Long wrote:
> On 12/31/23 16:55, David Laight wrote:
>> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
>> This requires the cpu number be 64bit.
>> However the value in osq_lock() comes from a 32bit xchg() and there
>> isn't a way of telling gcc the high bits are zero (they are) so
>> there will always be an instruction to clear the high bits.
>>
>> The cpu number is also offset by one (to make the initialiser 0).
>> It seems to be impossible to get gcc to convert
>> __per_cpu_offset[cpu_p1 - 1] into (__per_cpu_offset - 1)[cpu_p1]
>> (transferring the offset to the address).
>>
>> Converting the cpu number to 32bit unsigned prior to the decrement means
>> that gcc knows the decrement has set the high bits to zero and doesn't
>> add a register-register move (or cltq) to zero/sign extend the value.
>>
>> Not massive, but saves two instructions.
>>
>> Signed-off-by: David Laight
>> ---
>>  kernel/locking/osq_lock.c | 6 ++----
>>  1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 35bb99e96697..37a4fa872989 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
>>       return cpu_nr + 1;
>>   }
>>
>> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
>>   {
>> -    int cpu_nr = encoded_cpu_val - 1;
>> -
>> -    return per_cpu_ptr(&osq_node, cpu_nr);
>> +    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>>   }
>>
>>   /*
> You really like micro-optimization. Anyway,
>
> Reviewed-by: Waiman Long

David,

Could you respin the series based on the latest upstream code?

Thanks,
Longman
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 3/20/24 12:20, Steven Rostedt wrote: From: Steven Rostedt (Google) I'm debugging some latency issues on a Chromebook and the preemptirqsoff tracer hit this: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty # # latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) #- #| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0) #- # => started at: rwsem_optimistic_spin # => ended at: rwsem_optimistic_spin # # #_--=> CPU# # / _-=> irqs-off # | / _=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # / _-=> migrate-disable # | / delay # cmd pid || time | caller # \ /|| \|/ <...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194 <...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7824us : => rwsem_optimistic_spin+0x140/0x194 => rwsem_down_write_slowpath+0xc9/0x3fe => copy_process+0xd08/0x1b8a => kernel_clone+0x94/0x256 => __x64_sys_clone+0x7a/0x9a => do_syscall_64+0x51/0xa1 => entry_SYSCALL_64_after_hwframe+0x5c/0xc6 Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that spins with interrupts enabled, preempt disabled, and checks for need_resched(). If it is true, it breaks out and schedules. Hence, it's hiding real issues, because it can spin for a very long time and this is not the source of the latency I'm tracking down. I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Signed-off-by: Steven Rostedt (Google) I have no objection to that. However, there are now 2 function call overhead in each iteration if either CONFIG_IRQSOFF_TRACER or CONFIG_PREEMPT_TRACER is on. Is it possible to do it with just one function call? IOW, make restart_critical_timings() a real function. 
Cheers,
Longman

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do { \
 # define start_critical_timings() do { } while (0)
 #endif
 
+/* Used in spins that check need_resched() with preemption disabled */
+static inline void restart_critical_timings(void)
+{
+	stop_critical_timings();
+	start_critical_timings();
+}
+
 #ifdef CONFIG_DEBUG_IRQFLAGS
 extern void warn_bogus_irq_restore(void);
 #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #ifndef CONFIG_PREEMPT_RT
@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 	 */
 	barrier();
 
+	restart_critical_timings();
 	if (need_resched() || !owner_on_cpu(owner)) {
 		state = OWNER_NONSPINNABLE;
 		break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 	 * a writer, need_resched() check needs to be done here.
 	 */
 	if (owner_state != OWNER_WRITER) {
+		restart_critical_timings();
 		if (need_resched())
 			break;
 		if (rt_task(current) &&
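Longman's suggestion — collapsing the two calls into one by making restart_critical_timings() a real (out-of-line) function — could look roughly like the sketch below. This is hypothetical, with simple counters standing in for the actual tracer hooks:

```c
static int stops, starts;

/* Counter stubs standing in for the real tracer hooks. */
static void stop_critical_timings(void)  { stops++; }
static void start_critical_timings(void) { starts++; }

/* Out-of-line definition: a spin loop then pays one function call per
 * iteration instead of two. The trade-off is that the pair can no
 * longer be inlined away when both tracers are compiled out. */
void restart_critical_timings(void)
{
	stop_critical_timings();
	start_critical_timings();
}
```

In the real kernel this would presumably live in the tracing code, with the header carrying only the `extern` declaration when either CONFIG_IRQSOFF_TRACER or CONFIG_PREEMPT_TRACER is enabled.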
Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
On 12/31/23 16:55, David Laight wrote:
> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value in osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0).
> It seems to be impossible to get gcc to convert
> __per_cpu_offset[cpu_p1 - 1] into (__per_cpu_offset - 1)[cpu_p1]
> (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive, but saves two instructions.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
>       return cpu_nr + 1;
>   }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
>   {
> -    int cpu_nr = encoded_cpu_val - 1;
> -
> -    return per_cpu_ptr(&osq_node, cpu_nr);
> +    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>   }
>
>   /*

You really like micro-optimization. Anyway,

Reviewed-by: Waiman Long
Re: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.
On 12/31/23 16:54, David Laight wrote:
> When osq_lock() returns false or osq_unlock() returns, static analysis
> shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to atomic_xchg(&lock->tail, curr) on entry to osq_lock().
> Just in case there is a non-obvious race condition that can leave it
> non-NULL, check with WARN_ON_ONCE() and NULL it if set.
> Note that without this check the fast path (adding at the list head)
> doesn't need to access the per-cpu osq_node at all.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 27324b509f68..35bb99e96697 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
>   bool osq_lock(struct optimistic_spin_queue *lock)
>   {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> -	struct optimistic_spin_node *prev, *next;
> +	struct optimistic_spin_node *node, *prev, *next;
>  	int curr = encode_cpu(smp_processor_id());
>  	int prev_cpu;
>
> -	node->next = NULL;
> +	/*
> +	 * node->next should be NULL on entry.
> +	 * Check just in case there is a race somewhere.
> +	 * Note that this is probably an unnecessary cache miss in the fast path.
> +	 */
> +	if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
> +		raw_cpu_write(osq_node.next, NULL);
>
>  	/*
>  	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	if (prev_cpu == OSQ_UNLOCKED_VAL)
>  		return true;
>
> -	node->prev_cpu = prev_cpu;
> +	node = this_cpu_ptr(&osq_node);
>  	prev = decode_cpu(prev_cpu);
> +	node->prev_cpu = prev_cpu;
>  	node->locked = 0;
>
>  	/*

Reviewed-by: Waiman Long
Re: [PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.
On 12/31/23 16:54, David Laight wrote:
> node->prev is only used to update 'prev' in the unlikely case
> of concurrent unqueues.
> This can be replaced by a check for node->prev_cpu changing
> and then calling decode_cpu() to get the changed 'prev' pointer.
>
> node->cpu (or more particularly prev->cpu) is only used for the
> osq_wait_next() call in the unqueue path.
> Normally this is exactly the value that the initial xchg() read
> from lock->tail (used to obtain 'prev'), but can get updated
> by concurrent unqueues.
>
> Both the 'prev' and 'cpu' members of optimistic_spin_node are now
> unused and can be deleted.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 31 +++++++++++++++++--------------
>  1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index eb8a6dfdb79d..27324b509f68 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,9 +13,8 @@
>   */
>
>  struct optimistic_spin_node {
> -	struct optimistic_spin_node *next, *prev;
> +	struct optimistic_spin_node *next;
>  	int locked;	/* 1 if lock acquired */
> -	int cpu;	/* encoded CPU # + 1 value */
>  	int prev_cpu;	/* encoded CPU # + 1 value */
>  };
>
> @@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>  	struct optimistic_spin_node *prev, *next;
>  	int curr = encode_cpu(smp_processor_id());
> -	int old;
> +	int prev_cpu;
>
>  	node->next = NULL;
> -	node->cpu = curr;
>
>  	/*
>  	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * the node fields we just initialised) semantics when updating
>  	 * the lock tail.
>  	 */
> -	old = atomic_xchg(&lock->tail, curr);
> -	if (old == OSQ_UNLOCKED_VAL)
> +	prev_cpu = atomic_xchg(&lock->tail, curr);
> +	if (prev_cpu == OSQ_UNLOCKED_VAL)
>  		return true;
>
> -	node->prev_cpu = old;
> -	prev = decode_cpu(old);
> -	node->prev = prev;
> +	node->prev_cpu = prev_cpu;
> +	prev = decode_cpu(prev_cpu);
>  	node->locked = 0;
>
>  	/*
> @@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  		/*
>  		 * Or we race against a concurrent unqueue()'s step-B, in which
> -		 * case its step-C will write us a new @node->prev pointer.
> +		 * case its step-C will write us a new @node->prev_cpu value.
>  		 */
> -		prev = READ_ONCE(node->prev);
> +		{
> +			int new_prev_cpu = READ_ONCE(node->prev_cpu);
> +
> +			if (new_prev_cpu == prev_cpu)
> +				continue;
> +			prev_cpu = new_prev_cpu;
> +			prev = decode_cpu(prev_cpu);
> +		}

Just a minor nit. It is not that common in the kernel to add another nesting
level just to reduce the scope of the new_prev_cpu auto variable. Anyway,

Reviewed-by: Waiman Long
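The usual kernel style for Longman's nit would be to declare the variable at function scope instead of opening a bare `{ }` block. A self-contained model of the same retry logic, with `read_prev_cpu()` standing in for `READ_ONCE(node->prev_cpu)` (illustrative sketch, not the applied patch):

```c
/* Value that "another CPU" may update (models node->prev_cpu). */
static int current_prev_cpu = 2;

static int read_prev_cpu(void)
{
	return current_prev_cpu;        /* models READ_ONCE(node->prev_cpu) */
}

/* Returns the (possibly updated) prev_cpu, as the unqueue retry loop
 * would track it; new_prev_cpu lives at function scope, so no extra
 * nesting level is needed. */
static int track_prev_cpu(int prev_cpu)
{
	int new_prev_cpu;

	new_prev_cpu = read_prev_cpu();
	if (new_prev_cpu == prev_cpu)
		return prev_cpu;        /* 'continue' in the real loop */
	return new_prev_cpu;            /* caller would re-decode prev */
}
```

The behaviour is identical; only the scope of the temporary changes.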
Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.
On 12/31/23 16:52, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache
> miss when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index e0bc74d85a76..eb8a6dfdb79d 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>   struct optimistic_spin_node {
>  	struct optimistic_spin_node *next, *prev;
> -	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +	int locked;	/* 1 if lock acquired */
> +	int cpu;	/* encoded CPU # + 1 value */
> +	int prev_cpu;	/* encoded CPU # + 1 value */
>   };
>
>   static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>  	return cpu_nr + 1;
>   }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> -	return node->cpu - 1;
> -}
> -
>   static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>   {
>  	int cpu_nr = encoded_cpu_val - 1;
>
> @@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	if (old == OSQ_UNLOCKED_VAL)
>  		return true;
>
> -	node->locked = 0;
> +	node->prev_cpu = old;
>  	prev = decode_cpu(old);
>  	node->prev = prev;
> +	node->locked = 0;
>
>  	/*
>  	 * osq_lock() unqueue
> @@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * polling, be careful.
>  	 */
>  	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> -				  vcpu_is_preempted(node_cpu(node->prev))))
> +				  vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1)))
>  		return true;
>
>  	/* unqueue */
> @@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * it will wait in Step-A.
>  	 */
>
> +	WRITE_ONCE(next->prev_cpu, prev->cpu);
>  	WRITE_ONCE(next->prev, prev);
>  	WRITE_ONCE(prev->next, next);

Reviewed-by: Waiman Long
Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.
On 12/31/23 16:51, David Laight wrote:
> Since node->locked cannot be set before the assignment to prev->next,
> it is safe to clear it in the slow path.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 75a6f6133866..e0bc74d85a76 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	int curr = encode_cpu(smp_processor_id());
>  	int old;
>
> -	node->locked = 0;
>  	node->next = NULL;
>  	node->cpu = curr;
>
> @@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	if (old == OSQ_UNLOCKED_VAL)
>  		return true;
>
> +	node->locked = 0;
>  	prev = decode_cpu(old);
>  	node->prev = prev;

Reviewed-by: Waiman Long
Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
> Sent: 30 December 2023 11:09
>
>> * Waiman Long wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight
>>>> ---
>>>>  kernel/locking/osq_lock.c | 14 +++++++++-----
>>>>  1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>>   */
>>>>  struct optimistic_spin_node {
>>>> -	struct optimistic_spin_node *next, *prev;
>>>> +	struct optimistic_spin_node *self, *next, *prev;
>>>>  	int locked; /* 1 if lock acquired */
>>>>  	int cpu; /* encoded CPU # + 1 value */
>>>>  };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>>  bool osq_lock(struct optimistic_spin_queue *lock)
>>>>  {
>>>> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>>
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>>   92	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>   0x0029 <+25>:	mov    %rcx,%rdx
>>>   0x002c <+28>:	add    %gs:0x0(%rip),%rdx	# 0x34
>>>
>>> Which looks pretty optimized to me. Maybe an older compiler may
>>> generate more complex code. However, I do have some doubt as to the
>>> benefit of this patch at the expense of making the code a bit more
>>> complex.
> My changed code is one instruction shorter!
>
>   18: 65 48 8b 15 00 00 00	mov %gs:0x0(%rip),%rdx	# 20
>   1f: 00
>   		1c: R_X86_64_PC32	.data..percpu..shared_aligned-0x4
>
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler
>> efficiency: latest enterprise distros use GCC-11 or newer, while
>> recent desktop distros use GCC-13. Anything older won't matter,
>> because no major distribution is going to use new kernels with old
>> compilers.
>
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr() converted to per_cpu_ptr(, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
> raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
>
>   10:	49 89 fd		mov    %rdi,%r13
>   13:	49 c7 c4 00 00 00 00	mov    $0x0,%r12
>   		16: R_X86_64_32S	.data..percpu..shared_aligned
>   1a:	e8 00 00 00 00		callq  1f
>   		1b: R_X86_64_PC32	debug_smp_processor_id-0x4
>   1f:	89 c0			mov    %eax,%eax
>   21:	48 8b 1c c5 00 00 00	mov    0x0(,%rax,8),%rbx
>   28:	00
>   		25: R_X86_64_32S	__per_cpu_offset
>   29:	e8 00 00 00 00		callq  2e
>   		2a: R_X86_64_PC32	debug_smp_processor_id-0x4
>   2e:	4c 01 e3		add    %r12,%rbx
>   31:	83 c0 01		add    $0x1,%eax
>   34:	c7 43 10 00 00 00 00	movl   $0x0,0x10(%rbx)
>   3b:	48 c7 03 00 00 00 00	movq   $0x0,(%rbx)
>   42:	89 43 14		mov    %eax,0x14(%rbx)
>   45:	41 87 45 00		xchg   %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function
> rather than an offset from %gs.

I have looked up the definition of this_cpu_ptr() and gotten the following
results:

this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
 * Compared to the generic __my_cpu_offset version, the following
 * saves one instruction and avoids clobbering a temp register.
 */
#define arch_raw_cpu_ptr(ptr)				\
({							\
	unsigned long tcp_ptr__;			\
	asm ("add " __percpu_arg(1) ", %0"		\
	     : "=r" (tcp_ptr__)				\
	     : "m" (this_cpu_off), "0" (ptr));		\
	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
})

The presence of debug_smp_processor_id in your compiled code is likely due
to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.

#ifdef CONFIG_DEBUG_PREEMPT
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config, and so I only get 2
instructions for this_cpu_ptr(). We are not going to optimize the code
specifically for CONFIG_DEBUG_PREEMPT, so this patch should be dropped.

Cheers,
Longman
Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code
On 12/30/23 17:39, David Laight wrote:
> From: Linus Torvalds
> Sent: 30 December 2023 19:41
>
>> On Fri, 29 Dec 2023 at 12:52, David Laight wrote:
>>> David Laight (5):
>>>   Move the definition of optimistic_spin_node into osf_lock.c
>>>   Clarify osq_wait_next()
>>
>> I took these two as preparatory independent patches, with that
>> osq_wait_next() clarification split into two. I also did the renaming
>> that Waiman asked for.
> Ok, I'll try to remove them from any respin.
> (Or at least put them first!)

The first 2 patches have been included in Linus' master branch. So you
should just remove them from a respin.

Cheers,
Longman
Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
On 12/29/23 22:13, Waiman Long wrote:
> On 12/29/23 15:58, David Laight wrote:
>> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
>> cpu is no longer running.
>> Although patched out for bare-metal the code still needs the cpu number.
>> Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache
>> miss when osq_unlock() is waking up the next cpu.
>>
>> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
>> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>> This is simpler than checking for 'node->prev' changing and caching
>> 'prev->cpu'.
>>
>> Signed-off-by: David Laight
>> ---
>>  kernel/locking/osq_lock.c | 14 ++++++--------
>>  1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index b60b0add0161..89be63627434 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -14,8 +14,9 @@
>>   struct optimistic_spin_node {
>>  	struct optimistic_spin_node *self, *next, *prev;
>> -	int locked; /* 1 if lock acquired */
>> -	int cpu; /* encoded CPU # + 1 value */
>> +	int locked;	/* 1 if lock acquired */
>> +	int cpu;	/* encoded CPU # + 1 value */
>> +	int prev_cpu;	/* actual CPU # for vcpu_is_preempted() */
>>   };
>>
>>   static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>>
>> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>>  	return cpu_nr + 1;
>>   }
>>
>> -static inline int node_cpu(struct optimistic_spin_node *node)
>> -{
>> -	return node->cpu - 1;
>> -}
>> -
>>   static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>>   {
>>  	int cpu_nr = encoded_cpu_val - 1;
>>
>> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>  	if (old == OSQ_UNLOCKED_VAL)
>>  		return true;
>>
>> +	node->prev_cpu = old - 1;
>>  	prev = decode_cpu(old);
>>  	node->prev = prev;
>>  	node->locked = 0;
>>
>> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>  	 * polling, be careful.
>>  	 */
>>  	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
>> -				  vcpu_is_preempted(node_cpu(node->prev))))
>> +				  vcpu_is_preempted(node->prev_cpu)))

On second thought, I believe it is more correct to use READ_ONCE() to access
"node->prev_cpu" as this field is subject to change by a WRITE_ONCE().

Cheers,
Longman
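The READ_ONCE()/WRITE_ONCE() pairing Longman asks for can be illustrated with simplified userspace models of those macros (an illustrative sketch, not the kernel's full definitions):

```c
/* Simplified models of the kernel's accessors: the volatile cast forces
 * exactly one access and stops the compiler from tearing, refetching,
 * or caching the value in a register across loop iterations. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

struct node {
	int prev_cpu;   /* updated by another CPU with WRITE_ONCE() */
};

/* Writer side (another CPU in the unqueue path). */
static void set_prev_cpu(struct node *n, int cpu)
{
	WRITE_ONCE(n->prev_cpu, cpu);
}

/* Reader side: a field written with WRITE_ONCE() elsewhere should be
 * read with READ_ONCE() so each spin iteration sees a fresh value. */
static int get_prev_cpu(struct node *n)
{
	return READ_ONCE(n->prev_cpu);
}
```

Using the macros on both sides documents the data race to the compiler (and to tools like KCSAN) and keeps the accesses from being merged or split.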
Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
> node->locked is set non-zero by another cpu to force a wakeup.
> This can only happen after the 'prev->next = node' assignment,
> so locked can be set to zero just before that (along with the
> assignment to node->prev).
>
> Only initialise node->cpu once, after that use its value instead
> of smp_processor_id() - which is probably a real function call.
>
> Should reduce cache-line bouncing a little.
>
> Signed-off-by: David Laight
> ---
> Re-send without the 'RE:' on the subject line.
>
>  kernel/locking/osq_lock.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d414eef4bec6..55f5db896c02 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>  	      struct optimistic_spin_node *prev)
>   {
>  	struct optimistic_spin_node *next = NULL;
> -	int curr = encode_cpu(smp_processor_id());
> +	int curr = node->cpu;
>  	int old;
>
>  	/*
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   {
>  	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>  	struct optimistic_spin_node *prev, *next;
> -	int curr = encode_cpu(smp_processor_id());
>  	int old;
>
> -	node->locked = 0;
> -	node->next = NULL;

I have some concern about not clearing node->next at the beginning. I know
that next is supposed to be cleared at the end. I do have some worry that
there may exist a race condition that leaves next not cleared before it is
used again. So you either have to prove that such a race does not exist, or
initialize it to NULL at the beginning like it is done today.

Cheers,
Longman

> -	node->cpu = curr;
> +	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> +		node->cpu = encode_cpu(smp_processor_id());
>
>  	/*
>  	 * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * the node fields we just initialised) semantics when updating
>  	 * the lock tail.
>  	 */
> -	old = atomic_xchg(&lock->tail, curr);
> +	old = atomic_xchg(&lock->tail, node->cpu);
>  	if (old == OSQ_UNLOCKED_VAL)
>  		return true;
>
>  	prev = decode_cpu(old);
>  	node->prev = prev;
> +	node->locked = 0;
>
>  	/*
>  	 * osq_lock() unqueue
> @@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   void osq_unlock(struct optimistic_spin_queue *lock)
>   {
>  	struct optimistic_spin_node *node, *next;
> -	int curr = encode_cpu(smp_processor_id());
> +	int curr = raw_cpu_read(osq_node.cpu);
>
>  	/*
>  	 * Fast path for the uncontended case.
Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
On 12/29/23 15:58, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache
> miss when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index b60b0add0161..89be63627434 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>   struct optimistic_spin_node {
>  	struct optimistic_spin_node *self, *next, *prev;
> -	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +	int locked;	/* 1 if lock acquired */
> +	int cpu;	/* encoded CPU # + 1 value */
> +	int prev_cpu;	/* actual CPU # for vcpu_is_preempted() */
>   };
>
>   static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>  	return cpu_nr + 1;
>   }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> -	return node->cpu - 1;
> -}
> -
>   static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>   {
>  	int cpu_nr = encoded_cpu_val - 1;
>
> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	if (old == OSQ_UNLOCKED_VAL)
>  		return true;
>
> +	node->prev_cpu = old - 1;
>  	prev = decode_cpu(old);
>  	node->prev = prev;
>  	node->locked = 0;
>
> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * polling, be careful.
>  	 */
>  	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> -				  vcpu_is_preempted(node_cpu(node->prev))))
> +				  vcpu_is_preempted(node->prev_cpu)))
>  		return true;
>
>  	/* unqueue */
> @@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>  	 * it will wait in Step-A.
>  	 */
>
> +	WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
>  	WRITE_ONCE(next->prev, prev);
>  	WRITE_ONCE(prev->next, next);

Reviewed-by: Waiman Long
Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
>   */
>  struct optimistic_spin_node {
> -	struct optimistic_spin_node *next, *prev;
> +	struct optimistic_spin_node *self, *next, *prev;
>  	int locked; /* 1 if lock acquired */
>  	int cpu; /* encoded CPU # + 1 value */
>  };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

  92	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
  0x0029 <+25>:	mov    %rcx,%rdx
  0x002c <+28>:	add    %gs:0x0(%rip),%rdx	# 0x34

Which looks pretty optimized to me. Maybe an older compiler may generate
more complex code. However, I do have some doubt as to the benefit of this
patch at the expense of making the code a bit more complex.

Cheers,
Longman
Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
On 12/29/23 15:56, David Laight wrote:
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from
> osq_unlock() but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.
>
> Should have no effect on the generated code since gcc manages to
> assume that 'prev != NULL' due to an earlier dereference.
>
> Signed-off-by: David Laight
> ---
>  kernel/locking/osq_lock.c | 23 ++++++++++-------------
>  1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 55f5db896c02..9bb3a077ba92 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
>   static inline struct optimistic_spin_node *
>   osq_wait_next(struct optimistic_spin_queue *lock,
>  	       struct optimistic_spin_node *node,
> -	       struct optimistic_spin_node *prev)
> +	       int old)

Make the last argument name more descriptive, like "old_cpu", as the "int"
type does not provide enough context to allow people to guess what "old"
may be.

Cheers,
Longman
Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c
On 12/29/23 15:53, David Laight wrote: struct optimistic_spin_node is private to the implementation. Move it into the C file to ensure nothing is accessing it. Signed-off-by: David Laight --- include/linux/osq_lock.h | 5 - kernel/locking/osq_lock.c | 7 +++ 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h index 5581dbd3bd34..ea8fb31379e3 100644 --- a/include/linux/osq_lock.h +++ b/include/linux/osq_lock.h @@ -6,11 +6,6 @@ * An MCS like lock especially tailored for optimistic spinning for sleeping * lock implementations (mutex, rwsem, etc). */ -struct optimistic_spin_node { - struct optimistic_spin_node *next, *prev; - int locked; /* 1 if lock acquired */ - int cpu; /* encoded CPU # + 1 value */ -}; struct optimistic_spin_queue { /* diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index d5610ad52b92..d414eef4bec6 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -11,6 +11,13 @@ * called from interrupt context and we have preemption disabled while * spinning. */ + +struct optimistic_spin_node { + struct optimistic_spin_node *next, *prev; + int locked; /* 1 if lock acquired */ + int cpu; /* encoded CPU # + 1 value */ +}; + static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node); /* Please correct the patch title "osf_lock.c" => "osq_lock.c". After the fix, you can add Acked-by: Waiman Long
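Moving the definition out of the header is the standard C pattern for enforcing encapsulation: outside code can still embed and pass around the public queue type, while the node layout becomes invisible to it. A minimal sketch of the resulting split, collapsed into one file here purely for illustration:

```c
#include <assert.h>

/* include/linux/osq_lock.h keeps only what users of the lock need: */
struct optimistic_spin_queue {
    int tail;               /* atomic_t in the kernel; 0 means unlocked */
};

/* kernel/locking/osq_lock.c now privately owns the node layout: */
struct optimistic_spin_node {
    struct optimistic_spin_node *next, *prev;
    int locked;             /* 1 if lock acquired */
    int cpu;                /* encoded CPU # + 1 value */
};
```

After the move, any out-of-tree code that was poking at the node fields fails to compile, which is exactly the point of the patch.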
Re: [PATCH v2 00/12] device-core: Enable device_lock() lockdep validation
On 4/13/22 02:01, Dan Williams wrote: Changes since v1 [1]: - Improve the clarity of the cover letter and changelogs of the major patches (Patch2 and Patch12) (Pierre, Kevin, and Dave) - Fix device_lock_interruptible() false negative deadlock detection (Kevin) - Fix off-by-one error in the device_set_lock_class() enable case (Kevin) - Spelling fixes in Patch2 changelog (Pierre) - Compilation fixes when both CONFIG_CXL_BUS=n and CONFIG_LIBNVDIMM=n. (0day robot) [1]: https://lore.kernel.org/all/164610292916.2682974.12924748003366352335.st...@dwillia2-desk3.amr.corp.intel.com/ --- The device_lock() is why the lockdep_set_novalidate_class() API exists. The lock is taken in too many disparate contexts, and lockdep by design assumes that all device_lock() acquisitions are identical. The lack of lockdep coverage leads to deadlock scenarios landing upstream. To mitigate that problem the lockdep_mutex was added [2]. The lockdep_mutex lets a subsystem mirror device_lock() acquisitions without lockdep_set_novalidate_class() to gain some limited lockdep coverage. The mirroring approach is limited to taking the device_lock() after-the-fact in a subsystem's 'struct bus_type' operations and fails to cover device_lock() acquisition in the driver-core. It also can only track the needs of one subsystem at a time so, for example the kernel needs to be recompiled between CONFIG_PROVE_NVDIMM_LOCKING and CONFIG_PROVE_CXL_LOCKING depending on which subsystem is being regression tested. Obviously that also means that intra-subsystem locking dependencies can not be validated. Instead of using a fake lockdep_mutex, maybe you can just use a unique lockdep key for each subsystem and call lockdep_set_class() in the device_initialize() if such key is present or lockdep_set_novalidate_class() otherwise. The unique key can be passed either as a parameter to device_initialize() or as part of the device structure. 
It is certainly less cumbersome than having a fake lockdep_mutex array in the device structure. Cheers, Longman
[PATCH-next v5 2/4] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
Before the new slab memory controller with per object byte charging, charging and vmstat data updates happened only when new slab pages were allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reduce it, this patch caches the vmstat data in the memcg_stock_pcp structure as well, until a page size worth of updates accumulates or other cached data change. Caching the vmstat data in the per-cpu stock replaces two writes to non-hot cachelines for memcg-specific as well as memcg-lruvec-specific vmstat data with a single write to a hot local stock cacheline. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 20% (634400 out of 3243830) of the calls to mod_objcg_state() after initial boot led to an actual call to __mod_objcg_state(). When doing a parallel kernel build, the figure was about 17% (24329265 out of 142512465). So caching the vmstat data reduces the number of calls to __mod_objcg_state() by more than 80%.
Signed-off-by: Waiman Long Reviewed-by: Shakeel Butt --- mm/memcontrol.c | 86 +++-- 1 file changed, 83 insertions(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7cd7187a017c..292b4783b1a7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -782,8 +782,9 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, -enum node_stat_item idx, int nr) +static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, +struct pglist_data *pgdat, +enum node_stat_item idx, int nr) { struct mem_cgroup *memcg; struct lruvec *lruvec; @@ -791,7 +792,7 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + __mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } @@ -2059,7 +2060,10 @@ struct memcg_stock_pcp { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; + struct pglist_data *cached_pgdat; unsigned int nr_bytes; + int nr_slab_reclaimable_b; + int nr_slab_unreclaimable_b; #endif struct work_struct work; @@ -3008,6 +3012,63 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) obj_cgroup_put(objcg); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +enum node_stat_item idx, int nr) +{ + struct memcg_stock_pcp *stock; + unsigned long flags; + int *bytes; + + local_irq_save(flags); + stock = this_cpu_ptr(&memcg_stock); + + /* +* Save vmstat data in stock and skip vmstat array update unless +* accumulating over a page of vmstat data or when pgdat or idx +* changes. +*/ + if (stock->cached_objcg != objcg) { + drain_obj_stock(stock); + obj_cgroup_get(objcg); + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) + ?
atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; + stock->cached_objcg = objcg; + stock->cached_pgdat = pgdat; + } else if (stock->cached_pgdat != pgdat) { + /* Flush the existing cached vmstat data */ + if (stock->nr_slab_reclaimable_b) { + mod_objcg_mlstate(objcg, pgdat, NR_SLAB_RECLAIMABLE_B, + stock->nr_slab_reclaimable_b); + stock->nr_slab_reclaimable_b = 0; + } + if (stock->nr_slab_unreclaimable_b) { + mod_objcg_mlstate(objcg, pgdat, NR_SLAB_UNRECLAIMABLE_B, + stock->nr_slab_unreclaimable_b); + stock->nr_slab_unreclaimable_b = 0; + } + stock->cached_pgdat = pgdat; + } + + bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? &stock->nr_slab_reclaimable_b + : &stock->nr_slab_unreclaimable_b; + if (!*bytes) { + *bytes = nr; + nr = 0; + } else { + *bytes += nr; + if (abs(*bytes) > PAGE_SIZE) { + nr = *bytes; + *bytes = 0; + } else { + nr = 0; + } + } + if (nr) + mod_objcg_mlstate(objcg,
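The batching scheme in mod_objcg_state() can be modelled in userspace C. This is an illustrative sketch only, not kernel code: the per-CPU stock is a plain struct, `flush_vmstat()` is a hypothetical stand-in for the expensive `__mod_memcg_lruvec_state()` path, and the objcg/pgdat/idx bookkeeping from the real function is omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

static long vmstat_total;     /* models the shared vmstat counter  */
static long vmstat_flushes;   /* counts expensive cacheline writes */

struct obj_stock_model {
    long cached_bytes;        /* vmstat delta batched in the stock */
};

/* Model of the slow path: a write to a non-hot shared cacheline. */
static void flush_vmstat(long nr)
{
    vmstat_total += nr;
    vmstat_flushes++;
}

/*
 * Model of mod_objcg_state(): accumulate in the hot per-cpu stock and
 * only hit the shared counter once a page worth of updates piles up.
 */
static void mod_state(struct obj_stock_model *s, long nr)
{
    s->cached_bytes += nr;
    if (labs(s->cached_bytes) > PAGE_SIZE) {
        flush_vmstat(s->cached_bytes);
        s->cached_bytes = 0;
    }
}

/* Model of drain_obj_stock(): push out whatever is still cached. */
static void drain(struct obj_stock_model *s)
{
    if (s->cached_bytes) {
        flush_vmstat(s->cached_bytes);
        s->cached_bytes = 0;
    }
}
```

With 100 updates of 64 bytes each, the shared counter is touched only twice (one threshold flush plus the final drain) instead of 100 times, which is the miss-rate reduction the commit message measures. The real function also flushes when the cached objcg or pgdat changes; that logic is left out here.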
[PATCH-next v5 3/4] mm/memcg: Improve refill_obj_stock() performance
There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and do an obj_cgroup_put(). It is likely that the same obj_cgroup will be used again, which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as an atomic retrieval of the available bytes from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new overfill flag is added to refill_obj_stock() which will be set when called from obj_cgroup_charge(). A multithreaded kmalloc+kfree microbenchmark on a 2-socket 48-core 96-thread x86-64 system with 96 testing threads was run. Before this patch, the total number of kilo kmalloc+kfree operations done for a 4k large object by all the testing threads per second was 4,304 kops/s (cgroup v1) and 8,478 kops/s (cgroup v2). After applying this patch, the numbers were 4,731 (cgroup v1) and 418,142 (cgroup v2) respectively. This represents a performance improvement of 1.10X (cgroup v1) and 49.3X (cgroup v2).
Signed-off-by: Waiman Long --- mm/memcontrol.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 292b4783b1a7..2f87d0b05092 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3153,10 +3153,12 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, return false; } -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, +bool overfill) { struct memcg_stock_pcp *stock; unsigned long flags; + unsigned int nr_pages = 0; local_irq_save(flags); @@ -3165,14 +3167,20 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) drain_obj_stock(stock); obj_cgroup_get(objcg); stock->cached_objcg = objcg; - stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0); + stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) + ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; } stock->nr_bytes += nr_bytes; - if (stock->nr_bytes > PAGE_SIZE) - drain_obj_stock(stock); + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) { + nr_pages = stock->nr_bytes >> PAGE_SHIFT; + stock->nr_bytes &= (PAGE_SIZE - 1); + } local_irq_restore(flags); + + if (nr_pages) + obj_cgroup_uncharge_pages(objcg, nr_pages); } int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) @@ -3201,14 +3209,14 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) ret = obj_cgroup_charge_pages(objcg, gfp, nr_pages); if (!ret && nr_bytes) - refill_obj_stock(objcg, PAGE_SIZE - nr_bytes); + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true); return ret; } void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) { - refill_obj_stock(objcg, size); + refill_obj_stock(objcg, size, false); } #endif /* CONFIG_MEMCG_KMEM */ -- 2.18.1
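The core of the first fix, keeping the stock below a page by uncharging only the whole excess pages instead of draining the entire stock, can be sketched in isolation. This is an illustrative userspace model; `trim_stock()` is a hypothetical name for the new arithmetic inside refill_obj_stock().

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1U << PAGE_SHIFT)

/*
 * Mirrors the new refill_obj_stock() logic: if the cached bytes exceed
 * a page, return the number of whole pages to uncharge and keep only
 * the sub-page remainder in the stock. Unlike drain_obj_stock(), this
 * leaves cached_objcg (and its reference) untouched.
 */
static unsigned int trim_stock(unsigned int *nr_bytes)
{
    unsigned int nr_pages = 0;

    if (*nr_bytes > PAGE_SIZE) {
        nr_pages = *nr_bytes >> PAGE_SHIFT;
        *nr_bytes &= PAGE_SIZE - 1;
    }
    return nr_pages;
}
```

Note the strict `>` comparison: exactly one page stays cached rather than being uncharged, matching the kernel condition `stock->nr_bytes > PAGE_SIZE`.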
[PATCH-next v5 0/4] mm/memcg: Reduce kmemcache memory accounting overhead
ontended cacheline in the shared page counter >kmem. So the improvement is only modest. Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as eliminating the irq_disable/irq_enable overhead seems to aggravate the cacheline contention. [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u [2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/ Waiman Long (4): mm/memcg: Move mod_objcg_state() to memcontrol.c mm/memcg: Cache vmstat data in percpu memcg_stock_pcp mm/memcg: Improve refill_obj_stock() performance mm/memcg: Optimize user context object stock access mm/memcontrol.c | 195 +--- mm/slab.h | 16 +--- 2 files changed, 171 insertions(+), 40 deletions(-) -- 2.18.1
[PATCH-next v5 1/4] mm/memcg: Move mod_objcg_state() to memcontrol.c
The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Signed-off-by: Waiman Long Acked-by: Johannes Weiner --- mm/memcontrol.c | 13 + mm/slab.h | 16 ++-- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 64ada9e650a5..7cd7187a017c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -782,6 +782,19 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + mod_memcg_lruvec_state(lruvec, idx, nr); + rcu_read_unlock(); +} + /** * __count_memcg_events - account VM events in a cgroup * @memcg: the memory cgroup diff --git a/mm/slab.h b/mm/slab.h index a7268072f017..f1f26f22a94a 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -242,6 +242,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla #ifdef CONFIG_MEMCG_KMEM int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s, gfp_t gfp, bool new_page); +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +enum node_stat_item idx, int nr); static inline void memcg_free_page_obj_cgroups(struct page *page) { @@ -286,20 +288,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, return true; } -static inline void mod_objcg_state(struct obj_cgroup *objcg, - struct pglist_data *pgdat, - enum node_stat_item idx, int nr) -{ - struct mem_cgroup *memcg; - struct lruvec *lruvec; - - rcu_read_lock(); - memcg = obj_cgroup_memcg(objcg); - lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); - rcu_read_unlock(); -} - static inline 
void memcg_slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, gfp_t flags, size_t size, -- 2.18.1
[PATCH-next v5 4/4] mm/memcg: Optimize user context object stock access
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two sets of object stocks (in the new obj_stock structure) for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt --- mm/memcontrol.c | 94 +++-- 1 file changed, 68 insertions(+), 26 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f87d0b05092..c79b9837cb85 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -782,6 +782,10 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) rcu_read_unlock(); } +/* + * mod_objcg_mlstate() may be called with irq enabled, so + * mod_memcg_lruvec_state() should be used. 
+ */ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) @@ -792,7 +796,7 @@ static inline void mod_objcg_mlstate(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - __mod_memcg_lruvec_state(lruvec, idx, nr); + mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } @@ -2054,17 +2058,23 @@ void unlock_page_memcg(struct page *page) } EXPORT_SYMBOL(unlock_page_memcg); -struct memcg_stock_pcp { - struct mem_cgroup *cached; /* this never be root cgroup */ - unsigned int nr_pages; - +struct obj_stock { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; unsigned int nr_bytes; int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; +#else + int dummy[0]; #endif +}; + +struct memcg_stock_pcp { + struct mem_cgroup *cached; /* this never be root cgroup */ + unsigned int nr_pages; + struct obj_stock task_obj; + struct obj_stock irq_obj; struct work_struct work; unsigned long flags; @@ -2074,12 +2084,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock); static DEFINE_MUTEX(percpu_charge_mutex); #ifdef CONFIG_MEMCG_KMEM -static void drain_obj_stock(struct memcg_stock_pcp *stock); +static void drain_obj_stock(struct obj_stock *stock); static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, struct mem_cgroup *root_memcg); #else -static inline void drain_obj_stock(struct memcg_stock_pcp *stock) +static inline void drain_obj_stock(struct obj_stock *stock) { } static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, @@ -2089,6 +2099,40 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, } #endif +/* + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable + * sequence used in this case to access content from object stock is slow. 
+ * To optimize for user context access, there are now two object stocks for + * task context and interrupt context access respectively. + * + * The task context object stock can be accessed by disabling preemption only + * which is cheap in non-preempt kernel. The interrupt context object stock + * can only be accessed after disabling interrupt. User context code can + * access interrupt object stock, but not vice versa. + */ +static inline struct obj_stock *get_obj_stock(unsigned long *pflags) +{ + struct memcg_stock_pcp *stock; + + if (likely(in_task())) { + preempt_disable(); + stock = this_cpu_ptr(&memcg_stock); + return &stock->task_obj; + } else { + local_irq_save(*pflags); + stock = this_cpu_ptr(&memcg_stock); + return &stock->irq_obj; + } +} + +static inline void put_obj_stock(unsigned long flags) +{ + if (likely(in_task())) + preempt_enable(); + else + local_irq_restore(flags); +} + /** * consume_stock: Try to consume stocked charge on this cpu. * @memcg: memcg to consume from. @@ -2155,7 +2199,9 @@ static void drain_local_stock(struct work_struct *dummy) local_irq_save(flags); stock = this_cpu_ptr(&memcg_stock); - drain_obj
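The context-dependent stock selection can be modelled in userspace to show why the pairing matters. In this illustrative sketch, `in_task` is passed in explicitly instead of being queried from the kernel, and the preempt/irq toggles are replaced by counters so the get/put pairing can be checked; all names here are hypothetical stand-ins for the kernel helpers.

```c
#include <assert.h>

struct obj_stock_m { long nr_bytes; };

static struct obj_stock_m task_obj, irq_obj;
static int preempt_count, irq_disable_count;

/*
 * Model of get_obj_stock(): task context only disables preemption,
 * which is cheap on a non-preempt kernel; any other context must
 * disable interrupts before touching its stock.
 */
static struct obj_stock_m *get_obj_stock(int in_task)
{
    if (in_task) {
        preempt_count++;            /* models preempt_disable()   */
        return &task_obj;
    }
    irq_disable_count++;            /* models local_irq_save()    */
    return &irq_obj;
}

/* Model of put_obj_stock(): undo exactly what get did. */
static void put_obj_stock(int in_task)
{
    if (in_task)
        preempt_count--;            /* models preempt_enable()    */
    else
        irq_disable_count--;        /* models local_irq_restore() */
}
```

Because task context and interrupt context each get their own stock, an interrupt arriving between get and put cannot corrupt the task stock it never touches, which is what lets the fast path avoid the irq disable/enable sequence.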
Re: [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
On 4/19/21 12:55 PM, Johannes Weiner wrote: On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote: Currently, the object stock structure caches either reclaimable vmstat bytes or unreclaimable vmstat bytes. The hit rate can be improved if both types of vmstat data can be cached, especially on a single-node system. This patch supports the caching of both types of vmstat data, though at the expense of a slightly increased complexity in the caching code. For large objects (>= PAGE_SIZE), the vmstat array is updated directly without going through the stock caching step. On a 2-socket Cascade Lake server with instrumentation enabled, the miss rates are shown in the table below. Initial bootup: Kernel __mod_objcg_state mod_objcg_state %age ------ ----------------- --------------- ---- Before patch 634400 3243830 19.6% After patch 419810 3182424 13.2% Parallel kernel build: Kernel __mod_objcg_state mod_objcg_state %age ------ ----------------- --------------- ---- Before patch 24329265 142512465 17.1% After patch 24051721 142445825 16.9% There was a decrease in miss rate after initial system bootup. However, the miss rate for the parallel kernel build remained about the same, probably because most of the touched kmemcache objects were reclaimable inodes and dentries. Signed-off-by: Waiman Long --- mm/memcontrol.c | 79 +++-- 1 file changed, 51 insertions(+), 28 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c13502eab282..a6dd18f6d8a8 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2212,8 +2212,8 @@ struct obj_stock { struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; unsigned int nr_bytes; - int vmstat_idx; - int vmstat_bytes; + int reclaimable_bytes; /* NR_SLAB_RECLAIMABLE_B */ + int unreclaimable_bytes;/* NR_SLAB_UNRECLAIMABLE_B */ How about int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; so you don't need the comments? Sure, will make the change.
#else int dummy[0]; #endif @@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) { unsigned long flags; - struct obj_stock *stock = get_obj_stock(); + struct obj_stock *stock; + int *bytes, *alt_bytes, alt_idx; + + /* +* Directly update vmstat array for big object. +*/ + if (unlikely(abs(nr) >= PAGE_SIZE)) + goto update_vmstat; This looks like an optimization independent of the vmstat item split? It may not be that helpful. I am going to take it out in the next version. + stock = get_obj_stock(); + if (idx == NR_SLAB_RECLAIMABLE_B) { + bytes = >reclaimable_bytes; + alt_bytes = >unreclaimable_bytes; + alt_idx = NR_SLAB_UNRECLAIMABLE_B; + } else { + bytes = >unreclaimable_bytes; + alt_bytes = >reclaimable_bytes; + alt_idx = NR_SLAB_RECLAIMABLE_B; + } /* -* Save vmstat data in stock and skip vmstat array update unless -* accumulating over a page of vmstat data or when pgdat or idx +* Try to save vmstat data in stock and skip vmstat array update +* unless accumulating over a page of vmstat data or when pgdat * changes. */ if (stock->cached_objcg != objcg) { /* Output the current data as is */ - } else if (!stock->vmstat_bytes) { - /* Save the current data */ - stock->vmstat_bytes = nr; - stock->vmstat_idx = idx; - stock->cached_pgdat = pgdat; - nr = 0; - } else if ((stock->cached_pgdat != pgdat) || - (stock->vmstat_idx != idx)) { - /* Output the cached data & save the current data */ - swap(nr, stock->vmstat_bytes); - swap(idx, stock->vmstat_idx); + } else if (stock->cached_pgdat != pgdat) { + /* Save the current data and output cached data, if any */ + swap(nr, *bytes); swap(pgdat, stock->cached_pgdat); + if (*alt_bytes) { + __mod_objcg_state(objcg, pgdat, alt_idx, + *alt_bytes); + *alt_bytes = 0; + } As per the other email, I really don't think optimizing the pgdat switch (in a percpu cache) is worth this level of complexity. I am going to simplify it in the next version. 
} else { - stock->vmstat_bytes += nr
Re: [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
On 4/19/21 12:38 PM, Johannes Weiner wrote: On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote: Before the new slab memory controller with per object byte charging, charging and vmstat data updates happened only when new slab pages were allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reduce it, this patch caches the vmstat data in the memcg_stock_pcp structure as well, until a page size worth of updates accumulates or other cached data change. Caching the vmstat data in the per-cpu stock replaces two writes to non-hot cachelines for memcg-specific as well as memcg-lruvec-specific vmstat data with a single write to a hot local stock cacheline. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 20% (634400 out of 3243830) of the calls to mod_objcg_state() after initial boot led to an actual call to __mod_objcg_state(). When doing a parallel kernel build, the figure was about 17% (24329265 out of 142512465). So caching the vmstat data reduces the number of calls to __mod_objcg_state() by more than 80%.
Signed-off-by: Waiman Long Reviewed-by: Shakeel Butt --- mm/memcontrol.c | 64 ++--- 1 file changed, 61 insertions(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dc9032f28f2e..693453f95d99 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; + struct pglist_data *cached_pgdat; unsigned int nr_bytes; + int vmstat_idx; + int vmstat_bytes; #endif struct work_struct work; @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, -enum node_stat_item idx, int nr) +static inline void __mod_objcg_state(struct obj_cgroup *objcg, +struct pglist_data *pgdat, +enum node_stat_item idx, int nr) This naming is dangerous, as the __mod_foo naming scheme we use everywhere else suggests it's the same function as mod_foo() just with preemption/irqs disabled. I will change its name to, say, mod_objcg_mlstate() to indicate that it is something different. Actually, it is hard to come up with a good name which is not too long. @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + __mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +enum node_stat_item idx, int nr) +{ + struct memcg_stock_pcp *stock; + unsigned long flags; + + local_irq_save(flags); + stock = this_cpu_ptr(_stock); + + /* +* Save vmstat data in stock and skip vmstat array update unless +* accumulating over a page of vmstat data or when pgdat or idx +* changes. 
+*/ + if (stock->cached_objcg != objcg) { + /* Output the current data as is */ When you get here with the wrong objcg and hit the cold path, it's usually immediately followed by an uncharge -> refill_obj_stock() that will then flush and reset cached_objcg. Instead of doing two cold paths, why not flush the old objcg right away and set the new so that refill_obj_stock() can use the fast path? That is a good idea. Will do that. + } else if (!stock->vmstat_bytes) { + /* Save the current data */ + stock->vmstat_bytes = nr; + stock->vmstat_idx = idx; + stock->cached_pgdat = pgdat; + nr = 0; + } else if ((stock->cached_pgdat != pgdat) || + (stock->vmstat_idx != idx)) { + /* Output the cached data & save the current data */ + swap(nr, stock->vmstat_bytes); + swap(idx, stock->vmstat_idx); + swap(pgdat, stock->cached_pgdat); Is this optimization worth doing? You later split vmstat_bytes and idx doesn't change anymore. I am going to merge patch 2 and patch 4 to avoid the confusion. How often does the pgdat change? This is a per-cpu cache after all, and the numa node a given cpu allocates from tends to not change that often. Even with interleaving mode, which I think is pretty rare, the interleaving happens at t
Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
On 4/19/21 5:11 PM, Johannes Weiner wrote: BTW, have you ever thought of moving the cgroup-v1 specific functions out into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c? I thought of that before, but memcontrol.c is a frequently changed file and so a bit hard to do. I haven't looked too deeply at it so far, but I think it would make sense to try. There are indeed many of the entry paths from the MM code that are shared between cgroup1 and cgroup2, with smaller branches here and there to adjust behavior. Those would throw conflicts, but those we should probably keep in the main memcontrol.c for readability anyway. But there is also plenty of code that is exclusively about cgroup1, and which actually doesn't change much in a long time. Moving that elsewhere shouldn't create difficult conflicts - maybe a few line offset warnings or fuzz in the diff context of unrelated changes: - the soft limit tree and soft limit reclaim - the threshold and oom event notification stuff - the charge moving code - remaining v1 interface files, as well as their helper functions From a quick scan, this adds up to ~2,500 lines of old code with no actual dependencies from the common code or from v2, and which could be moved out of the way without disrupting ongoing development much. Right. Currently memcontrol.c has over 7000 lines of code and keeps growing. That makes it harder to read, navigate and update. If we can cut out 2000 lines or more from memcontrol.c, it will make it more manageable. Cheers, Longman
Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
On 4/19/21 1:19 PM, Waiman Long wrote: On 4/19/21 1:13 PM, Johannes Weiner wrote: On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote: On 4/19/21 11:21 AM, Waiman Long wrote: On 4/19/21 11:14 AM, Johannes Weiner wrote: On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote: The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Signed-off-by: Waiman Long --- mm/memcontrol.c | 13 + mm/slab.h | 16 ++-- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..dc9032f28f2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, + enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec = NULL; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + mod_memcg_lruvec_state(lruvec, idx, nr); + rcu_read_unlock(); +} It would be more naturally placed next to the others, e.g. below __mod_lruvec_kmem_state(). But no deal breaker if there isn't another revision. Acked-by: Johannes Weiner Yes, there will be another revision by rebasing patch series on the linux-next. I will move the function then. OK, it turns out that mod_objcg_state() is only defined if CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with the other obj_stock functions. I think I will keep it there. The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And even that doesn't make sense because while slob doesn't support slab object tracking, we can still do all the other stuff we do under KMEM. 
I have a patch in the works to remove the symbol and ifdefs. With that in mind, it's better to group the functions based on what they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just remove another ifdef later than it is to reorder the functions. OK, I will make the move then. Thanks for the explanation. BTW, have you ever thought of moving the cgroup-v1 specific functions out into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c? I thought of that before, but memcontrol.c is a frequently changed file and so a bit hard to do. Cheers, Longman
Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
On 4/19/21 1:13 PM, Johannes Weiner wrote: On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote: On 4/19/21 11:21 AM, Waiman Long wrote: On 4/19/21 11:14 AM, Johannes Weiner wrote: On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote: The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Signed-off-by: Waiman Long --- mm/memcontrol.c | 13 + mm/slab.h | 16 ++-- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..dc9032f28f2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, + enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec = NULL; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + mod_memcg_lruvec_state(lruvec, idx, nr); + rcu_read_unlock(); +} It would be more naturally placed next to the others, e.g. below __mod_lruvec_kmem_state(). But no deal breaker if there isn't another revision. Acked-by: Johannes Weiner Yes, there will be another revision by rebasing patch series on the linux-next. I will move the function then. OK, it turns out that mod_objcg_state() is only defined if CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with the other obj_stock functions. I think I will keep it there. The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And even that doesn't make sense because while slob doesn't support slab object tracking, we can still do all the other stuff we do under KMEM. I have a patch in the works to remove the symbol and ifdefs. 
With that in mind, it's better to group the functions based on what they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just remove another ifdef later than it is to reorder the functions. OK, I will make the move then. Thanks for the explanation. Cheers, Longman
Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
On 4/19/21 11:21 AM, Waiman Long wrote: On 4/19/21 11:14 AM, Johannes Weiner wrote: On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote: The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Signed-off-by: Waiman Long --- mm/memcontrol.c | 13 + mm/slab.h | 16 ++-- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..dc9032f28f2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, + enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec = NULL; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + mod_memcg_lruvec_state(lruvec, idx, nr); + rcu_read_unlock(); +} It would be more naturally placed next to the others, e.g. below __mod_lruvec_kmem_state(). But no deal breaker if there isn't another revision. Acked-by: Johannes Weiner Yes, there will be another revision by rebasing patch series on the linux-next. I will move the function then. OK, it turns out that mod_objcg_state() is only defined if CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with the other obj_stock functions. I think I will keep it there. Thanks, Longman
Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
On 4/19/21 2:06 AM, Muchun Song wrote: On Mon, Apr 19, 2021 at 8:01 AM Waiman Long wrote: There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and do a obj_cgroup_put(). It is likely that the same obj_cgroup will be used again which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as atomically retrieve the available byte from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new overfill flag is added to refill_obj_stock() which will be set when called from obj_cgroup_charge(). 
Signed-off-by: Waiman Long --- mm/memcontrol.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a6dd18f6d8a8..d13961352eef 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, return false; } -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, +bool overfill) { unsigned long flags; struct obj_stock *stock = get_obj_stock(); + unsigned int nr_pages = 0; if (stock->cached_objcg != objcg) { /* reset if necessary */ - drain_obj_stock(stock); + if (stock->cached_objcg) + drain_obj_stock(stock); obj_cgroup_get(objcg); stock->cached_objcg = objcg; stock->nr_bytes = atomic_xchg(>nr_charged_bytes, 0); } stock->nr_bytes += nr_bytes; - if (stock->nr_bytes > PAGE_SIZE) - drain_obj_stock(stock); + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) { + nr_pages = stock->nr_bytes >> PAGE_SHIFT; + stock->nr_bytes &= (PAGE_SIZE - 1); + } put_obj_stock(flags); + + if (nr_pages) { + rcu_read_lock(); + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages); + rcu_read_unlock(); + } It is not safe to call __memcg_kmem_uncharge() under rcu lock and without holding a reference to memcg. More details can refer to the following link. https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuc...@bytedance.com/ In the above patchset, we introduce obj_cgroup_uncharge_pages to uncharge some pages from object cgroup. You can use this safe API. Thanks for the comment. Will update my patch with call to obj_cgroup_uncharge_pages(). Cheers, Longman
Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
On 4/19/21 11:14 AM, Johannes Weiner wrote: On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote: The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components. Signed-off-by: Waiman Long --- mm/memcontrol.c | 13 + mm/slab.h | 16 ++-- 2 files changed, 15 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..dc9032f28f2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec = NULL; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + lruvec = mem_cgroup_lruvec(memcg, pgdat); + mod_memcg_lruvec_state(lruvec, idx, nr); + rcu_read_unlock(); +} It would be more naturally placed next to the others, e.g. below __mod_lruvec_kmem_state(). But no deal breaker if there isn't another revision. Acked-by: Johannes Weiner Yes, there will be another revision by rebasing patch series on the linux-next. I will move the function then. Cheers, Longman
Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
On 4/19/21 11:00 AM, Shakeel Butt wrote: On Sun, Apr 18, 2021 at 11:07 PM Muchun Song wrote: On Mon, Apr 19, 2021 at 8:01 AM Waiman Long wrote: There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out remaining bytes to obj_cgroup, clear cached_objcg and do a obj_cgroup_put(). It is likely that the same obj_cgroup will be used again which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as atomically retrieve the available byte from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new overfill flag is added to refill_obj_stock() which will be set when called from obj_cgroup_charge(). 
Signed-off-by: Waiman Long --- mm/memcontrol.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a6dd18f6d8a8..d13961352eef 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, return false; } -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, +bool overfill) { unsigned long flags; struct obj_stock *stock = get_obj_stock(); + unsigned int nr_pages = 0; if (stock->cached_objcg != objcg) { /* reset if necessary */ - drain_obj_stock(stock); + if (stock->cached_objcg) + drain_obj_stock(stock); obj_cgroup_get(objcg); stock->cached_objcg = objcg; stock->nr_bytes = atomic_xchg(>nr_charged_bytes, 0); } stock->nr_bytes += nr_bytes; - if (stock->nr_bytes > PAGE_SIZE) - drain_obj_stock(stock); + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) { + nr_pages = stock->nr_bytes >> PAGE_SHIFT; + stock->nr_bytes &= (PAGE_SIZE - 1); + } put_obj_stock(flags); + + if (nr_pages) { + rcu_read_lock(); + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages); + rcu_read_unlock(); + } It is not safe to call __memcg_kmem_uncharge() under rcu lock and without holding a reference to memcg. More details can refer to the following link. https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuc...@bytedance.com/ In the above patchset, we introduce obj_cgroup_uncharge_pages to uncharge some pages from object cgroup. You can use this safe API. I would recommend just rebase the patch series over the latest mm tree. I see, I will rebase it to the latest mm tree. Thanks, Longman
[PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
Currently, the object stock structure caches either reclaimable vmstat bytes or unreclaimable vmstat bytes. The hit rate can be improved if both types of vmstat data can be cached, especially for a single-node system. This patch supports the caching of both types of vmstat data, though at the expense of a slightly increased complexity in the caching code. For a large object (>= PAGE_SIZE), the vmstat array is updated directly without going through the stock caching step.

On a 2-socket Cascade Lake server with instrumentation enabled, the miss rates are shown in the table below.

Initial bootup:

  Kernel        __mod_objcg_state   mod_objcg_state   %age
  ------        -----------------   ---------------   ----
  Before patch        634400            3243830       19.6%
  After patch         419810            3182424       13.2%

Parallel kernel build:

  Kernel        __mod_objcg_state   mod_objcg_state   %age
  ------        -----------------   ---------------   ----
  Before patch      24329265          142512465       17.1%
  After patch       24051721          142445825       16.9%

There was a decrease in miss rate after initial system bootup. However, the miss rate for the parallel kernel build remained about the same, probably because most of the touched kmemcache objects were reclaimable inodes and dentries.

Signed-off-by: Waiman Long
---
 mm/memcontrol.c | 79 +++--
 1 file changed, 51 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c13502eab282..a6dd18f6d8a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2212,8 +2212,8 @@ struct obj_stock {
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
-	int vmstat_idx;
-	int vmstat_bytes;
+	int reclaimable_bytes;		/* NR_SLAB_RECLAIMABLE_B */
+	int unreclaimable_bytes;	/* NR_SLAB_UNRECLAIMABLE_B */
 #else
 	int dummy[0];
 #endif
@@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
+	int *bytes, *alt_bytes, alt_idx;
+
+	/*
+	 * Directly update vmstat array for big object.
+	 */
+	if (unlikely(abs(nr) >= PAGE_SIZE))
+		goto update_vmstat;
+
+	stock = get_obj_stock(&flags);
+	if (idx == NR_SLAB_RECLAIMABLE_B) {
+		bytes = &stock->reclaimable_bytes;
+		alt_bytes = &stock->unreclaimable_bytes;
+		alt_idx = NR_SLAB_UNRECLAIMABLE_B;
+	} else {
+		bytes = &stock->unreclaimable_bytes;
+		alt_bytes = &stock->reclaimable_bytes;
+		alt_idx = NR_SLAB_RECLAIMABLE_B;
+	}

 	/*
-	 * Save vmstat data in stock and skip vmstat array update unless
-	 * accumulating over a page of vmstat data or when pgdat or idx
+	 * Try to save vmstat data in stock and skip vmstat array update
+	 * unless accumulating over a page of vmstat data or when pgdat
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-	} else if (!stock->vmstat_bytes) {
-		/* Save the current data */
-		stock->vmstat_bytes = nr;
-		stock->vmstat_idx = idx;
-		stock->cached_pgdat = pgdat;
-		nr = 0;
-	} else if ((stock->cached_pgdat != pgdat) ||
-		   (stock->vmstat_idx != idx)) {
-		/* Output the cached data & save the current data */
-		swap(nr, stock->vmstat_bytes);
-		swap(idx, stock->vmstat_idx);
+	} else if (stock->cached_pgdat != pgdat) {
+		/* Save the current data and output cached data, if any */
+		swap(nr, *bytes);
 		swap(pgdat, stock->cached_pgdat);
+		if (*alt_bytes) {
+			__mod_objcg_state(objcg, pgdat, alt_idx,
+					  *alt_bytes);
+			*alt_bytes = 0;
+		}
 	} else {
-		stock->vmstat_bytes += nr;
-		if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
-			nr = stock->vmstat_bytes;
-			stock->vmstat_bytes = 0;
+		*bytes += nr;
+		if (abs(*bytes) > PAGE_SIZE) {
+			nr = *bytes;
+			*bytes = 0;
 		} else {
 			nr = 0;
 		}
 	}
-	if (nr)
-		__mod_objcg_state(objcg, pgdat, idx, nr);
-
 	put_obj_stock(flags);
+
+	if (!nr)
+		return;
+update_vmstat:
+	__mod_objcg_state(objcg, pgdat, idx, nr);
[PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance
There are two issues with the current refill_obj_stock() code. First of all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to atomically flush out the remaining bytes to obj_cgroup, clear cached_objcg and do an obj_cgroup_put(). It is likely that the same obj_cgroup will be used again, which leads to another call to drain_obj_stock() and obj_cgroup_get() as well as an atomic retrieval of the available bytes from obj_cgroup. That is costly. Instead, we should just uncharge the excess pages, reduce the stock bytes and be done with it. The drain_obj_stock() function should only be called when obj_cgroup changes. Secondly, when charging an object of size not less than a page in obj_cgroup_charge(), it is possible that the remaining bytes to be refilled to the stock will overflow a page and cause refill_obj_stock() to uncharge 1 page. To avoid the additional uncharge in this case, a new overfill flag is added to refill_obj_stock() which will be set when called from obj_cgroup_charge().
Signed-off-by: Waiman Long
---
 mm/memcontrol.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6dd18f6d8a8..d13961352eef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+			     bool overfill)
 {
 	unsigned long flags;
 	struct obj_stock *stock = get_obj_stock(&flags);
+	unsigned int nr_pages = 0;
 
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		if (stock->cached_objcg)
+			drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
 	}
 	stock->nr_bytes += nr_bytes;
 
-	if (stock->nr_bytes > PAGE_SIZE)
-		drain_obj_stock(stock);
+	if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		stock->nr_bytes &= (PAGE_SIZE - 1);
+	}
 
 	put_obj_stock(flags);
+
+	if (nr_pages) {
+		rcu_read_lock();
+		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+		rcu_read_unlock();
+	}
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
 
 	if (!ret && nr_bytes)
-		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
 
 	css_put(&memcg->css);
 	return ret;
@@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 {
-	refill_obj_stock(objcg, size);
+	refill_obj_stock(objcg, size, false);
 }
 
 #endif /* CONFIG_MEMCG_KMEM */
-- 
2.18.1
[PATCH v4 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
v4:
 - Drop v3 patch 1 as well as the modification to mm/percpu.c, as the percpu vmstat update isn't frequent enough to be worth caching.
 - Add a new patch 1 to move mod_objcg_state() to memcontrol.c instead.
 - Combine v3 patches 4 & 5 into a single patch (patch 3).
 - Add a new patch 4 to cache both reclaimable & unreclaimable vmstat updates.
 - Add a new patch 5 to improve refill_obj_stock() performance.

v3:
 - Add missing "inline" qualifier to the alternate mod_obj_stock_state() in patch 3.
 - Remove redundant current_obj_stock() call in patch 5.

v2:
 - Fix bug found by test robot in patch 5.
 - Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). For workloads that require a lot of kmemcache allocations and de-allocations, they may experience performance regression as illustrated in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls of either a small 32-byte object or a big 4k object at module init time is used for benchmarking. The test was run on a CascadeLake server with turbo-boosting disabled to reduce run-to-run variation. With memory accounting disabled, the run time was 2.848s for the small object and 2.890s for the big object.
With memory accounting enabled, the run times with the application of various patches in the patchset were:

  Applied patches   Run time   Accounting overhead   %age 1   %age 2
  ---------------   --------   -------------------   ------   ------
  Small 32-byte object:
    None             10.570s         7.722s          100.0%   271.1%
    1-2               8.560s         5.712s           74.0%   200.6%
    1-3               6.592s         3.744s           48.5%   131.5%
    1-4               7.154s         4.306s           55.8%   151.2%
    1-5               7.192s         4.344s           56.3%   152.5%

  Large 4k object:
    None             20.612s        17.722s          100.0%   613.2%
    1-2              20.354s        17.464s           98.5%   604.3%
    1-3              19.395s        16.505s           93.1%   571.1%
    1-4              19.094s        16.204s           91.4%   560.7%
    1-5              13.576s        10.686s           60.3%   369.8%

  N.B. %age 1 = overhead/unpatched overhead
       %age 2 = overhead/accounting disabled time

The small object test exercises mainly the object stock charging and vmstat update code paths. The large object test also exercises the refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge() code paths. The vmstat data stock caching helps in the small object test, but not so much in the large object test. Similarly, eliminating irq_disable/irq_enable helps in the small object test and less so in the large object test. Caching both reclaimable and non-reclaimable vmstat data actually regresses performance a bit in this particular small object test. The final patch to optimize refill_obj_stock() has negligible impact on the small object test as this code path isn't being exercised. The large object test, however, sees a pretty good performance improvement with this patch.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Move mod_objcg_state() to memcontrol.c
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Optimize user context object stock access
  mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
  mm/memcg: Improve refill_obj_stock() performance

 mm/memcontrol.c | 199 +---
 mm/slab.h | 16 +---
 2 files changed, 175 insertions(+), 40 deletions(-)

-- 
2.18.1
[PATCH v4 3/5] mm/memcg: Optimize user context object stock access
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two sets of object stocks (in the new obj_stock structure) for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt --- mm/memcontrol.c | 94 +++-- 1 file changed, 68 insertions(+), 26 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 693453f95d99..c13502eab282 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2207,17 +2207,23 @@ void unlock_page_memcg(struct page *page) } EXPORT_SYMBOL(unlock_page_memcg); -struct memcg_stock_pcp { - struct mem_cgroup *cached; /* this never be root cgroup */ - unsigned int nr_pages; - +struct obj_stock { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; unsigned int nr_bytes; int vmstat_idx; int vmstat_bytes; +#else + int dummy[0]; #endif +}; + +struct memcg_stock_pcp { + struct mem_cgroup *cached; /* this never be root cgroup */ + unsigned int nr_pages; + struct obj_stock task_obj; + struct obj_stock irq_obj; struct work_struct work; unsigned long flags; @@ -2227,12 +2233,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock); static DEFINE_MUTEX(percpu_charge_mutex); #ifdef 
CONFIG_MEMCG_KMEM -static void drain_obj_stock(struct memcg_stock_pcp *stock); +static void drain_obj_stock(struct obj_stock *stock); static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, struct mem_cgroup *root_memcg); #else -static inline void drain_obj_stock(struct memcg_stock_pcp *stock) +static inline void drain_obj_stock(struct obj_stock *stock) { } static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, @@ -2242,6 +2248,40 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, } #endif +/* + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable + * sequence used in this case to access content from object stock is slow. + * To optimize for user context access, there are now two object stocks for + * task context and interrupt context access respectively. + * + * The task context object stock can be accessed by disabling preemption only + * which is cheap in non-preempt kernel. The interrupt context object stock + * can only be accessed after disabling interrupt. User context code can + * access interrupt object stock, but not vice versa. + */ +static inline struct obj_stock *get_obj_stock(unsigned long *pflags) +{ + struct memcg_stock_pcp *stock; + + if (likely(in_task())) { + preempt_disable(); + stock = this_cpu_ptr(_stock); + return >task_obj; + } else { + local_irq_save(*pflags); + stock = this_cpu_ptr(_stock); + return >irq_obj; + } +} + +static inline void put_obj_stock(unsigned long flags) +{ + if (likely(in_task())) + preempt_enable(); + else + local_irq_restore(flags); +} + /** * consume_stock: Try to consume stocked charge on this cpu. * @memcg: memcg to consume from. 
@@ -2308,7 +2348,9 @@ static void drain_local_stock(struct work_struct *dummy) local_irq_save(flags); stock = this_cpu_ptr(_stock); - drain_obj_stock(stock); + drain_obj_stock(>irq_obj); + if (in_task()) + drain_obj_stock(>task_obj); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, >flags); @@ -3153,6 +3195,10 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } +/* + * __mod_objcg_state() may be called with irq enabled, so + * mod_memcg_lruvec_state() should be used. + */ static inline void __mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) @@ -3163,18 +3209,15 @@ static inline void __mod_objcg_state(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objc
[PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
Before the new slab memory controller with per-object byte charging, charging and vmstat data updates happened only when new slab pages were allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache the byte charge for a specific obj_cgroup to reduce that overhead. To further reduce it, this patch caches the vmstat data in the memcg_stock_pcp structure as well, until it accumulates a page size worth of updates or until other cached data change. Caching the vmstat data in the per-cpu stock replaces two writes to non-hot cachelines (for the memcg-specific and memcg-lruvec-specific vmstat data) with a single write to a hot local stock cacheline.

On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 20% (634400 out of 3243830) of mod_objcg_state() calls led to an actual call to __mod_objcg_state() after initial boot. When doing a parallel kernel build, the figure was about 17% (24329265 out of 142512465). So caching the vmstat data reduces the number of calls to __mod_objcg_state() by more than 80%.
Signed-off-by: Waiman Long Reviewed-by: Shakeel Butt --- mm/memcontrol.c | 64 ++--- 1 file changed, 61 insertions(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dc9032f28f2e..693453f95d99 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; + struct pglist_data *cached_pgdat; unsigned int nr_bytes; + int vmstat_idx; + int vmstat_bytes; #endif struct work_struct work; @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(>css); } -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, -enum node_stat_item idx, int nr) +static inline void __mod_objcg_state(struct obj_cgroup *objcg, +struct pglist_data *pgdat, +enum node_stat_item idx, int nr) { struct mem_cgroup *memcg; struct lruvec *lruvec = NULL; @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + __mod_memcg_lruvec_state(lruvec, idx, nr); rcu_read_unlock(); } +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +enum node_stat_item idx, int nr) +{ + struct memcg_stock_pcp *stock; + unsigned long flags; + + local_irq_save(flags); + stock = this_cpu_ptr(_stock); + + /* +* Save vmstat data in stock and skip vmstat array update unless +* accumulating over a page of vmstat data or when pgdat or idx +* changes. 
+*/ + if (stock->cached_objcg != objcg) { + /* Output the current data as is */ + } else if (!stock->vmstat_bytes) { + /* Save the current data */ + stock->vmstat_bytes = nr; + stock->vmstat_idx = idx; + stock->cached_pgdat = pgdat; + nr = 0; + } else if ((stock->cached_pgdat != pgdat) || + (stock->vmstat_idx != idx)) { + /* Output the cached data & save the current data */ + swap(nr, stock->vmstat_bytes); + swap(idx, stock->vmstat_idx); + swap(pgdat, stock->cached_pgdat); + } else { + stock->vmstat_bytes += nr; + if (abs(stock->vmstat_bytes) > PAGE_SIZE) { + nr = stock->vmstat_bytes; + stock->vmstat_bytes = 0; + } else { + nr = 0; + } + } + if (nr) + __mod_objcg_state(objcg, pgdat, idx, nr); + + local_irq_restore(flags); +} + static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) { struct memcg_stock_pcp *stock; @@ -3213,6 +3260,17 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) stock->nr_bytes = 0; } + /* +* Flush the vmstat data in current stock +*/ + if (stock->vmstat_bytes) { + __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx, + stock->vmstat_bytes); + stock->cached_pgdat = NULL; + stock->vmstat_bytes = 0; + stock->vmstat_idx = 0; + } + obj_cgroup_put(old); stock->cached_objcg = NULL; } -- 2.18.1
[PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c
The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c so that further optimization can be done to it in later patches without exposing unnecessary details to other mm components.

Signed-off-by: Waiman Long
---
 mm/memcontrol.c | 13 +
 mm/slab.h | 16 ++--
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+		     enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	mod_memcg_lruvec_state(lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..ae8b85875426 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
 #ifdef CONFIG_MEMCG_KMEM
 int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
 				 gfp_t gfp, bool new_page);
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+		     enum node_stat_item idx, int nr);
 
 static inline void memcg_free_page_obj_cgroups(struct page *page)
 {
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 	return true;
 }
 
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
-				   struct pglist_data *pgdat,
-				   enum node_stat_item idx, int nr)
-{
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec;
-
-	rcu_read_lock();
-	memcg = obj_cgroup_memcg(objcg);
-	lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	mod_memcg_lruvec_state(lruvec, idx, nr);
-	rcu_read_unlock();
-}
-
 static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
					      struct obj_cgroup *objcg,
					      gfp_t flags, size_t size,
-- 
2.18.1
[PATCH v5] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other mean). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupt disabled for the whole duration of the calls to print_*_stats() and print_rq() which could last for the quite some time if the information dump happens on the serial console. If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actually serial console implementation of the system. For instance, [ 7809.796262] Kernel panic - not syncing: Hard LOCKUP [ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G I - - - 4.18.0-301.el8.x86_64 #1 [ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018 [ 7809.796265] Call Trace: [ 7809.796265] [ 7809.796266] dump_stack+0x5c/0x80 [ 7809.796266] panic+0xe7/0x2a9 [ 7809.796267] nmi_panic.cold.9+0xc/0xc [ 7809.796267] watchdog_overflow_callback.cold.7+0x5c/0x70 [ 7809.796268] __perf_event_overflow+0x52/0xf0 [ 7809.796268] handle_pmi_common+0x204/0x2a0 [ 7809.796269] ? __set_pte_vaddr+0x32/0x50 [ 7809.796269] ? __native_set_fixmap+0x24/0x30 [ 7809.796270] ? 
ghes_copy_tofrom_phys+0xd3/0x1c0 [ 7809.796271] intel_pmu_handle_irq+0xbf/0x160 [ 7809.796271] perf_event_nmi_handler+0x2d/0x50 [ 7809.796272] nmi_handle+0x63/0x110 [ 7809.796272] default_do_nmi+0x49/0x100 [ 7809.796273] do_nmi+0x17e/0x1e0 [ 7809.796273] end_repeat_nmi+0x16/0x6f [ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 [ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002 [ 7809.796277] RAX: 0101 RBX: 0246 RCX: [ 7809.796278] RDX: RSI: RDI: 936b66d0 [ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f [ 7809.796279] R10: R11: aa54cd887cc0 R12: 907fd0a29ec0 [ 7809.796280] R13: R14: 926ab7c0 R15: [ 7809.796280] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] [ 7809.796282] _raw_spin_lock_irqsave+0x32/0x40 [ 7809.796283] print_cpu+0x261/0x7c0 [ 7809.796283] sysrq_sched_debug_show+0x34/0x50 [ 7809.796284] sysrq_handle_showstate+0xc/0x20 [ 7809.796284] __handle_sysrq.cold.11+0x48/0xfb [ 7809.796285] write_sysrq_trigger+0x2b/0x30 [ 7809.796285] proc_reg_write+0x39/0x60 [ 7809.796286] vfs_write+0xa5/0x1a0 [ 7809.796286] ksys_write+0x4f/0xb0 [ 7809.796287] do_syscall_64+0x5b/0x1a0 [ 7809.796287] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 7809.796288] RIP: 0033:0x7fabe4ceb648 The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't need serialization from sched_debug_lock. Calling printk() with interrupt disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible because of the limited size of the kernel stack. 
The solution implemented in this patch is to allow only one caller at a time to use the full size group_path[], while other simultaneous callers will have to use shorter stack buffers with the possibility of path name truncation. A "..." suffix will be printed if truncation may have happened. The cgroup path name is provided for informational purpose only, so occasional path name truncation should not be a big problem.

Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Suggested-by: Peter Zijlstra
Signed-off-by: Waiman Long
---
 kernel/sched/debug.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..9c8b3ed2199a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,16 +468,37 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
+static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
 
-static char *task_group_path(struct task_group *tg)
+static void task_group_path(struct
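The diff above is truncated, but the trylock-with-fallback idea can be illustrated outside the kernel. Below is a hypothetical userspace sketch: an atomic flag stands in for spin_trylock() on sched_debug_lock, and the buffer sizes are made up for the demo. One caller at a time gets the full-size group_path[]; contended callers fall back to a short stack buffer and append "..." when truncation may have occurred.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define BIG_LEN   4096	/* stands in for PATH_MAX */
#define SMALL_LEN 129	/* hypothetical short stack buffer */

/* spin_trylock() stand-in for the demo */
static atomic_flag sched_debug_lock = ATOMIC_FLAG_INIT;
static char group_path[BIG_LEN];

/*
 * Copy 'path' into 'out'. Only one caller at a time may use the
 * full-size group_path[]; contended callers use a short buffer and
 * mark possible truncation with a "..." suffix.
 */
static void task_group_path(const char *path, char *out, size_t outlen)
{
	if (!atomic_flag_test_and_set(&sched_debug_lock)) {
		/* exclusive access to the big buffer */
		snprintf(group_path, sizeof(group_path), "%s", path);
		snprintf(out, outlen, "%s", group_path);
		atomic_flag_clear(&sched_debug_lock);
	} else {
		char buf[SMALL_LEN];
		int n = snprintf(buf, sizeof(buf), "%s", path);

		if (n >= (int)sizeof(buf))	/* may have been truncated */
			snprintf(out, outlen, "%s...", buf);
		else
			snprintf(out, outlen, "%s", buf);
	}
}
```

snprintf()'s return value (the length the full string would have had) is what detects possible truncation, matching the "may have happened" wording above.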
Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
On 4/15/21 3:40 PM, Johannes Weiner wrote: On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote: On 4/15/21 2:10 PM, Johannes Weiner wrote: On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote: On 4/15/21 12:30 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote: In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge() is followed by mod_objcg_state()/mod_memcg_state(). Each of these function call goes through a separate irq_save/irq_restore cycle. That is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state() that combines them with a single irq_save/irq_restore cycle. @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) refill_obj_stock(objcg, size); } +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size, + struct pglist_data *pgdat, int idx) The optimization makes sense. But please don't combine independent operations like this into a single function. It makes for an unclear parameter list, it's a pain in the behind to change the constituent operations later on, and it has a habit of attracting more random bools over time. E.g. what if the caller already has irqs disabled? What if it KNOWS that irqs are enabled and it could use local_irq_disable() instead of save? Just provide an __obj_cgroup_uncharge() that assumes irqs are disabled, combine with the existing __mod_memcg_lruvec_state(), and bubble the irq handling up to those callsites which know better. That will also work. However, the reason I did that was because of patch 5 in the series. I could put the get_obj_stock() and put_obj_stock() code in slab.h and allowed them to be used directly in various places, but hiding in one function is easier. Yeah it's more obvious after getting to patch 5. But with the irq disabling gone entirely, is there still an incentive to combine the atomic section at all? 
Disabling preemption is pretty cheap, so it wouldn't matter to just do it twice. I.e. couldn't the final sequence in slab code simply be objcg_uncharge() mod_objcg_state() again and each function disables preemption (and in the rare case irqs) as it sees fit? You lose the irqsoff batching in the cold path, but as you say, hit rates are pretty good, and it doesn't seem worth complicating the code for the cold path. That does make sense, though a little bit of performance may be lost. I will try that out to see how it works out performance-wise. Thanks. Even if we still end up doing it, it's great to have that cost isolated, so we know how much extra code complexity corresponds to how much performance gain. It seems the task/irq split could otherwise be a pretty localized change with no API implications. I still want to move the mod_objcg_state() function to memcontrol.c, though, as I don't want to put any obj_stock stuff in mm/slab.h. Cheers, Longman
Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
On 4/15/21 2:53 PM, Johannes Weiner wrote: On Thu, Apr 15, 2021 at 02:16:17PM -0400, Waiman Long wrote: On 4/15/21 1:53 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote: Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two object stocks for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupted disabled. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt This makes sense, and also explains the previous patch a bit better. But please merge those two. The reason I broke it into two is so that the patches are individually easier to review. I prefer to update the commit log of patch 4 to explain why the obj_stock structure is introduced instead of merging the two. Well I did not find them easier to review separately. 
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy) local_irq_save(flags); stock = this_cpu_ptr(_stock); - drain_obj_stock(>obj); + drain_obj_stock(>irq_obj); + if (in_task()) + drain_obj_stock(>task_obj); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, >flags); @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg, memcg = obj_cgroup_memcg(objcg); if (pgdat) lruvec = mem_cgroup_lruvec(memcg, pgdat); - __mod_memcg_lruvec_state(memcg, lruvec, idx, nr); + mod_memcg_lruvec_state(memcg, lruvec, idx, nr); rcu_read_unlock(); This is actually a bug introduced in the earlier patch, isn't it? Calling __mod_memcg_lruvec_state() without irqs disabled... Not really, in patch 3, mod_objcg_state() is called only in the stock update context where interrupt had already been disabled. But now, that is no longer the case, that is why i need to update mod_objcg_state() to make sure irq is disabled before updating vmstat data array. Oh, I see it now. Man, that's subtle. We've had several very hard to track down preemption bugs in those stats, because they manifest as counter imbalances and you have no idea if there is a leak somewhere. The convention for these functions is that the __ prefix indicates that preemption has been suitably disabled. Please always follow this convention, even if the semantic change is temporary. I see. I will fix that in the next version. Btw, is there a reason why the stock caching isn't just part of mod_objcg_state()? Why does the user need to choose if they want the caching or not? It's not like we ask for this when charging, either. Yes, I can revert it to make mod_objcg_state() to call mod_obj_stock_state() internally instead of the other way around. Cheers, Longman
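The naming convention discussed above can be sketched with userspace stand-ins (the irq helpers below are hypothetical counters, not the kernel primitives): the double-underscore variant documents that the caller has already disabled interrupts or preemption, while the plain variant performs the save/restore itself.

```c
#include <assert.h>

static int irq_disable_depth;	/* stand-in for the real irq state */
static long lruvec_stat;

static void local_irq_save(unsigned long *flags)
{
	*flags = 0;
	irq_disable_depth++;
}

static void local_irq_restore(unsigned long flags)
{
	(void)flags;
	irq_disable_depth--;
}

/* __ prefix: caller guarantees irqs are already disabled */
static void __mod_memcg_lruvec_state(int idx, int nr)
{
	assert(irq_disable_depth > 0);	/* catch misuse early */
	(void)idx;
	lruvec_stat += nr;
}

/* plain variant: handles the irq save/restore itself */
static void mod_memcg_lruvec_state(int idx, int nr)
{
	unsigned long flags;

	local_irq_save(&flags);
	__mod_memcg_lruvec_state(idx, nr);
	local_irq_restore(flags);
}
```

A counter update that slips outside the protected region shows up as an imbalance much later, which is why the convention is enforced by naming rather than left implicit.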
Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
On 4/15/21 2:10 PM, Johannes Weiner wrote: On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote: On 4/15/21 12:30 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote: In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge() is followed by mod_objcg_state()/mod_memcg_state(). Each of these function call goes through a separate irq_save/irq_restore cycle. That is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state() that combines them with a single irq_save/irq_restore cycle. @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) refill_obj_stock(objcg, size); } +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size, + struct pglist_data *pgdat, int idx) The optimization makes sense. But please don't combine independent operations like this into a single function. It makes for an unclear parameter list, it's a pain in the behind to change the constituent operations later on, and it has a habit of attracting more random bools over time. E.g. what if the caller already has irqs disabled? What if it KNOWS that irqs are enabled and it could use local_irq_disable() instead of save? Just provide an __obj_cgroup_uncharge() that assumes irqs are disabled, combine with the existing __mod_memcg_lruvec_state(), and bubble the irq handling up to those callsites which know better. That will also work. However, the reason I did that was because of patch 5 in the series. I could put the get_obj_stock() and put_obj_stock() code in slab.h and allowed them to be used directly in various places, but hiding in one function is easier. Yeah it's more obvious after getting to patch 5. But with the irq disabling gone entirely, is there still an incentive to combine the atomic section at all? Disabling preemption is pretty cheap, so it wouldn't matter to just do it twice. I.e. 
couldn't the final sequence in slab code simply be objcg_uncharge() mod_objcg_state() again and each function disables preemption (and in the rare case irqs) as it sees fit? You lose the irqsoff batching in the cold path, but as you say, hit rates are pretty good, and it doesn't seem worth complicating the code for the cold path. That does make sense, though a little bit of performance may be lost. I will try that out to see how it works out performance-wise. Cheers, Longman
Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access
On 4/15/21 1:53 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote: Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two object stocks for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupted disabled. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt This makes sense, and also explains the previous patch a bit better. But please merge those two. The reason I broke it into two is so that the patches are individually easier to review. I prefer to update the commit log of patch 4 to explain why the obj_stock structure is introduced instead of merging the two. @@ -2229,7 +2229,8 @@ struct obj_stock { struct memcg_stock_pcp { struct mem_cgroup *cached; /* this never be root cgroup */ unsigned int nr_pages; - struct obj_stock obj; + struct obj_stock task_obj; + struct obj_stock irq_obj; struct work_struct work; unsigned long flags; @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, } #endif +/* + * Most kmem_cache_alloc() calls are from user context. 
The irq disable/enable + * sequence used in this case to access content from object stock is slow. + * To optimize for user context access, there are now two object stocks for + * task context and interrupt context access respectively. + * + * The task context object stock can be accessed by disabling preemption only + * which is cheap in non-preempt kernel. The interrupt context object stock + * can only be accessed after disabling interrupt. User context code can + * access interrupt object stock, but not vice versa. + */ static inline struct obj_stock *current_obj_stock(void) { struct memcg_stock_pcp *stock = this_cpu_ptr(_stock); - return >obj; + return in_task() ? >task_obj : >irq_obj; +} + +#define get_obj_stock(flags) \ +({ \ + struct memcg_stock_pcp *stock; \ + struct obj_stock *obj_stock;\ + \ + if (in_task()) {\ + preempt_disable(); \ + (flags) = -1L; \ + stock = this_cpu_ptr(_stock); \ + obj_stock = >task_obj;\ + } else {\ + local_irq_save(flags); \ + stock = this_cpu_ptr(_stock); \ + obj_stock = >irq_obj; \ + } \ + obj_stock; \ +}) + +static inline void put_obj_stock(unsigned long flags) +{ + if (flags == -1L) + preempt_enable(); + else + local_irq_restore(flags); } Please make them both functions and use 'unsigned long *flags'. Sure, I can do that. Also I'm not sure doing in_task() twice would actually be more expensive than the == -1 special case, and easier to understand. I can make that change too. Either way is fine with me. 
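Following the reviewer's request, the macro pair above could become inline functions taking 'unsigned long *flags'. The sketch below uses hypothetical userspace stubs for in_task(), preempt_disable()/preempt_enable() and local_irq_save()/local_irq_restore() so the control flow can be exercised outside the kernel; only the branch structure matches the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel primitives */
static bool task_context = true;
static int preempt_count, irq_disabled;

static bool in_task(void) { return task_context; }
static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void) { preempt_count--; }
static void local_irq_save(unsigned long *flags) { *flags = 0; irq_disabled++; }
static void local_irq_restore(unsigned long flags) { (void)flags; irq_disabled--; }

struct obj_stock { int dummy; };
struct memcg_stock_pcp { struct obj_stock task_obj, irq_obj; };
static struct memcg_stock_pcp memcg_stock;	/* per-cpu in the kernel */

/* Function form of get_obj_stock(): pick the stock matching the
 * context and record, via the flags sentinel, how to undo it. */
static struct obj_stock *get_obj_stock(unsigned long *pflags)
{
	struct memcg_stock_pcp *stock;

	if (in_task()) {
		preempt_disable();
		*pflags = -1UL;		/* sentinel: irqs were not disabled */
		stock = &memcg_stock;	/* this_cpu_ptr(&memcg_stock) in-kernel */
		return &stock->task_obj;
	}
	local_irq_save(pflags);
	stock = &memcg_stock;
	return &stock->irq_obj;
}

static void put_obj_stock(unsigned long flags)
{
	if (flags == -1UL)
		preempt_enable();
	else
		local_irq_restore(flags);
}
```

As noted in the thread, calling in_task() again in put_obj_stock() instead of testing the -1 sentinel would also work and may be no more expensive.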
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy) local_irq_save(flags); stock = this_cpu_ptr(_stock); - drain_obj_stock(>obj); + drain_obj_stock(>irq_obj); + if (in_task()) + drain_obj_stock(>task_obj); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, >flags); @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg, memcg = obj_cgroup_memcg(objcg); if (pgdat) lruvec = mem_cgroup_lruvec(memcg, pgdat); - __mod_memcg_lruvec_state(memcg, lruvec, idx, nr); + mod_memcg_lruvec_state(memcg, lruvec, idx, nr); rcu_read_unlock(); This is actually a bug introduced in the earlier patch, isn't it? Calling __mod_memcg_lruvec_state() without irqs disabled... Not really, in patch 3,
Re: [PATCH v2] locking/qrwlock: Fix ordering in queued_write_lock_slowpath
On 4/15/21 1:27 PM, Ali Saidi wrote:
While this code is executed with the wait_lock held, a reader can acquire the lock without holding wait_lock. The writer side loops checking the value with the atomic_cond_read_acquire(), but only truly acquires the lock when the compare-and-exchange is completed successfully, which isn't ordered. This exposes the window between the acquire and the cmpxchg to an A-B-A problem which allows reads following the lock acquisition to observe values speculatively before the write lock is truly acquired. We've seen a problem in epoll where the reader does a xchg while holding the read lock, but the writer can see a value change out from under it.

Writer                                 | Reader 2
ep_scan_ready_list()                   |
|- write_lock_irq()                    |
   |- queued_write_lock_slowpath()     |
      |- atomic_cond_read_acquire()    |
                                       | read_lock_irqsave(&ep->lock, flags);
  --> (observes value before unlock)   | chain_epi_lockless()
  |                                    |   epi->next = xchg(&ep->ovflist, epi);
  |                                    | read_unlock_irqrestore(&ep->lock, flags);
  |                                    |
  | atomic_cmpxchg_relaxed()           |
  |-- READ_ONCE(ep->ovflist);          |

A core can order the read of the ovflist ahead of the atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire semantics addresses this issue at which point the atomic_cond_read can be switched to use relaxed semantics.
Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock")
Signed-off-by: Ali Saidi
Cc: sta...@vger.kernel.org
Acked-by: Will Deacon
Tested-by: Steve Capper
Reviewed-by: Steve Capper
---
 kernel/locking/qrwlock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 4786dd271b45..10770f6ac4d9 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	/* When no more readers or writers, set the locked flag */
 	do {
-		atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
-	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
+		atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+	} while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
 		_QW_LOCKED) != _QW_WAITING);
 unlock:
 	arch_spin_unlock(&lock->wait_lock);

Acked-by: Waiman Long
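The fixed tail of the slowpath can be modelled with C11 atomics in userspace (this is a sketch, not the kernel implementation; the constants use the kernel's lock-word values): the spin loop uses relaxed loads, and the acquire ordering is attached to the compare-and-exchange that actually takes the lock.

```c
#include <stdatomic.h>

#define _QW_WAITING 0x100U	/* a writer is waiting */
#define _QW_LOCKED  0x0ffU	/* a writer holds the lock */

static atomic_uint cnts;	/* stands in for lock->cnts */

/*
 * Spin until only the waiting writer remains, then take the lock.
 * Mirrors atomic_cond_read_relaxed() + atomic_cmpxchg_acquire():
 * the successful CAS is the operation that orders later reads.
 */
static void write_lock_tail(void)
{
	unsigned int expected;

	do {
		while (atomic_load_explicit(&cnts, memory_order_relaxed)
		       != _QW_WAITING)
			;	/* relaxed polling, no ordering implied */
		expected = _QW_WAITING;
	} while (!atomic_compare_exchange_strong_explicit(
			&cnts, &expected, _QW_LOCKED,
			memory_order_acquire, memory_order_relaxed));
}
```

With the acquire on the CAS, a later READ_ONCE(ep->ovflist) in the epoll scenario cannot be satisfied from a value observed before the lock was truly acquired.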
Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/15/21 1:10 PM, Matthew Wilcox wrote:
On Tue, Apr 13, 2021 at 09:20:22PM -0400, Waiman Long wrote:
With memory accounting disabled, the run time was 2.848s. With memory accounting enabled, the run times with the application of various patches in the patchset were:

  Applied patches   Run time   Accounting overhead   Overhead %age
  ---------------   --------   -------------------   -------------
       None         10.800s          7.952s              100.0%
        1-2          9.140s          6.292s               79.1%
        1-3          7.641s          4.793s               60.3%
        1-5          6.801s          3.953s               49.7%

I think this is a misleading way to report the overhead. I would have said:

       10.800s          7.952s              279.2%
        9.140s          6.292s              220.9%
        7.641s          4.793s              168.3%
        6.801s          3.953s              138.8%

What I want to emphasize is the reduction in the accounting overhead part of execution time. Your percentage used the accounting-disabled time as the denominator. I think both are valid; I will be more clear about that in my version of the patch.

Thanks,
Longman
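The two percentage conventions differ only in the denominator, which a few lines of arithmetic make explicit (baseline 2.848s and unpatched overhead 7.952s are taken from the figures above):

```c
/*
 * Overhead of an accounting-enabled run, as a percentage of 'denom'.
 * denom = 7.952 (unpatched overhead)          -> Waiman's numbers
 * denom = 2.848 (accounting-disabled runtime) -> Matthew's numbers
 */
static double overhead_pct(double run_time, double denom)
{
	const double baseline = 2.848;	/* accounting disabled */

	return (run_time - baseline) / denom * 100.0;
}
```

So 10.800s yields 7.952/7.952 = 100.0% in one convention and 7.952/2.848 = 279.2% in the other; both describe the same measurements.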
Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
On 4/15/21 12:50 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote: Before the new slab memory controller with per object byte charging, charging and vmstat data update happen only when new slab pages are allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reducing it, this patch makes the vmstat data cached in the memcg_stock_pcp structure as well until it accumulates a page size worth of update or when other cached data change. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 17% (946796 out of 5515184) of the time when __mod_obj_stock_state() is called leads to an actual call to mod_objcg_state() after initial boot. When doing parallel kernel build, the figure was about 16% (21894614 out of 139780628). So caching the vmstat data reduces the number of calls to mod_objcg_state() by more than 80%. Right, but mod_objcg_state() is itself already percpu-cached. What's the benefit of avoiding calls to it with another percpu cache? There are actually 2 set of vmstat data that have to be updated. One is associated with the memcg and other one is for each lruvec within the cgroup. Caching it in obj_stock, we replace 2 writes to two colder cachelines with one write to a hot cacheline. If you look at patch 5, I break obj_stock into two - one for task context and one for irq context. Interrupt disable is no longer needed in task context, but that is not possible when writing to the actual vmstat data arrays. Cheers, Longman
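The accumulate-then-flush idea described above can be sketched in plain C with hypothetical simplified stand-ins (a bare array replaces the real vmstat machinery, and per-cpu/irq handling is omitted): small deltas are cached per index and the expensive update is only issued once a page worth of bytes has accumulated or the cached index changes.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096

/* stand-in for the real (colder, irq-safe) vmstat arrays */
static long vmstat[8];
static int flush_calls;

struct obj_stock_cache {
	int vmstat_idx;		/* which counter is being cached */
	int vmstat_bytes;	/* accumulated but unflushed delta */
};

/* stands in for the real mod_objcg_state() flush */
static void mod_objcg_state(int idx, int nr)
{
	vmstat[idx] += nr;
	flush_calls++;
}

/*
 * Cache small vmstat deltas in the per-cpu stock; only flush when a
 * page worth of updates has accumulated or the cached index changes.
 */
static void mod_obj_stock_state(struct obj_stock_cache *stock, int idx, int nr)
{
	if (stock->vmstat_bytes && stock->vmstat_idx != idx) {
		mod_objcg_state(stock->vmstat_idx, stock->vmstat_bytes);
		stock->vmstat_bytes = 0;
	}
	stock->vmstat_idx = idx;
	stock->vmstat_bytes += nr;
	if (abs(stock->vmstat_bytes) >= PAGE_SIZE) {
		mod_objcg_state(idx, stock->vmstat_bytes);
		stock->vmstat_bytes = 0;
	}
}
```

This is the mechanism behind the quoted hit rates: with typical object sizes, well over 80% of calls terminate in the cheap cached path.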
Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
On 4/15/21 12:40 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote: The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily available. So both of them are now passed to mod_memcg_lruvec_state() and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is updated to allow either of the two parameters to be set to null. This makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec is null. The new __mod_memcg_lruvec_state() function will be used in the next patch as a replacement of mod_memcg_state() in mm/percpu.c for the consolidation of the memory uncharge and vmstat update functions in the kmem_cache_free() path. This requires users who want both to pass a pgdat that can be derived from the lruvec. This is error prone, and we just acked a patch that removes this very thing from mem_cgroup_page_lruvec(). With the suggestion for patch 2, this shouldn't be necessary anymore, though. And it sort of underlines my point about that combined function creating awkward code above and below it. The reason for passing in the pgdat is the caching of vmstat data. lruvec may be gone if the corresponding memory cgroup is removed, but pgdat should stay put. That is why I put pgdat in the obj_stock for caching. I could also put the node id instead of pgdat. Cheers, Longman
Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath
On 4/15/21 12:45 PM, Will Deacon wrote: With that in mind, it would probably be a good idea to eyeball the qspinlock slowpath as well, as that uses both atomic_cond_read_acquire() and atomic_try_cmpxchg_relaxed(). It seems plausible that the same thing could occur here in qspinlock:

	if ((val & _Q_TAIL_MASK) == tail) {
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release; /* No contention */
	}

Just been thinking about this, but I don't see an issue here because everybody is queuing the same way (i.e. we don't have a mechanism to jump the queue like we do for qrwlock) and the tail portion of the lock word isn't susceptible to ABA. That is, once we're at the head of the queue and we've seen the lock become unlocked via atomic_cond_read_acquire(), then we know we hold it. So qspinlock looks fine to me, but I'd obviously value anybody else's opinion on that. I agree with your assessment of qspinlock. I think qspinlock is fine. Cheers, Longman
Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
On 4/15/21 12:30 PM, Johannes Weiner wrote: On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote: In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge() is followed by mod_objcg_state()/mod_memcg_state(). Each of these function call goes through a separate irq_save/irq_restore cycle. That is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state() that combines them with a single irq_save/irq_restore cycle. @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) refill_obj_stock(objcg, size); } +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size, + struct pglist_data *pgdat, int idx) The optimization makes sense. But please don't combine independent operations like this into a single function. It makes for an unclear parameter list, it's a pain in the behind to change the constituent operations later on, and it has a habit of attracting more random bools over time. E.g. what if the caller already has irqs disabled? What if it KNOWS that irqs are enabled and it could use local_irq_disable() instead of save? Just provide an __obj_cgroup_uncharge() that assumes irqs are disabled, combine with the existing __mod_memcg_lruvec_state(), and bubble the irq handling up to those callsites which know better. That will also work. However, the reason I did that was because of patch 5 in the series. I could put the get_obj_stock() and put_obj_stock() code in slab.h and allowed them to be used directly in various places, but hiding in one function is easier. Anyway, I can change the patch if you think that is the right way. Cheers, Longman
Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath
On 4/15/21 10:25 AM, Ali Saidi wrote:
While this code is executed with the wait_lock held, a reader can acquire the lock without holding wait_lock. The writer side loops checking the value with the atomic_cond_read_acquire(), but only truly acquires the lock when the compare-and-exchange is completed successfully, which isn't ordered. The other atomic operations from this point are release-ordered and thus reads after the lock acquisition can be completed before the lock is truly acquired, which violates the guarantees the lock should be making.

Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock")
Signed-off-by: Ali Saidi
Cc: sta...@vger.kernel.org
---
 kernel/locking/qrwlock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 4786dd271b45..10770f6ac4d9 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
 	/* When no more readers or writers, set the locked flag */
 	do {
-		atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
-	} while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
+		atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
+	} while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
 		_QW_LOCKED) != _QW_WAITING);
 unlock:
 	arch_spin_unlock(&lock->wait_lock);

I think the original code isn't wrong. The read_acquire provides the acquire barrier for cmpxchg. Because of conditional dependency, the wait_lock unlock won't happen until the cmpxchg succeeds. Without doing a read_acquire, there may be a higher likelihood that the cmpxchg may fail. Anyway, I will let Will or Peter chime in on this as I am not as proficient as them on this topic. Cheers, Longman
Re: [PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/14/21 11:26 PM, Masayoshi Mizuma wrote:
Hi Longman,
Thank you for your patches. I reran the benchmark with your patches; it seems that the reduction is small... The total durations of the sendto() and recvfrom() system calls during the benchmark are as follows.

- sendto
  - v5.8 vanilla: 2576.056 msec (100%)
  - v5.12-rc7 vanilla: 2988.911 msec (116%)
  - v5.12-rc7 with your patches (1-5): 2984.307 msec (115%)
- recvfrom
  - v5.8 vanilla: 2113.156 msec (100%)
  - v5.12-rc7 vanilla: 2305.810 msec (109%)
  - v5.12-rc7 with your patches (1-5): 2287.351 msec (108%)

kmem_cache_alloc()/kmem_cache_free() are called around 1,400,000 times during the benchmark. I ran a loop in a kernel module as follows. The duration is reduced by your patches actually.

---
dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
for (i = 0; i < 140; i++) {
	p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
	kmem_cache_free(dummy_cache, p);
}
---

- v5.12-rc7 vanilla: 110 msec (100%)
- v5.12-rc7 with your patches (1-5): 85 msec (77%)

It seems that the reduction is small for the benchmark though... Anyway, I can see your patches reduce the overhead. Please feel free to add:

Tested-by: Masayoshi Mizuma

Thanks!
Masa

Thanks for the testing. I was focusing on your kernel module benchmark in testing my patch. I will try out your pgbench benchmark to see if there can be other tuning that can be done. BTW, how many NUMA nodes does your test machine have? I did my testing with a 2-socket system. The vmstat caching part may be less effective on systems with more NUMA nodes. I will try to find a larger 4-socket system for testing. Cheers, Longman
Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
On 4/14/21 1:26 PM, Andi Kleen wrote: The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a result, I think a module parameter will be no different from a kernel command line parameter in this regard. You can still change it in /sys at runtime, even if it's in the vmlinux. I see, thanks for the clarification. Cheers, Longman
Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
On 4/13/21 5:01 PM, Alex Kogan wrote: Hi, Andi. Thanks for your comments! On Apr 13, 2021, at 2:03 AM, Andi Kleen wrote: Alex Kogan writes: + numa_spinlock_threshold=[NUMA, PV_OPS] + Set the time threshold in milliseconds for the + number of intra-node lock hand-offs before the + NUMA-aware spinlock is forced to be passed to + a thread on another NUMA node. Valid values + are in the [1..100] range. Smaller values result + in a more fair, but less performant spinlock, + and vice versa. The default value is 10. ms granularity seems very coarse grained for this. Surely at some point of spinning you can afford a ktime_get? But ok. We are reading time when we are at the head of the (main) queue, but don’t have the lock yet. Not sure about the latency of ktime_get(), but anything reasonably fast but not necessarily precise should work. Could you turn that into a moduleparm which can be changed at runtime? Would be strange to have to reboot just to play with this parameter Yes, good suggestion, thanks. This would also make the code a lot shorter I guess. So you don’t think we need the command-line parameter, just the module_param? The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a result, I think a module parameter will be no different from a kernel command line parameter in this regard. Cheers, Longman
[PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct
The object stock data stored in struct memcg_stock_pcp are independent of the other page-based data stored there. Separate them out into their own struct to highlight their independence.

Signed-off-by: Waiman Long
Acked-by: Roman Gushchin
Reviewed-by: Shakeel Butt
---
 mm/memcontrol.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;
 
 	local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();
 
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-- 
2.18.1
[PATCH v3 5/5] mm/memcg: Optimize user context object stock access
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured share of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from the object stock is slow. To optimize for user context access, there are now two object stocks, for task context and interrupt context access respectively.

The task context object stock can be accessed after disabling preemption, which is cheap on a non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupts. User context code can access the interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupts disabled.

The downside of this change is that more data are stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long
Acked-by: Roman Gushchin
Reviewed-by: Shakeel Butt
---
 mm/memcontrol.c | 74 +++--
 1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..8875e896e52b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
 
-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)				\
+({							\
+	struct memcg_stock_pcp *stock;			\
+	struct obj_stock *obj_stock;			\
+							\
+	if (in_task()) {				\
+		preempt_disable();			\
+		(flags) = -1L;				\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+	} else {					\
+		local_irq_save(flags);			\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+	}						\
+	obj_stock;					\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }
 
 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }
 
@@ -3193,15 +3233,14 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_irq_save(flags);
+	stock = get_obj_stock(flags);
 
-	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	put_obj_stock(flags);
 
 	return ret;
 }
 
@@ -3254,8 +3293,13 @@ static bool o
[PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily available. So both of them are now passed to mod_memcg_lruvec_state() and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is updated to allow either of the two parameters to be set to null. This makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec is null. The new __mod_memcg_lruvec_state() function will be used in the next patch as a replacement of mod_memcg_state() in mm/percpu.c for the consolidation of the memory uncharge and vmstat update functions in the kmem_cache_free() path. Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt --- include/linux/memcontrol.h | 12 +++- mm/memcontrol.c| 19 +-- mm/slab.h | 2 +- 3 files changed, 21 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c04d39a7967..95f12996e66c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; } -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val); +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec, + enum node_stat_item idx, int val); void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val); static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, @@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, local_irq_restore(flags); } -static inline void mod_memcg_lruvec_state(struct lruvec *lruvec, +static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg, + struct lruvec *lruvec, enum node_stat_item idx, int val) { unsigned long flags; local_irq_save(flags); - __mod_memcg_lruvec_state(lruvec, idx, val); + __mod_memcg_lruvec_state(memcg, lruvec, idx, val); local_irq_restore(flags); } @@ -1369,7 +1370,8 @@ static inline unsigned long 
lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx); } -static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec, +static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, + struct lruvec *lruvec, enum node_stat_item idx, int val) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..d66e1e38f8ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) return mem_cgroup_nodeinfo(parent, nid); } -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val) +/* + * Either one of memcg or lruvec can be NULL, but not both. + */ +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec, + enum node_stat_item idx, int val) { struct mem_cgroup_per_node *pn; - struct mem_cgroup *memcg; long x, threshold = MEMCG_CHARGE_BATCH; + /* Update lruvec */ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - memcg = pn->memcg; + + if (!memcg) + memcg = pn->memcg; /* Update memcg */ __mod_memcg_state(memcg, idx, val); - /* Update lruvec */ + if (!lruvec) + return; + __this_cpu_add(pn->lruvec_stat_local->count[idx], val); if (vmstat_item_in_bytes(idx)) @@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, /* Update memcg and lruvec */ if (!mem_cgroup_disabled()) - __mod_memcg_lruvec_state(lruvec, idx, val); + __mod_memcg_lruvec_state(NULL, lruvec, idx, val); } void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, diff --git a/mm/slab.h b/mm/slab.h index 076582f58f68..bc6c7545e487 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + mod_memcg_lruvec_state(memcg, lruvec, idx, nr); rcu_read_unlock(); } -- 2.18.1
[PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
Before the new slab memory controller with per-object byte charging, charging and vmstat data updates happened only when new slab pages were allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charges for a specific obj_cgroup to reduce that overhead. To further reduce it, this patch caches the vmstat data in the memcg_stock_pcp structure as well, until it accumulates a page's worth of updates or until other cached data change. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, about 17% (946796 out of 5515184) of the calls to __mod_obj_stock_state() led to an actual call to mod_objcg_state() after initial boot. During a parallel kernel build, the figure was about 16% (21894614 out of 139780628). So caching the vmstat data reduces the number of calls to mod_objcg_state() by more than 80%.
Signed-off-by: Waiman Long
Reviewed-by: Shakeel Butt
---
 mm/memcontrol.c | 78 +++--
 mm/slab.h | 26 +++--
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
 
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif
 
 	struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }
 
+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}
 
+	if (stock->vmstat_bytes) {
+		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				stock->vmstat_bytes);
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+		stock->cached_pgdat = NULL;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	local_irq_restore(flags);
 }
 
+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+				  struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(nr) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_obj_stock_state(objcg, pgdat, idx, nr);
+	local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
 	struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 				   struct pglist_data *pgdat, int idx)
 {
 	unsigned long flags;
-	struct mem_cgroup
[PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge() is followed by mod_objcg_state()/mod_memcg_state(). Each of these function calls goes through a separate irq_save/irq_restore cycle. That is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state() that combines them into a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long
Reviewed-by: Shakeel Butt
Acked-by: Roman Gushchin
---
 include/linux/memcontrol.h | 2 ++
 mm/memcontrol.c | 31 +++
 mm/percpu.c | 9 ++---
 mm/slab.h | 6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }
 
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	unsigned long flags;
 
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, nr_bytes);
 	local_irq_restore(flags);
 }
 
@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size);
 }
 
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx)
+{
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, size);
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+	rcu_read_unlock();
+	local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
 
-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-	rcu_read_lock();
-	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
-	rcu_read_unlock();
-
+	obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+				      NULL, MEMCG_PERCPU_B);
 	obj_cgroup_put(objcg);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 			continue;
 
 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+					      page_pgdat(page),
+					      cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
-- 
2.18.1
[PATCH v3 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
v3:
- Add missing "inline" qualifier to the alternate mod_obj_stock_state() in patch 3.
- Remove redundant current_obj_stock() call in patch 5.

v2:
- Fix bug found by test robot in patch 5.
- Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). Workloads that perform a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init time was used for benchmarking. The test was run on a Cascade Lake server with turbo-boosting disabled to reduce run-to-run variation. With memory accounting disabled, the run time was 2.848s. With memory accounting enabled, the run times with the various patches in the patchset applied were:

 Applied patches   Run time   Accounting overhead   Overhead %age
 ---------------   --------   -------------------   -------------
      None         10.800s         7.952s               100.0%
       1-2          9.140s         6.292s                79.1%
       1-3          7.641s         4.793s                60.3%
       1-5          6.801s         3.953s                49.7%

Note that this is the best-case scenario, where most updates happen only to the percpu stocks. Real workloads will likely have a certain amount of updates to the memcg charges and vmstats, so the performance benefit will be smaller.

It was found that a big part of the memory accounting overhead was caused by the local_irq_save()/local_irq_restore() sequences used to update the local stock charge bytes and vmstat array, at least on x86 systems. There are two such sequences in kmem_cache_alloc() and two in kmem_cache_free(). This patchset reduces the use of such sequences as much as possible; in fact, it eliminates them in the common case.
Another part of this patchset caches the vmstat data updates in the local stock as well, which also helps.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h | 14 ++-
 mm/memcontrol.c | 199 -
 mm/percpu.c | 9 +-
 mm/slab.h | 32 +++---
 4 files changed, 196 insertions(+), 58 deletions(-)

-- 
2.18.1
Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access
On 4/12/21 7:10 PM, Shakeel Butt wrote: On Mon, Apr 12, 2021 at 3:55 PM Waiman Long wrote: Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured amount of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from object stock is slow. To optimize for user context access, there are now two object stocks for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption which is cheap in non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupt. User context code can access interrupt object stock, but not vice versa. The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupted disabled. The downside of this change is that there are more data stored in local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance. Signed-off-by: Waiman Long Acked-by: Roman Gushchin --- mm/memcontrol.c | 73 +++-- 1 file changed, 59 insertions(+), 14 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 69f728383efe..29f2df76644a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2229,7 +2229,8 @@ struct obj_stock { struct memcg_stock_pcp { struct mem_cgroup *cached; /* this never be root cgroup */ unsigned int nr_pages; - struct obj_stock obj; + struct obj_stock task_obj; + struct obj_stock irq_obj; struct work_struct work; unsigned long flags; @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, } #endif +/* + * Most kmem_cache_alloc() calls are from user context. The irq disable/enable + * sequence used in this case to access content from object stock is slow. 
+ * To optimize for user context access, there are now two object stocks for + * task context and interrupt context access respectively. + * + * The task context object stock can be accessed by disabling preemption only + * which is cheap in non-preempt kernel. The interrupt context object stock + * can only be accessed after disabling interrupt. User context code can + * access interrupt object stock, but not vice versa. + */ static inline struct obj_stock *current_obj_stock(void) { struct memcg_stock_pcp *stock = this_cpu_ptr(_stock); - return >obj; + return in_task() ? >task_obj : >irq_obj; +} + +#define get_obj_stock(flags) \ +({ \ + struct memcg_stock_pcp *stock; \ + struct obj_stock *obj_stock;\ + \ + if (in_task()) {\ + preempt_disable(); \ + (flags) = -1L; \ + stock = this_cpu_ptr(_stock); \ The above line was missing in the previous version. + obj_stock = >task_obj; \ + } else {\ + local_irq_save(flags); \ + stock = this_cpu_ptr(_stock); \ + obj_stock = >irq_obj;\ + } \ + obj_stock; \ +}) + +static inline void put_obj_stock(unsigned long flags) +{ + if (flags == -1L) + preempt_enable(); + else + local_irq_restore(flags); } /** @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy) local_irq_save(flags); stock = this_cpu_ptr(_stock); - drain_obj_stock(>obj); + drain_obj_stock(>irq_obj); + if (in_task()) + drain_obj_stock(>task_obj); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, >flags); @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg, memcg = obj_cgroup_memcg(objcg); if (pgdat) lruvec = mem_cgroup_lruvec(memcg, pgdat); - __mod_memcg_lruvec_state(memcg, lruvec, idx, nr); + mod_memcg_lruvec_state(memcg, lruvec, idx, nr); rcu_read_unlock(); } @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) unsigned long flags; bool ret = false; - local_irq_save(flags); + stock = get_obj_stock(flags); stock = current_obj_stock(); The above is redundant. Right. 
I should check the patch carefully. Will remove it. Thanks, Longman
[PATCH v2 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
v2:
- Fix bug found by test robot in patch 5.
- Update cover letter and commit logs.

With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). Workloads that perform a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1] and [2].

A simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init time was used for benchmarking. The test was run on a Cascade Lake server with turbo-boosting disabled to reduce run-to-run variation. With memory accounting disabled, the run time was 2.848s. With memory accounting enabled, the run times with the various patches in the patchset applied were:

 Applied patches   Run time   Accounting overhead   Overhead %age
 ---------------   --------   -------------------   -------------
      None         10.800s         7.952s               100.0%
       1-2          9.140s         6.292s                79.1%
       1-3          7.641s         4.793s                60.3%
       1-5          6.801s         3.953s                49.7%

Note that this is the best-case scenario, where most updates happen only to the percpu stocks. Real workloads will likely have a certain amount of updates to the memcg charges and vmstats, so the performance benefit will be smaller.

It was found that a big part of the memory accounting overhead was caused by the local_irq_save()/local_irq_restore() sequences used to update the local stock charge bytes and vmstat array, at least on x86 systems. There are two such sequences in kmem_cache_alloc() and two in kmem_cache_free(). This patchset reduces the use of such sequences as much as possible; in fact, it eliminates them in the common case.

Another part of this patchset caches the vmstat data updates in the local stock as well, which also helps.
[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/

Waiman Long (5):
  mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
  mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Separate out object stock data into its own struct
  mm/memcg: Optimize user context object stock access

 include/linux/memcontrol.h | 14 ++-
 mm/memcontrol.c | 200 -
 mm/percpu.c | 9 +-
 mm/slab.h | 32 +++---
 4 files changed, 197 insertions(+), 58 deletions(-)

-- 
2.18.1
[PATCH v2 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily available. So both of them are now passed to mod_memcg_lruvec_state() and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is updated to allow either of the two parameters to be set to null. This makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec is null. The new __mod_memcg_lruvec_state() function will be used in the next patch as a replacement of mod_memcg_state() in mm/percpu.c for the consolidation of the memory uncharge and vmstat update functions in the kmem_cache_free() path. Signed-off-by: Waiman Long Acked-by: Roman Gushchin Reviewed-by: Shakeel Butt --- include/linux/memcontrol.h | 12 +++- mm/memcontrol.c| 19 +-- mm/slab.h | 2 +- 3 files changed, 21 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c04d39a7967..95f12996e66c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; } -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val); +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec, + enum node_stat_item idx, int val); void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val); static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, @@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, local_irq_restore(flags); } -static inline void mod_memcg_lruvec_state(struct lruvec *lruvec, +static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg, + struct lruvec *lruvec, enum node_stat_item idx, int val) { unsigned long flags; local_irq_save(flags); - __mod_memcg_lruvec_state(lruvec, idx, val); + __mod_memcg_lruvec_state(memcg, lruvec, idx, val); local_irq_restore(flags); } @@ -1369,7 +1370,8 @@ static inline unsigned long 
lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx); } -static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec, +static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, + struct lruvec *lruvec, enum node_stat_item idx, int val) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..d66e1e38f8ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) return mem_cgroup_nodeinfo(parent, nid); } -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val) +/* + * Either one of memcg or lruvec can be NULL, but not both. + */ +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec, + enum node_stat_item idx, int val) { struct mem_cgroup_per_node *pn; - struct mem_cgroup *memcg; long x, threshold = MEMCG_CHARGE_BATCH; + /* Update lruvec */ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - memcg = pn->memcg; + + if (!memcg) + memcg = pn->memcg; /* Update memcg */ __mod_memcg_state(memcg, idx, val); - /* Update lruvec */ + if (!lruvec) + return; + __this_cpu_add(pn->lruvec_stat_local->count[idx], val); if (vmstat_item_in_bytes(idx)) @@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, /* Update memcg and lruvec */ if (!mem_cgroup_disabled()) - __mod_memcg_lruvec_state(lruvec, idx, val); + __mod_memcg_lruvec_state(NULL, lruvec, idx, val); } void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, diff --git a/mm/slab.h b/mm/slab.h index 076582f58f68..bc6c7545e487 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + mod_memcg_lruvec_state(memcg, lruvec, idx, nr); rcu_read_unlock(); } -- 2.18.1
[PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct
The object stock data stored in struct memcg_stock_pcp are independent of the other page-based data stored there. Separate them out into their own struct to highlight their independence.

Signed-off-by: Waiman Long
Acked-by: Roman Gushchin
---
 mm/memcontrol.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);
 
-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif
 
+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
 * consume_stock: Try to consume stocked charge on this cpu.
 * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;
 
 	local_irq_save(flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
@@ -3242,8 +3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();
 
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();
 
 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
-- 
2.18.1
[PATCH v2 5/5] mm/memcg: Optimize user context object stock access
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured number of kmem_cache_alloc() calls from non-task context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content from the object stock is slow. To optimize for user context access, there are now two object stocks, for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption, which is cheap in a non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupts. User context code can access the interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupts disabled.

The downside of this change is that more data are stored in the local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long
Acked-by: Roman Gushchin
---
 mm/memcontrol.c | 73 +++--
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..29f2df76644a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;

 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif

+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)				\
+({							\
+	struct memcg_stock_pcp *stock;			\
+	struct obj_stock *obj_stock;			\
+							\
+	if (in_task()) {				\
+		preempt_disable();			\
+		(flags) = -1L;				\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->task_obj;		\
+	} else {					\
+		local_irq_save(flags);			\
+		stock = this_cpu_ptr(&memcg_stock);	\
+		obj_stock = &stock->irq_obj;		\
+	}						\
+	obj_stock;					\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }

 /**
@@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);

 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

@@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }

@@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;

-	local_irq_save(flags);
+	stock = get_obj_stock(flags);

-	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3241,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}

-	local_irq_restore(flags);
+	put_obj_stock(flags);

 	return ret;
 }
@@ -3254,8 -329
[PATCH v2 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge() is followed by mod_objcg_state()/mod_memcg_state(). Each of these function calls goes through a separate irq_save/irq_restore cycle, which is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state() that combines them with a single irq_save/irq_restore cycle.

Signed-off-by: Waiman Long
Reviewed-by: Shakeel Butt
Acked-by: Roman Gushchin
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c            | 31 +++
 mm/percpu.c                |  9 ++---
 mm/slab.h                  |  6 +++---
 4 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 95f12996e66c..6890f999c1a3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void);

 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx);

 extern struct static_key_false memcg_kmem_enabled_key;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d66e1e38f8ac..b19100c68aa0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 	return false;
 }

-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
-
-	local_irq_save(flags);

 	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
@@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)

 	if (stock->nr_bytes > PAGE_SIZE)
 		drain_obj_stock(stock);
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	unsigned long flags;

+	local_irq_save(flags);
+	__refill_obj_stock(objcg, nr_bytes);
 	local_irq_restore(flags);
 }

@@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size);
 }

+void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
+				   struct pglist_data *pgdat, int idx)
+{
+	unsigned long flags;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	local_irq_save(flags);
+	__refill_obj_stock(objcg, size);
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size);
+	rcu_read_unlock();
+	local_irq_restore(flags);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */

 /*
diff --git a/mm/percpu.c b/mm/percpu.c
index 23308113a5ff..fd7aad6d7f90 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
 	chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;

-	obj_cgroup_uncharge(objcg, size * num_possible_cpus());
-
-	rcu_read_lock();
-	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-			-(size * num_possible_cpus()));
-	rcu_read_unlock();
-
+	obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(),
+				      NULL, MEMCG_PERCPU_B);
 	obj_cgroup_put(objcg);
 }

diff --git a/mm/slab.h b/mm/slab.h
index bc6c7545e487..677cdc52e641 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
 			continue;

 		objcgs[off] = NULL;
-		obj_cgroup_uncharge(objcg, obj_full_size(s));
-		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-				-obj_full_size(s));
+		obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s),
+					      page_pgdat(page),
+					      cache_vmstat_idx(s));
 		obj_cgroup_put(objcg);
 	}
 }
--
2.18.1
[PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
Before the new slab memory controller with per-object byte charging, charging and vmstat data updates happened only when new slab pages were allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache the byte charge for a specific obj_cgroup to reduce that overhead. To further reduce it, this patch also caches the vmstat data in the memcg_stock_pcp structure until it accumulates a page size worth of updates or other cached data change.

On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, about 17% (946796 out of 5515184) of the calls to __mod_obj_stock_state() after initial boot led to an actual call to mod_objcg_state(). During a parallel kernel build, the figure was about 16% (21894614 out of 139780628). So caching the vmstat data reduces the number of calls to mod_objcg_state() by more than 80%.

Signed-off-by: Waiman Long
---
 mm/memcontrol.c | 78 +++--
 mm/slab.h       | 26 +++--
 2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {

 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
+	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
+	int vmstat_idx;
+	int vmstat_bytes;
 #endif

 	struct work_struct work;
@@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 	css_put(&memcg->css);
 }

+static inline void mod_objcg_state(struct obj_cgroup *objcg,
+				   struct pglist_data *pgdat,
+				   enum node_stat_item idx, int nr)
+{
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec = NULL;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	if (pgdat)
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	rcu_read_unlock();
+}
+
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
 	struct memcg_stock_pcp *stock;
@@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		stock->nr_bytes = 0;
 	}

+	if (stock->vmstat_bytes) {
+		mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+				stock->vmstat_bytes);
+		stock->vmstat_bytes = 0;
+		stock->vmstat_idx = 0;
+		stock->cached_pgdat = NULL;
+	}
+
 	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
 }
@@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	local_irq_restore(flags);
 }

+static void __mod_obj_stock_state(struct obj_cgroup *objcg,
+				  struct pglist_data *pgdat, int idx, int nr)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	if (stock->cached_objcg != objcg) {
+		/* Output the current data as is */
+	} else if (!stock->vmstat_bytes) {
+		/* Save the current data */
+		stock->vmstat_bytes = nr;
+		stock->vmstat_idx = idx;
+		stock->cached_pgdat = pgdat;
+		nr = 0;
+	} else if ((stock->cached_pgdat != pgdat) ||
+		   (stock->vmstat_idx != idx)) {
+		/* Output the cached data & save the current data */
+		swap(nr, stock->vmstat_bytes);
+		swap(idx, stock->vmstat_idx);
+		swap(pgdat, stock->cached_pgdat);
+	} else {
+		stock->vmstat_bytes += nr;
+		if (abs(nr) > PAGE_SIZE) {
+			nr = stock->vmstat_bytes;
+			stock->vmstat_bytes = 0;
+		} else {
+			nr = 0;
+		}
+	}
+	if (nr)
+		mod_objcg_state(objcg, pgdat, idx, nr);
+}
+
+void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+			 int idx, int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__mod_obj_stock_state(objcg, pgdat, idx, nr);
+	local_irq_restore(flags);
+}
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
 	struct mem_cgroup *memcg;
@@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
 				   struct pglist_data *pgdat, int idx)
 {
 	unsigned long flags;
-	struct mem_cgroup *memcg;
-	struct lruvec *lruvec = NULL;
Re: [PATCH 5/5] mm/memcg: Optimize user context object stock access
On 4/12/21 2:55 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:42PM -0400, Waiman Long wrote:
>> Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured number of kmem_cache_alloc() calls from non-task context was about 0.01% of the total. The irq disable/enable sequence used in this case to access content from the object stock is slow. To optimize for user context access, there are now two object stocks, for task context and interrupt context access respectively.
>>
>> The task context object stock can be accessed after disabling preemption, which is cheap in a non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupts. User context code can access the interrupt object stock, but not vice versa. The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupts disabled.
>>
>> The downside of this change is that more data are stored in the local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance.
>
> I agree, the extra memory space is not a significant concern. I'd be more worried about the code complexity, but the result looks nice to me!
>
> Acked-by: Roman Gushchin
>
> Btw, it seems that the mm tree ran a bit off, so I had to apply this series on top of Linus's tree to review. Please, rebase.

This patchset is based on the code in Linus' tree. I had applied the patchset to linux-next to see if there were any conflicts. Two of the patches had minor fuzzes around the edges, but no actual merge conflicts for now.

Cheers,
Longman
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/12/21 3:05 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). Workloads that require a lot of kmemcache allocations and de-allocations may experience performance regression as illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>
>>   with accounting = 6.798s
>>   w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, which is a 50% reduction.
>
> Btw, there were two recent independent reports about benchmark regressions caused by the introduction of the per-object accounting:
> 1) Xing reported a hackbench regression: https://lkml.org/lkml/2021/1/13/1277
> 2) Masayoshi reported a pgbench regression: https://www.spinics.net/lists/linux-mm/msg252540.html
>
> I wonder if you can run them (or at least one) and attach the result to the series? It would be very helpful.

Actually, it was a bug report filed by Masayoshi-san that triggered me to work on reducing the memory accounting overhead. He is also on the cc line and so is aware of that. I will cc Xing in my v2 patch.

Cheers,
Longman
Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
On 4/12/21 2:22 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per-object byte charging, charging and vmstat data updates happened only when new slab pages were allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache the byte charge for a specific obj_cgroup to reduce that overhead. To further reduce it, this patch also caches the vmstat data in the memcg_stock_pcp structure until it accumulates a page size worth of updates or other cached data change.
>
> The idea makes total sense to me and also gives hope to remove byte-sized vmstats in the long term.
>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, about 17% (946796 out of 5515184) of the calls to __mod_obj_stock_state() after initial boot led to an actual call to mod_objcg_state(). During a parallel kernel build, the figure was about 16% (21894614 out of 139780628). So caching the vmstat data reduces the number of calls to mod_objcg_state() by more than 80%.
>>
>> Signed-off-by: Waiman Long
>> ---
>>  mm/memcontrol.c | 78 +++--
>>  mm/slab.h       | 26 +++--
>>  2 files changed, 79 insertions(+), 25 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b19100c68aa0..539c3b632e47 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
>>  #ifdef CONFIG_MEMCG_KMEM
>>  	struct obj_cgroup *cached_objcg;
>> +	struct pglist_data *cached_pgdat;
>>  	unsigned int nr_bytes;
>> +	int vmstat_idx;
>> +	int vmstat_bytes;
>>  #endif
>
> Because vmstat_idx can realistically take only 3 values (slab_reclaimable, slab_unreclaimable and percpu), I wonder if it's better to have vmstat_bytes[3] and save a bit more on the reduced number of flushes? It must be a common case when a complex (reclaimable) kernel object has non-reclaimable parts (e.g. kmallocs) or percpu counters. If the difference is too small, maybe the current form is better.

I have thought about that too. However, that would make the code more complex, so I decided to cache just one for now. We can certainly play around with caching more in a later patch.

Cheers,
Longman
Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
On 4/12/21 2:04 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:38PM -0400, Waiman Long wrote:
>> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily available. So both of them are now passed to mod_memcg_lruvec_state() and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is updated to allow either of the two parameters to be set to null. This makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec is null.
>
> This patch seems to be correct, but it's a bit hard to understand why it's required without looking into the rest of the series. Can you, please, add a couple of words about it? E.g. we need it to handle stats which do not exist on the lruvec level...
>
> Otherwise, Acked-by: Roman Gushchin

Good point. Will update the commit log to indicate the change is needed for the subsequent patch.

Cheers,
Longman
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/12/21 1:47 PM, Roman Gushchin wrote:
> On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
>> On 4/9/21 9:51 PM, Roman Gushchin wrote:
>>> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>>>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). Workloads that require a lot of kmemcache allocations and de-allocations may experience performance regression as illustrated in [1].
>>>>
>>>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>>>
>>>>   with accounting = 6.798s
>>>>   w/o accounting  = 1.758s
>>>>
>>>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, which is a 50% reduction.
>>>
>>> Hi Waiman!
>>>
>>> Thank you for working on it, it's indeed very useful! A couple of questions:
>>>
>>> 1) did your config include lockdep or not?
>>
>> The test kernel is based on a production kernel config, so lockdep isn't enabled.
>>
>>> 2) do you have a (rough) estimation how much each change contributes to the overall reduction?
>>
>> I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:
>>
>>   Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
>>   After patch 2: time = 9.140s, overhead = 6.292s
>>   After patch 3: time = 7.641s, overhead = 4.793s
>>   After patch 5: time = 6.801s, overhead = 3.953s
>
> Thank you! If there will be a v2, I'd include this information in the commit logs.

Yes, I am planning to send out a v2 with this information in the cover letter. I am just waiting a bit to see if there is more feedback.

-Longman

>> Patches 1 & 4 are preparatory patches that shouldn't affect performance. So the memory accounting overhead was reduced by about half.
>>
>> BTW, the benchmark that I used is kind of a best-case behavior, as all updates are to the percpu stocks. Real workloads will likely have a certain amount of updates to the memcg charges and vmstats, so the performance benefit will be less.
>>
>> Cheers,
>> Longman
Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
On 4/9/21 9:51 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). Workloads that require a lot of kmemcache allocations and de-allocations may experience performance regression as illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() of a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>
>>   with accounting = 6.798s
>>   w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, which is a 50% reduction.
>
> Hi Waiman!
>
> Thank you for working on it, it's indeed very useful! A couple of questions:
>
> 1) did your config include lockdep or not?

The test kernel is based on a production kernel config, so lockdep isn't enabled.

> 2) do you have a (rough) estimation how much each change contributes to the overall reduction?

I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:

  Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
  After patch 2: time = 9.140s, overhead = 6.292s
  After patch 3: time = 7.641s, overhead = 4.793s
  After patch 5: time = 6.801s, overhead = 3.953s

Patches 1 & 4 are preparatory patches that shouldn't affect performance. So the memory accounting overhead was reduced by about half.

Cheers,
Longman
[PATCH 5/5] mm/memcg: Optimize user context object stock access
Most kmem_cache_alloc() calls are from user context. With instrumentation enabled, the measured number of kmem_cache_alloc() calls from non-task context was about 0.01% of the total.

The irq disable/enable sequence used in this case to access content from the object stock is slow. To optimize for user context access, there are now two object stocks, for task context and interrupt context access respectively. The task context object stock can be accessed after disabling preemption, which is cheap in a non-preempt kernel. The interrupt context object stock can only be accessed after disabling interrupts. User context code can access the interrupt object stock, but not vice versa.

The mod_objcg_state() function is also modified to make sure that memcg and lruvec stat updates are done with interrupts disabled.

The downside of this change is that more data are stored in the local object stocks and not reflected in the charge counter and the vmstat arrays. However, this is a small price to pay for better performance.

Signed-off-by: Waiman Long
---
 mm/memcontrol.c | 71 +++--
 1 file changed, 57 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69f728383efe..00c9074e42e5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2229,7 +2229,8 @@ struct obj_stock {
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-	struct obj_stock obj;
+	struct obj_stock task_obj;
+	struct obj_stock irq_obj;

 	struct work_struct work;
 	unsigned long flags;
@@ -2254,11 +2255,46 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif

+/*
+ * Most kmem_cache_alloc() calls are from user context. The irq disable/enable
+ * sequence used in this case to access content from object stock is slow.
+ * To optimize for user context access, there are now two object stocks for
+ * task context and interrupt context access respectively.
+ *
+ * The task context object stock can be accessed by disabling preemption only
+ * which is cheap in non-preempt kernel. The interrupt context object stock
+ * can only be accessed after disabling interrupt. User context code can
+ * access interrupt object stock, but not vice versa.
+ */
 static inline struct obj_stock *current_obj_stock(void)
 {
 	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

-	return &stock->obj;
+	return in_task() ? &stock->task_obj : &stock->irq_obj;
+}
+
+#define get_obj_stock(flags)			\
+({						\
+	struct memcg_stock_pcp *stock;		\
+	struct obj_stock *obj_stock;		\
+						\
+	if (in_task()) {			\
+		preempt_disable();		\
+		(flags) = -1L;			\
+		obj_stock = &stock->task_obj;	\
+	} else {				\
+		local_irq_save(flags);		\
+		obj_stock = &stock->irq_obj;	\
+	}					\
+	obj_stock;				\
+})
+
+static inline void put_obj_stock(unsigned long flags)
+{
+	if (flags == -1L)
+		preempt_enable();
+	else
+		local_irq_restore(flags);
 }

 /**
@@ -2327,7 +2363,9 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);

 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->obj);
+	drain_obj_stock(&stock->irq_obj);
+	if (in_task())
+		drain_obj_stock(&stock->task_obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

@@ -3183,7 +3221,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (pgdat)
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-	__mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
+	mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
 	rcu_read_unlock();
 }

@@ -3193,7 +3231,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;

-	local_irq_save(flags);
+	stock = get_obj_stock(flags);

-	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
@@ -3201,7 +3239,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}

-	local_irq_restore(flags);
+	put_obj_stock(flags);

 	return ret;
 }
@@ -3254,8 +3292,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;

-	if (stock->obj.cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
+
[PATCH 4/5] mm/memcg: Separate out object stock data into its own struct
The object stock data stored in struct memcg_stock_pcp are independent of the other page-based data stored there. Separate them out into their own struct to highlight their independence.

Signed-off-by: Waiman Long
---
 mm/memcontrol.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 539c3b632e47..69f728383efe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2214,17 +2214,22 @@ void unlock_page_memcg(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page_memcg);

-struct memcg_stock_pcp {
-	struct mem_cgroup *cached; /* this never be root cgroup */
-	unsigned int nr_pages;
-
+struct obj_stock {
 #ifdef CONFIG_MEMCG_KMEM
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
 	unsigned int nr_bytes;
 	int vmstat_idx;
 	int vmstat_bytes;
+#else
+	int dummy[0];
 #endif
+};
+
+struct memcg_stock_pcp {
+	struct mem_cgroup *cached; /* this never be root cgroup */
+	unsigned int nr_pages;
+	struct obj_stock obj;

 	struct work_struct work;
 	unsigned long flags;
@@ -2234,12 +2239,12 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);

 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static void drain_obj_stock(struct obj_stock *stock);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 #else
-static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+static inline void drain_obj_stock(struct obj_stock *stock)
 {
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -2249,6 +2254,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 }
 #endif

+static inline struct obj_stock *current_obj_stock(void)
+{
+	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+
+	return &stock->obj;
+}
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2315,7 +2327,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);

 	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
+	drain_obj_stock(&stock->obj);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

@@ -3177,13 +3189,13 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg,
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock;
 	unsigned long flags;
 	bool ret = false;

 	local_irq_save(flags);

-	stock = this_cpu_ptr(&memcg_stock);
+	stock = current_obj_stock();
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -3194,7 +3206,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }

-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock *stock)
 {
 	struct obj_cgroup *old = stock->cached_objcg;

@@ -3242,8 -3254,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;

-	if (stock->cached_objcg) {
-		memcg = obj_cgroup_memcg(stock->cached_objcg);
+	if (stock->obj.cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
@@ -3253,9 +3265,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock *stock = current_obj_stock();

-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3280,7 +3291,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 static void __mod_obj_stock_state(struct obj_cgroup *objcg,
 				  struct pglist_data *pgdat, int idx, int nr)
 {
-	struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
+	struct obj_stock *stock = current_obj_stock();

 	if (stock->cached_objcg != objcg) {
 		/* Output the current data as is */
--
2.18.1
[PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
Before the new slab memory controller with per object byte charging, charging and vmstat data update happen only when new slab pages are allocated or freed. Now they are done with every kmem_cache_alloc() and kmem_cache_free(). This causes additional overhead for workloads that generate a lot of alloc and free calls. The memcg_stock_pcp is used to cache byte charge for a specific obj_cgroup to reduce that overhead. To further reducing it, this patch makes the vmstat data cached in the memcg_stock_pcp structure as well until it accumulates a page size worth of update or when other cached data change. On a 2-socket Cascade Lake server with instrumentation enabled and this patch applied, it was found that about 17% (946796 out of 5515184) of the time when __mod_obj_stock_state() is called leads to an actual call to mod_objcg_state() after initial boot. When doing parallel kernel build, the figure was about 16% (21894614 out of 139780628). So caching the vmstat data reduces the number of calls to mod_objcg_state() by more than 80%. 
Signed-off-by: Waiman Long --- mm/memcontrol.c | 78 +++-- mm/slab.h | 26 +++-- 2 files changed, 79 insertions(+), 25 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b19100c68aa0..539c3b632e47 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2220,7 +2220,10 @@ struct memcg_stock_pcp { #ifdef CONFIG_MEMCG_KMEM struct obj_cgroup *cached_objcg; + struct pglist_data *cached_pgdat; unsigned int nr_bytes; + int vmstat_idx; + int vmstat_bytes; #endif struct work_struct work; @@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) css_put(&memcg->css); } +static inline void mod_objcg_state(struct obj_cgroup *objcg, + struct pglist_data *pgdat, + enum node_stat_item idx, int nr) +{ + struct mem_cgroup *memcg; + struct lruvec *lruvec = NULL; + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + if (pgdat) + lruvec = mem_cgroup_lruvec(memcg, pgdat); + __mod_memcg_lruvec_state(memcg, lruvec, idx, nr); + rcu_read_unlock(); +} + static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) { struct memcg_stock_pcp *stock; @@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) stock->nr_bytes = 0; } + if (stock->vmstat_bytes) { + mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx, + stock->vmstat_bytes); + stock->vmstat_bytes = 0; + stock->vmstat_idx = 0; + stock->cached_pgdat = NULL; + } + obj_cgroup_put(old); stock->cached_objcg = NULL; } @@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) local_irq_restore(flags); } +static void __mod_obj_stock_state(struct obj_cgroup *objcg, + struct pglist_data *pgdat, int idx, int nr) +{ + struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock); + + if (stock->cached_objcg != objcg) { + /* Output the current data as is */ + } else if (!stock->vmstat_bytes) { + /* Save the current data */ + stock->vmstat_bytes = nr; + stock->vmstat_idx = idx; + stock->cached_pgdat = pgdat; + nr = 0; + } else if
((stock->cached_pgdat != pgdat) || + (stock->vmstat_idx != idx)) { + /* Output the cached data & save the current data */ + swap(nr, stock->vmstat_bytes); + swap(idx, stock->vmstat_idx); + swap(pgdat, stock->cached_pgdat); + } else { + stock->vmstat_bytes += nr; + if (abs(nr) > PAGE_SIZE) { + nr = stock->vmstat_bytes; + stock->vmstat_bytes = 0; + } else { + nr = 0; + } + } + if (nr) + mod_objcg_state(objcg, pgdat, idx, nr); +} + +void mod_obj_stock_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, +int idx, int nr) +{ + unsigned long flags; + + local_irq_save(flags); + __mod_obj_stock_state(objcg, pgdat, idx, nr); + local_irq_restore(flags); +} + int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) { struct mem_cgroup *memcg; @@ -3300,18 +3368,10 @@ void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size, struct pglist_data *pgdat, int idx) { unsigned long flags; - struct mem_cgroup *memcg; - struct lruvec *lruvec = N
[PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge() is followed by mod_objcg_state()/mod_memcg_state(). Each of these function calls goes through a separate irq_save/irq_restore cycle. That is inefficient. Introduce a new function obj_cgroup_uncharge_mod_state() that combines them with a single irq_save/irq_restore cycle. Signed-off-by: Waiman Long --- include/linux/memcontrol.h | 2 ++ mm/memcontrol.c| 31 +++ mm/percpu.c| 9 ++--- mm/slab.h | 6 +++--- 4 files changed, 34 insertions(+), 14 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 95f12996e66c..6890f999c1a3 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1592,6 +1592,8 @@ struct obj_cgroup *get_obj_cgroup_from_current(void); int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size); void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size); +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size, + struct pglist_data *pgdat, int idx); extern struct static_key_false memcg_kmem_enabled_key; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d66e1e38f8ac..b19100c68aa0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3225,12 +3225,9 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, return false; } -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +static void __refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) { struct memcg_stock_pcp *stock; - unsigned long flags; - - local_irq_save(flags); stock = this_cpu_ptr(&memcg_stock); if (stock->cached_objcg != objcg) { /* reset if necessary */ @@ -3243,7 +3240,14 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) if (stock->nr_bytes > PAGE_SIZE) drain_obj_stock(stock); +} + +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) +{ + unsigned long flags; + local_irq_save(flags); + __refill_obj_stock(objcg, nr_bytes);
local_irq_restore(flags); } @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) refill_obj_stock(objcg, size); } +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size, + struct pglist_data *pgdat, int idx) +{ + unsigned long flags; + struct mem_cgroup *memcg; + struct lruvec *lruvec = NULL; + + local_irq_save(flags); + __refill_obj_stock(objcg, size); + + rcu_read_lock(); + memcg = obj_cgroup_memcg(objcg); + if (pgdat) + lruvec = mem_cgroup_lruvec(memcg, pgdat); + __mod_memcg_lruvec_state(memcg, lruvec, idx, -(int)size); + rcu_read_unlock(); + local_irq_restore(flags); +} + #endif /* CONFIG_MEMCG_KMEM */ /* diff --git a/mm/percpu.c b/mm/percpu.c index 6596a0a4286e..24ecd51c688c 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1631,13 +1631,8 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT]; chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL; - obj_cgroup_uncharge(objcg, size * num_possible_cpus()); - - rcu_read_lock(); - mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B, - -(size * num_possible_cpus())); - rcu_read_unlock(); - + obj_cgroup_uncharge_mod_state(objcg, size * num_possible_cpus(), + NULL, MEMCG_PERCPU_B); obj_cgroup_put(objcg); } diff --git a/mm/slab.h b/mm/slab.h index bc6c7545e487..677cdc52e641 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -366,9 +366,9 @@ static inline void memcg_slab_free_hook(struct kmem_cache *s_orig, continue; objcgs[off] = NULL; - obj_cgroup_uncharge(objcg, obj_full_size(s)); - mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s), - -obj_full_size(s)); + obj_cgroup_uncharge_mod_state(objcg, obj_full_size(s), + page_pgdat(page), + cache_vmstat_idx(s)); obj_cgroup_put(objcg); } } -- 2.18.1
[PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily available. So both of them are now passed to mod_memcg_lruvec_state() and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is updated to allow either of the two parameters to be set to null. This makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec is null. Signed-off-by: Waiman Long --- include/linux/memcontrol.h | 12 +++- mm/memcontrol.c| 19 +-- mm/slab.h | 2 +- 3 files changed, 21 insertions(+), 12 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c04d39a7967..95f12996e66c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -955,8 +955,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return x; } -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val); +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec, + enum node_stat_item idx, int val); void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val); static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, @@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx, local_irq_restore(flags); } -static inline void mod_memcg_lruvec_state(struct lruvec *lruvec, +static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg, + struct lruvec *lruvec, enum node_stat_item idx, int val) { unsigned long flags; local_irq_save(flags); - __mod_memcg_lruvec_state(lruvec, idx, val); + __mod_memcg_lruvec_state(memcg, lruvec, idx, val); local_irq_restore(flags); } @@ -1369,7 +1370,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec, return node_page_state(lruvec_pgdat(lruvec), idx); } -static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec, +static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, + struct lruvec *lruvec, enum node_stat_item idx, int val) 
{ } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e064ac0d850a..d66e1e38f8ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid) return mem_cgroup_nodeinfo(parent, nid); } -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, - int val) +/* + * Either one of memcg or lruvec can be NULL, but not both. + */ +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec *lruvec, + enum node_stat_item idx, int val) { struct mem_cgroup_per_node *pn; - struct mem_cgroup *memcg; long x, threshold = MEMCG_CHARGE_BATCH; + /* Update lruvec */ pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - memcg = pn->memcg; + + if (!memcg) + memcg = pn->memcg; /* Update memcg */ __mod_memcg_state(memcg, idx, val); - /* Update lruvec */ + if (!lruvec) + return; + __this_cpu_add(pn->lruvec_stat_local->count[idx], val); if (vmstat_item_in_bytes(idx)) @@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, /* Update memcg and lruvec */ if (!mem_cgroup_disabled()) - __mod_memcg_lruvec_state(lruvec, idx, val); + __mod_memcg_lruvec_state(NULL, lruvec, idx, val); } void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, diff --git a/mm/slab.h b/mm/slab.h index 076582f58f68..bc6c7545e487 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup *objcg, rcu_read_lock(); memcg = obj_cgroup_memcg(objcg); lruvec = mem_cgroup_lruvec(memcg, pgdat); - mod_memcg_lruvec_state(lruvec, idx, nr); + mod_memcg_lruvec_state(memcg, lruvec, idx, nr); rcu_read_unlock(); } -- 2.18.1
[PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead
With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free(). Workloads that perform a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1]. With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were: with accounting = 6.798s w/o accounting = 1.758s That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, which is a 50% reduction. It was found that a major part of the memory accounting overhead is caused by the local_irq_save()/local_irq_restore() sequences used to update the local stock charge bytes and the vmstat array, at least on x86 systems. There are two such sequences in kmem_cache_alloc() and two in kmem_cache_free(). This patchset tries to reduce the use of such sequences as much as possible. In fact, it eliminates them in the common case. Another part of this patchset caches the vmstat data update in the local stock as well, which also helps. [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u Waiman Long (5): mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state() mm/memcg: Introduce obj_cgroup_uncharge_mod_state() mm/memcg: Cache vmstat data in percpu memcg_stock_pcp mm/memcg: Separate out object stock data into its own struct mm/memcg: Optimize user context object stock access include/linux/memcontrol.h | 14 ++- mm/memcontrol.c| 198 - mm/percpu.c| 9 +- mm/slab.h | 32 +++--- 4 files changed, 195 insertions(+), 58 deletions(-) -- 2.18.1
Re: [OpenRISC] [PATCH v6 1/9] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
On 4/6/21 7:52 PM, Stafford Horne wrote: For OpenRISC I did ack the patch to convert to CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y. But I think you are right, the generic code in xchg_tail and the xchg16 emulation code produced by OpenRISC using xchg32 would produce very similar code. I have not compared instructions, but it does seem like duplicate functionality. Why doesn't RISC-V add the xchg16 emulation code similar to OpenRISC? For OpenRISC we added xchg16 and xchg8 emulation code to enable qspinlocks. So one thought is with CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32=y, can we remove our xchg16/xchg8 emulation code? For the record, the latest qspinlock code doesn't use xchg8 anymore. It still needs xchg16, though. Cheers, Longman
Re: [PATCH v4] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
On 4/6/21 5:15 AM, Peter Zijlstra wrote: On Mon, Apr 05, 2021 at 07:42:03PM -0400, Waiman Long wrote: The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other means). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. [ 7809.796281] [ 7809.796282] _raw_spin_lock_irqsave+0x32/0x40 [ 7809.796283] print_cpu+0x261/0x7c0 [ 7809.796283] sysrq_sched_debug_show+0x34/0x50 [ 7809.796284] sysrq_handle_showstate+0xc/0x20 [ 7809.796284] __handle_sysrq.cold.11+0x48/0xfb [ 7809.796285] write_sysrq_trigger+0x2b/0x30 [ 7809.796285] proc_reg_write+0x39/0x60 [ 7809.796286] vfs_write+0xa5/0x1a0 [ 7809.796286] ksys_write+0x4f/0xb0 [ 7809.796287] do_syscall_64+0x5b/0x1a0 [ 7809.796287] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 7809.796288] RIP: 0033:0x7fabe4ceb648 The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't need serialization from sched_debug_lock. The print_cpu() function has two callers - sched_debug_show() and sysrq_sched_debug_show(). So what idiot is doing sysrq and that proc file at the same time? Why is it a problem now? We got a customer bug report about a watchdog panic because a process somehow stayed inside the sched_debug_lock critical section for too long. I don't know exactly what the customer was doing, but we can reproduce the panic just by running 2 parallel "echo t > /proc/sysrq-trigger" commands. This happens on certain selected systems. I was using some Dell systems for my testing. Of course, it is not sensible to have more than one sysrq happening at the same time. However, a parallel thread accessing /proc/sched_debug is certainly possible. That invokes a similar code path.
@@ -470,16 +468,49 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group #endif #ifdef CONFIG_CGROUP_SCHED +static DEFINE_SPINLOCK(sched_debug_lock); static char group_path[PATH_MAX]; +static enum { + TOKEN_NONE, + TOKEN_ACQUIRED, + TOKEN_NA /* Not applicable */ +} console_token = TOKEN_ACQUIRED; +/* + * All the print_cpu() callers from sched_debug_show() will be allowed + * to contend for sched_debug_lock and use group_path[] as their SEQ_printf() + * calls will be much faster. However only one print_cpu() caller from + * sysrq_sched_debug_show() which outputs to the console will be allowed + * to use group_path[]. Another parallel console writer will have to use + * a shorter stack buffer instead. Since the console output will be garbled + * anyway, truncation of some cgroup paths shouldn't be a big issue. + */ +#define SEQ_printf_task_group_path(m, tg, fmt...) \ +{ \ + unsigned long flags;\ + int token = m ? TOKEN_NA\ + : xchg_acquire(&console_token, TOKEN_NONE); \ + \ + if (token == TOKEN_NONE) { \ + char buf[128]; \ + task_group_path(tg, buf, sizeof(buf)); \ + SEQ_printf(m, fmt, buf);\ + } else {\ + spin_lock_irqsave(&sched_debug_lock, flags);\ + task_group_path(tg, group_path, sizeof(group_path));\ + SEQ_printf(m, fmt, group_path); \ + spin_unlock_irqrestore(&sched_debug_lock, flags); \ + if (token == TOKEN_ACQUIRED)\ + smp_store_release(&console_token, token); \ + } \ } This is disgusting... you have an open-coded test-and-set lock like thing *AND* a spinlock, what gives? What's wrong with something simple like this?
--- diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 4b49cc2af5c4..2ac2977f3b96 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -8,8 +8,6 @@ */ #include "sched.h" -static DEFINE_SPINLOCK(sched_debug_lock); - /* * This allows printing both to /proc/sched_debug and * to the console @@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group #endif #ifdef CONFIG_CGROUP_SCHED +static DEFINE_SPINLOCK(group_path_lock); static char group_path[PATH_MAX]; static char *task_group_path(struct task_group *tg) @@ -481,6 +480,22 @@ static char *task_group_path(struct task_group *tg) return grou
Re: [PATCH v4] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
On 4/5/21 8:18 PM, Steven Rostedt wrote: On Mon, 5 Apr 2021 19:42:03 -0400 Waiman Long wrote: +/* + * All the print_cpu() callers from sched_debug_show() will be allowed + * to contend for sched_debug_lock and use group_path[] as their SEQ_printf() + * calls will be much faster. However only one print_cpu() caller from + * sysrq_sched_debug_show() which outputs to the console will be allowed + * to use group_path[]. Another parallel console writer will have to use + * a shorter stack buffer instead. Since the console output will be garbled + * anyway, truncation of some cgroup paths shouldn't be a big issue. + */ +#define SEQ_printf_task_group_path(m, tg, fmt...) \ +{ \ + unsigned long flags;\ + int token = m ? TOKEN_NA\ + : xchg_acquire(&console_token, TOKEN_NONE); \ + \ + if (token == TOKEN_NONE) { \ + char buf[128]; \ + task_group_path(tg, buf, sizeof(buf)); \ + SEQ_printf(m, fmt, buf);\ + } else {\ + spin_lock_irqsave(&sched_debug_lock, flags);\ + task_group_path(tg, group_path, sizeof(group_path));\ + SEQ_printf(m, fmt, group_path); \ + spin_unlock_irqrestore(&sched_debug_lock, flags); \ + if (token == TOKEN_ACQUIRED)\ + smp_store_release(&console_token, token); \ + } \ } #endif And you said my suggestion was complex! I'll let others review this. This patch is actually inspired by your suggestion, though it is structured differently from your approach. I really want to thank you for your valuable feedback. I realized that printing to a sequence file wasn't really a problem; only printing to the console can be problematic. That is why I decided to allow unlimited use of group_path[] for those users while only one console writer is allowed to use it. As for calling touch_nmi_watchdog(), I am still thinking about the right place to do it, but that can be done in a separate patch, if needed. Cheers, Longman
[PATCH v4] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other means). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. There should be no more than one instance of sysrq key processing via a terminal, but multiple instances of /proc/sysrq-trigger are possible. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupts disabled for the whole duration of the calls to print_*_stats() and print_rq(), which could last for quite some time if the information dump happens on the serial console. If the system has many CPUs and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actual serial console implementation of the system. For instance, [ 7809.796262] Kernel panic - not syncing: Hard LOCKUP [ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G I - - - 4.18.0-301.el8.x86_64 #1 [ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018 [ 7809.796265] Call Trace: [ 7809.796265] [ 7809.796266] dump_stack+0x5c/0x80 [ 7809.796266] panic+0xe7/0x2a9 [ 7809.796267] nmi_panic.cold.9+0xc/0xc [ 7809.796267] watchdog_overflow_callback.cold.7+0x5c/0x70 [ 7809.796268] __perf_event_overflow+0x52/0xf0 [ 7809.796268] handle_pmi_common+0x204/0x2a0 [ 7809.796269] ? __set_pte_vaddr+0x32/0x50 [ 7809.796269] ? __native_set_fixmap+0x24/0x30 [ 7809.796270] ?
ghes_copy_tofrom_phys+0xd3/0x1c0 [ 7809.796271] intel_pmu_handle_irq+0xbf/0x160 [ 7809.796271] perf_event_nmi_handler+0x2d/0x50 [ 7809.796272] nmi_handle+0x63/0x110 [ 7809.796272] default_do_nmi+0x49/0x100 [ 7809.796273] do_nmi+0x17e/0x1e0 [ 7809.796273] end_repeat_nmi+0x16/0x6f [ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 [ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002 [ 7809.796277] RAX: 0101 RBX: 0246 RCX: [ 7809.796278] RDX: RSI: RDI: 936b66d0 [ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f [ 7809.796279] R10: R11: aa54cd887cc0 R12: 907fd0a29ec0 [ 7809.796280] R13: R14: 926ab7c0 R15: [ 7809.796280] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] [ 7809.796282] _raw_spin_lock_irqsave+0x32/0x40 [ 7809.796283] print_cpu+0x261/0x7c0 [ 7809.796283] sysrq_sched_debug_show+0x34/0x50 [ 7809.796284] sysrq_handle_showstate+0xc/0x20 [ 7809.796284] __handle_sysrq.cold.11+0x48/0xfb [ 7809.796285] write_sysrq_trigger+0x2b/0x30 [ 7809.796285] proc_reg_write+0x39/0x60 [ 7809.796286] vfs_write+0xa5/0x1a0 [ 7809.796286] ksys_write+0x4f/0xb0 [ 7809.796287] do_syscall_64+0x5b/0x1a0 [ 7809.796287] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 7809.796288] RIP: 0033:0x7fabe4ceb648 The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't need serialization from sched_debug_lock. Calling printk() with interrupts disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible because of the limited size of the kernel stack. The print_cpu() function has two callers - sched_debug_show() and sysrq_sched_debug_show().
The solution implemented in this patch is to allow all sched_debug_show() callers to contend for sched_debug_lock and use the full-size group_path[], as their SEQ_printf() calls will be much faster. However, only one sysrq_sched_debug_show() caller that outputs to the slow console will be allowed to use group_path[]. Another parallel console writer will have to use a shorter stack buffer instead. Since the console output will be garbled anyway, truncation of some cgroup paths shouldn't be a big issue. Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug") Signed-off-by: Waiman Long --- kernel/sched/debug.c | 54 +--- 1 file changed, 41 insertions(+), 13 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 486f403a778b..5d021b247998 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -8,8 +8,6 @@ */ #include "sched.h" -static DEFINE_SPINLOCK(sched_debug_lock); - /* * This allows printing both to /proc/sched_debug and * to the console @@ -470,16 +468,49 @@ static void print_cfs_group_stats(st
Re: [PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
On 4/4/21 9:27 PM, Waiman Long wrote: On 4/4/21 12:02 PM, Steven Rostedt wrote: On Fri, 2 Apr 2021 23:09:09 -0400 Waiman Long wrote: The main problem with sched_debug_lock is that under certain circumstances, a lock waiter may wait a long time to acquire the lock (in seconds). We can't insert touch_nmi_watchdog() while the cpu is waiting for the spinlock. The problem I have with the patch is that it seems to be a hack (as it doesn't fix the issue in all cases). Since sched_debug_lock is "special", perhaps we can add wrappers to take it, and instead of doing the spin_lock_irqsave(), do a trylock loop. Add lockdep annotation to tell lockdep that this is not a try lock (so that it can still detect deadlocks). Then have the strategically placed touch_nmi_watchdog() also increment a counter. Then in that trylock loop, if it sees the counter get incremented, it knows that forward progress is being made by the lock holder, and it too can call touch_nmi_watchdog(). Thanks for the suggestion, but it also sounds complicated. I think we can fix this lockup problem if we are willing to lose some information in case of contention. As you have suggested, a trylock will be used to acquire sched_debug_lock. If it succeeds, all is good. Otherwise, a shorter stack buffer will be used for the cgroup path. The path may be truncated in this case. If we detect that the full length of the buffer is used, we assume truncation and add, e.g. "...", to indicate there is more to the actual path. Do you think this is an acceptable compromise? Actually, I don't really need to disable interrupts in this situation as deadlock can't happen. Cheers, Longman
Re: [PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
On 4/4/21 12:02 PM, Steven Rostedt wrote: On Fri, 2 Apr 2021 23:09:09 -0400 Waiman Long wrote: The main problem with sched_debug_lock is that under certain circumstances, a lock waiter may wait a long time to acquire the lock (in seconds). We can't insert touch_nmi_watchdog() while the cpu is waiting for the spinlock. The problem I have with the patch is that it seems to be a hack (as it doesn't fix the issue in all cases). Since sched_debug_lock is "special", perhaps we can add wrappers to take it, and instead of doing the spin_lock_irqsave(), do a trylock loop. Add lockdep annotation to tell lockdep that this is not a try lock (so that it can still detect deadlocks). Then have the strategically placed touch_nmi_watchdog() also increment a counter. Then in that trylock loop, if it sees the counter get incremented, it knows that forward progress is being made by the lock holder, and it too can call touch_nmi_watchdog(). Thanks for the suggestion, but it also sounds complicated. I think we can fix this lockup problem if we are willing to lose some information in case of contention. As you have suggested, a trylock will be used to acquire sched_debug_lock. If it succeeds, all is good. Otherwise, a shorter stack buffer will be used for the cgroup path. The path may be truncated in this case. If we detect that the full length of the buffer is used, we assume truncation and add, e.g. "...", to indicate there is more to the actual path. Do you think this is an acceptable compromise? Cheers, Longman
Re: [PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
On 4/2/21 4:40 PM, Steven Rostedt wrote: On Thu, 1 Apr 2021 14:10:30 -0400 Waiman Long wrote: The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other means). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. There should be no more than one instance of sysrq key processing via a terminal, but multiple instances of /proc/sysrq-trigger are possible. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupts disabled for the whole duration of the calls to print_*_stats() and print_rq(), which could last for quite some time if the information dump happens on the serial console. If the system has many CPUs and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actual serial console implementation of the system. For instance, Wouldn't placing strategically located "touch_nmi_watchdog()"s around fix this? -- Steve The main problem with sched_debug_lock is that under certain circumstances, a lock waiter may wait a long time to acquire the lock (in seconds). We can't insert touch_nmi_watchdog() while the cpu is waiting for the spinlock. Cheers, Longman
[PATCH v3] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other means). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. There should be no more than one instance of sysrq key processing via a terminal, but multiple instances of /proc/sysrq-trigger are possible. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupts disabled for the whole duration of the calls to print_*_stats() and print_rq(), which could last for quite some time if the information dump happens on the serial console. If the system has many CPUs and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actual serial console implementation of the system. For instance, [ 7809.796262] Kernel panic - not syncing: Hard LOCKUP [ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G I - - - 4.18.0-301.el8.x86_64 #1 [ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018 [ 7809.796265] Call Trace: [ 7809.796265] [ 7809.796266] dump_stack+0x5c/0x80 [ 7809.796266] panic+0xe7/0x2a9 [ 7809.796267] nmi_panic.cold.9+0xc/0xc [ 7809.796267] watchdog_overflow_callback.cold.7+0x5c/0x70 [ 7809.796268] __perf_event_overflow+0x52/0xf0 [ 7809.796268] handle_pmi_common+0x204/0x2a0 [ 7809.796269] ? __set_pte_vaddr+0x32/0x50 [ 7809.796269] ? __native_set_fixmap+0x24/0x30 [ 7809.796270] ?
ghes_copy_tofrom_phys+0xd3/0x1c0 [ 7809.796271] intel_pmu_handle_irq+0xbf/0x160 [ 7809.796271] perf_event_nmi_handler+0x2d/0x50 [ 7809.796272] nmi_handle+0x63/0x110 [ 7809.796272] default_do_nmi+0x49/0x100 [ 7809.796273] do_nmi+0x17e/0x1e0 [ 7809.796273] end_repeat_nmi+0x16/0x6f [ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 [ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002 [ 7809.796277] RAX: 0101 RBX: 0246 RCX: [ 7809.796278] RDX: RSI: RDI: 936b66d0 [ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f [ 7809.796279] R10: R11: aa54cd887cc0 R12: 907fd0a29ec0 [ 7809.796280] R13: R14: 926ab7c0 R15: [ 7809.796280] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] [ 7809.796282] _raw_spin_lock_irqsave+0x32/0x40 [ 7809.796283] print_cpu+0x261/0x7c0 [ 7809.796283] sysrq_sched_debug_show+0x34/0x50 [ 7809.796284] sysrq_handle_showstate+0xc/0x20 [ 7809.796284] __handle_sysrq.cold.11+0x48/0xfb [ 7809.796285] write_sysrq_trigger+0x2b/0x30 [ 7809.796285] proc_reg_write+0x39/0x60 [ 7809.796286] vfs_write+0xa5/0x1a0 [ 7809.796286] ksys_write+0x4f/0xb0 [ 7809.796287] do_syscall_64+0x5b/0x1a0 [ 7809.796287] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 7809.796288] RIP: 0033:0x7fabe4ceb648 The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't need serialization from sched_debug_lock. Calling printk() with interrupts disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible. So a compromise solution is used where a small stack buffer is allocated for the pathname.
If the actual pathname is short enough, it is copied to the stack buffer with sched_debug_lock released afterward before printk() is called. Otherwise, the global group_path[] buffer will be used with sched_debug_lock held until after printk(). This patch does not completely solve the problem, but it will greatly reduce the chance of its occurrence. To really fix it, we probably need to wait until the printk rework is done so that printk() will take much less time to execute than before. Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug") Signed-off-by: Waiman Long --- kernel/sched/debug.c | 33 + 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 486f403a778b..191ccb111cb4 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -8,8 +8,6 @@ */ #include "sched.h" -static DEFINE_SPINLOCK(sched_debug_lock); - /* * This allows printing both to /proc/sched_debug and * to the console @@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group #endif #ifdef CONFIG_CGROUP_SCH
Re: [PATCH v2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
On 3/30/21 6:42 AM, Daniel Thompson wrote: On Mon, Mar 29, 2021 at 03:32:35PM -0400, Waiman Long wrote: The handling of sysrq keys should normally be done in a user context except when MAGIC_SYSRQ_SERIAL is set and the magic sequence is typed in a serial console. This seems to be a poor summary of the typical calling context for handle_sysrq() except in the trivial case of using /proc/sysrq-trigger. For example, on my system the backtrace when I do sysrq-h on a USB keyboard shows us running from a softirq handler and with interrupts locked. Note also that the interrupt lock is present even on systems that handle keyboard input from a kthread due to the interrupt lock in report_input_key(). I will reword this part of the patch. I don't have a deep understanding of how the different ways of keyboard input work, and thanks for showing me that there are other ways of getting keyboard input. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupt disabled for the whole duration of the calls to print_*_stats() and print_rq() which could last for quite some time if the information dump happens on the serial console. If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic, like The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't need serialization from sched_debug_lock. Calling printk() with interrupt disabled can still be problematic. Allocating a stack buffer of PATH_MAX bytes is not feasible. So a compromise solution is used where a small stack buffer is allocated for the pathname. If the actual pathname is short enough, it is copied to the stack buffer with sched_debug_lock released afterward before printk() is called. Otherwise, the global group_path[] buffer will be used with sched_debug_lock held until after printk(). 
Does this actually fix the problem in any circumstance except when the sysrq is triggered using /proc/sysrq-trigger? I have a reproducer that generates a hard lockup panic when there are multiple instances of sysrq-t via /proc/sysrq-trigger. This is probably less of a problem on the console as I don't think we can do multiple simultaneous sysrq-t there. Anyway, my goal is to limit the amount of time that irq is disabled. Doing a printk can take a while depending on whether there is contention on the underlying locks or resources. Even if I limit the critical sections to just those printk() calls that output the cgroup path, I can still cause the panic. Cheers, Longman The approach used by this patch should minimize the chance of a panic happening. However, if there are many tasks with very long cgroup paths, I suppose that a panic may still happen under some extreme conditions. So I won't say this will completely fix the problem until the printk() rework that makes printk() work more like printk_deferred() is merged.
Re: [PATCH v4 3/4] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
On 3/29/21 11:13 PM, Guo Ren wrote: On Mon, Mar 29, 2021 at 8:50 PM Peter Zijlstra wrote: On Mon, Mar 29, 2021 at 08:01:41PM +0800, Guo Ren wrote: u32 a = 0x55aa66bb; u16 *ptr = &a; CPU0: xchg16(ptr, new); CPU1: while(1) WRITE_ONCE(*(ptr + 1), x); When we use lr.w/sc.w to implement xchg16, it'll cause CPU0 deadlock. Then I think your LL/SC is broken. That also means you really don't want to build super complex locking primitives on top, because that live-lock will percolate through. Do you mean the below implementation has live-lock risk? +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(&lock->val); + + for (;;) { + new = (val & _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(&lock->val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} If there is a continuous stream of incoming spinlock takers, it is possible that some cpus may have to wait a long time to set the tail right. However, this should only happen with an artificial workload. I doubt it will happen with a real workload or with a limited number of cpus. Step 1 would be to get your architecture fixed such that it can provide fwd progress guarantees for LL/SC. Otherwise there's absolutely no point in building complex systems with it. Quote Waiman's comment [1] on xchg16 optimization: "This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load." [1] https://lore.kernel.org/kvm/1429901803-29771-6-git-send-email-waiman.l...@hp.com/ So for a non-xchg16 machine: - ticket-lock for small numbers of CPUs - qspinlock for large numbers of CPUs Okay, I'll put all of them into the next patch :P It is true that qspinlock may not offer much advantage when the number of cpus is small. It shines for systems with many cpus. You may use NR_CPUS to determine if the default should be ticket or qspinlock with user override. 
To determine the right NR_CPUS threshold, you may need to run tests on real SMP RISC-V systems. Cheers, Longman
Re: [PATCH -tip] locking: Move CONFIG_LOCK_EVENT_COUNTS into Kernel hacking section
On 3/29/21 10:06 PM, Davidlohr Bueso wrote: It's a lot more intuitive to have it in the locking section of the kernel hacking part rather than under "General architecture-dependent options". Signed-off-by: Davidlohr Bueso --- arch/Kconfig | 9 - lib/Kconfig.debug | 9 + 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index ecfd3520b676..d6f9aeaaf9f2 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1113,15 +1113,6 @@ config HAVE_ARCH_PREL32_RELOCATIONS config ARCH_USE_MEMREMAP_PROT bool -config LOCK_EVENT_COUNTS - bool "Locking event counts collection" - depends on DEBUG_FS - help - Enable light-weight counting of various locking related events - in the system with minimal performance impact. This reduces - the chance of application behavior change because of timing - differences. The counts are reported via debugfs. - # Select if the architecture has support for applying RELR relocations. config ARCH_HAS_RELR bool diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 2779c29d9981..76639ff5998c 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1401,6 +1401,15 @@ config DEBUG_LOCKING_API_SELFTESTS The following locking APIs are covered: spinlocks, rwlocks, mutexes and rwsems. +config LOCK_EVENT_COUNTS + bool "Locking event counts collection" + depends on DEBUG_FS + help + Enable light-weight counting of various locking related events + in the system with minimal performance impact. This reduces + the chance of application behavior change because of timing + differences. The counts are reported via debugfs. + config LOCK_TORTURE_TEST tristate "torture tests for locking" depends on DEBUG_KERNEL It looks good to me. Acked-by: Waiman Long
Re: [PATCH v2] x86/apic/vector: Move pr_warn() out of vector_lock
On 3/29/21 8:42 AM, Thomas Gleixner wrote: Waiman, On Sun, Mar 28 2021 at 20:52, Waiman Long wrote: It was found that the following circular locking dependency warning could happen in some systems: [ 218.097878] == [ 218.097879] WARNING: possible circular locking dependency detected [ 218.097880] 4.18.0-228.el8.x86_64+debug #1 Not tainted Reports have to be against latest mainline and not against the random distro frankenkernel of the day. That's nothing new. Plus I was asking you to provide a full splat to look at so this can be discussed _upfront_. Oh well... That was the full splat that I can see except the following trailing data: [ 218.098064] RIP: 0033:0x7ff4ee7620d6 [ 218.098066] Code: 89 54 24 08 e8 9b f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f2 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 89 44 24 08 e8 c6 f4 ff ff 8b 44 [ 218.098067] RSP: 002b:7ffdda1116a0 EFLAGS: 0293 ORIG_RAX: 0101 [ 218.098069] RAX: ffda RBX: 564733953f70 RCX: 7ff4ee7620d6 [ 218.098070] RDX: 00080902 RSI: 5647339560e0 RDI: ff9c [ 218.098071] RBP: 0015 R08: R09: 004b [ 218.098072] R10: R11: 0293 R12: 00080902 [ 218.098074] R13: 5647339560e0 R14: 564733953a90 R15: 0002 [ 218.098460] irq 3: Affinity broken due to vector space exhaustion. [ 218.097914] -> #2 (_desc_lock_class){-.-.}: [ 218.097917]_raw_spin_lock_irqsave+0x48/0x81 [ 218.097918]__irq_get_desc_lock+0xcf/0x140 [ 218.097919]__dble_irq_nosync+0x6e/0x110 This function does not even exist in mainline and never existed... 
[ 218.097967] [ 218.097967] Chain exists of: [ 218.097968] console_oc_lock_class --> vector_lock [ 218.097972] [ 218.097973] Possible unsafe locking scenario: [ 218.097973] [ 218.097974]CPU0CPU1 [ 218.097975] [ 218.097975] lock(vector_lock); [ 218.097977]lock(_desc_lock_class); [ 218.097980]lock(vector_lock); [ 218.097981] lock(console_owner); [ 218.097983] [ 218.097984] *** DEADLOCK *** [ 218.097984] [ 218.097985] 6 locks held by systemd/1: [ 218.097986] #0: 88822b5cc1e8 (>legacy_mutex){+.+.}, at: tty_init_dev+0x79/0x440 [ 218.097989] #1: 88832ee00770 (>mutex){+.+.}, at: tty_port_open+0x85/0x190 [ 218.097993] #2: 88813be85a88 (>request_mutex){+.+.}, at: __setup_irq+0x249/0x1e60 [ 218.097996] #3: 88813be858c0 (_desc_lock_class){-.-.}, at: __setup_irq+0x2d9/0x1e60 [ 218.098000] #4: 84afca78 (vector_lock){-.-.}, at: x86_vector_activate+0xca/0xab0 [ 218.098003] #5: 84c27e20 (console_lock){+.+.}, at: vprintk_emit+0x13a/0x450 This is a more fundamental problem than just vector lock and the same problem exists with any other printk over serial which is nested in the interrupt activation chain not only on X86. That is true. This problem is more generic than just that. I am hoping that the new printk rewrite may address this problem. I have been waiting for a while and that work is still not upstream yet. So what is your current timeline for that? If that will happen soon, I probably don't need this patch. I sent this patch out as I am uncertain about it. -static int activate_reserved(struct irq_data *irqd) +static int activate_reserved(struct irq_data *irqd, char *wbuf, size_t wsize) { ... 
if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd), irq_data_get_affinity_mask(irqd))) { - pr_warn("irq %u: Affinity broken due to vector space exhaustion.\n", - irqd->irq); + snprintf(wbuf, wsize, KERN_WARNING +"irq %u: Affinity broken due to vector space exhaustion.\n", +irqd->irq); This is not really any more tasteful than the previous one and it does not fix the fundamental underlying problem. But, because I'm curious and printk is a constant source of trouble, I just added unconditional pr_warns into those functions under vector_lock on 5.12-rc5. Still waiting for the lockdep splat to show up while enjoying the trickle of printks over serial. If you really think this is an upstream problem then please provide a corresponding lockdep splat on plain 5.12-rc5 along with a .config and the scenario which triggers this. Not less, not more. I will try to reproduce this problem with an upstream kernel. Thanks, Longman
[PATCH v2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
The handling of sysrq keys should normally be done in a user context except when MAGIC_SYSRQ_SERIAL is set and the magic sequence is typed in a serial console. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupt disabled for the whole duration of the calls to print_*_stats() and print_rq() which could last for quite some time if the information dump happens on the serial console. If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic, like [ 7809.796262] Kernel panic - not syncing: Hard LOCKUP [ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G I - - - 4.18.0-301.el8.x86_64 #1 [ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018 [ 7809.796265] Call Trace: [ 7809.796265] [ 7809.796266] dump_stack+0x5c/0x80 [ 7809.796266] panic+0xe7/0x2a9 [ 7809.796267] nmi_panic.cold.9+0xc/0xc [ 7809.796267] watchdog_overflow_callback.cold.7+0x5c/0x70 [ 7809.796268] __perf_event_overflow+0x52/0xf0 [ 7809.796268] handle_pmi_common+0x204/0x2a0 [ 7809.796269] ? __set_pte_vaddr+0x32/0x50 [ 7809.796269] ? __native_set_fixmap+0x24/0x30 [ 7809.796270] ? 
ghes_copy_tofrom_phys+0xd3/0x1c0 [ 7809.796271] intel_pmu_handle_irq+0xbf/0x160 [ 7809.796271] perf_event_nmi_handler+0x2d/0x50 [ 7809.796272] nmi_handle+0x63/0x110 [ 7809.796272] default_do_nmi+0x49/0x100 [ 7809.796273] do_nmi+0x17e/0x1e0 [ 7809.796273] end_repeat_nmi+0x16/0x6f [ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 [ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002 [ 7809.796277] RAX: 0101 RBX: 0246 RCX: [ 7809.796278] RDX: RSI: RDI: 936b66d0 [ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f [ 7809.796279] R10: R11: aa54cd887cc0 R12: 907fd0a29ec0 [ 7809.796280] R13: R14: 926ab7c0 R15: [ 7809.796280] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] [ 7809.796282] _raw_spin_lock_irqsave+0x32/0x40 [ 7809.796283] print_cpu+0x261/0x7c0 [ 7809.796283] sysrq_sched_debug_show+0x34/0x50 [ 7809.796284] sysrq_handle_showstate+0xc/0x20 [ 7809.796284] __handle_sysrq.cold.11+0x48/0xfb [ 7809.796285] write_sysrq_trigger+0x2b/0x30 [ 7809.796285] proc_reg_write+0x39/0x60 [ 7809.796286] vfs_write+0xa5/0x1a0 [ 7809.796286] ksys_write+0x4f/0xb0 [ 7809.796287] do_syscall_64+0x5b/0x1a0 [ 7809.796287] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 7809.796288] RIP: 0033:0x7fabe4ceb648 The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't need serialization from sched_debug_lock. Calling printk() with interrupt disabled can still be problematic. Allocating a stack buffer of PATH_MAX bytes is not feasible. So a compromise solution is used where a small stack buffer is allocated for the pathname. 
If the actual pathname is short enough, it is copied to the stack buffer with sched_debug_lock released afterward before printk() is called. Otherwise, the global group_path[] buffer will be used with sched_debug_lock held until after printk(). Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug") Signed-off-by: Waiman Long --- kernel/sched/debug.c | 33 + 1 file changed, 25 insertions(+), 8 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 486f403a778b..191ccb111cb4 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -8,8 +8,6 @@ */ #include "sched.h" -static DEFINE_SPINLOCK(sched_debug_lock); - /* * This allows printing both to /proc/sched_debug and * to the console @@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group #endif #ifdef CONFIG_CGROUP_SCHED +static DEFINE_SPINLOCK(sched_debug_lock); static char group_path[PATH_MAX]; static char *task_group_path(struct task_group *tg) @@ -481,6 +480,27 @@ static char *task_group_path(struct task_group *tg) return group_path; } + +#define SEQ_printf_task_group_path(m, tg, fmt...) \ +{ \ + unsigned long flags;\ + char *path, buf[128];
Re: [PATCH 1/2] sched/debug: Don't disable IRQ when acquiring sched_debug_lock
On 3/29/21 6:23 AM, Daniel Thompson wrote: On Sat, Mar 27, 2021 at 07:25:28PM -0400, Waiman Long wrote: The sched_debug_lock was used only in print_cpu(). The print_cpu() function has two callers - sched_debug_show() and sysrq_sched_debug_show(). Both of them are invoked by user action (sched_debug file and sysrq-t). As print_cpu() won't be called from interrupt context at all, there is no point in disabling IRQ when acquiring sched_debug_lock. This looks like it introduces a deadlock risk if sysrq-t triggers from an interrupt context. Has the behaviour of sysrq changed recently or will tools like MAGIC_SYSRQ_SERIAL still trigger from interrupt context? You are right. It looks like, if CONFIG_MAGIC_SYSRQ_SERIAL is set, it is possible for the sysrq information dump to happen in an interrupt context. I missed that in my initial analysis. However, doing the time-consuming info dump with interrupt disabled for an extended period of time is not a good idea. Still, my second patch, which minimizes the size of the critical sections, will greatly reduce the chance of causing trouble except when doing it directly from the console. Thanks, Longman
[PATCH v2] x86/apic/vector: Move pr_warn() out of vector_lock
.097981] lock(console_owner); [ 218.097983] [ 218.097984] *** DEADLOCK *** [ 218.097984] [ 218.097985] 6 locks held by systemd/1: [ 218.097986] #0: 88822b5cc1e8 (>legacy_mutex){+.+.}, at: tty_init_dev+0x79/0x440 [ 218.097989] #1: 88832ee00770 (>mutex){+.+.}, at: tty_port_open+0x85/0x190 [ 218.097993] #2: 88813be85a88 (>request_mutex){+.+.}, at: __setup_irq+0x249/0x1e60 [ 218.097996] #3: 88813be858c0 (_desc_lock_class){-.-.}, at: __setup_irq+0x2d9/0x1e60 [ 218.098000] #4: 84afca78 (vector_lock){-.-.}, at: x86_vector_activate+0xca/0xab0 [ 218.098003] #5: 84c27e20 (console_lock){+.+.}, at: vprintk_emit+0x13a/0x450 [ 218.098007] [ 218.098007] stack backtrace: [ 218.098009] CPU: 1 PID: 1 Comm: systemdt tainted 4.18.0-228.el8.x86_64+debug #1 [ 218.098010] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 03/07/2016 [ 218.098010] Call Trace: [ 218.098011] dump_stack+0x9a/0xf0 [ 218.098012] check_noncircular+0x317/0x3c0 [ 218.098013] ? print_circular_bug+0x1e0/0x1e0 [ 218.098014] ? do_profile_hits.isra.4.cold.9+0x2d/0x2d [ 218.098015] ? sched_clock+0x5/0x10 [ 218.098015] __lock_acquire+0x2894/0x7300 [ 218.098016] ? trace_hardirqs_on+0x10/0x10 [ 218.098017] ? lock_downgrade+0x6f0/0x6f0 [ 218.098018] lock_acquire+0x14f/0x3b0 [ 218.098019] ? console_unlock+0x3fb/0x9f0 [ 218.098019] console_unlock+0x45d/0x9f0 [ 218.098020] ? console_unlock+0x3fb/0x9f0 [ 218.098021] vprintk_emit+0x147/0x450 [ 218.098022] ? apic_update_irq_cfg+0x3f6/0x520 [ 218.098023] printk+0x9f/0xc5 [ 218.098023] ? kmsg_dump_rewind_nolock+0xd9/0xd9 [ 218.098024] x86_vector_activate+0x3f8/0xab0 [ 218.098025] ? find_held_lock+0x3a/0x1c0 [ 218.098026] __irq_domain_activate_irq+0xdb/0x160 [ 218.098027] __irq_domain_activate_irq+0x7d/0x160 [ 218.098028] __irq_domain_activate_irq+0x7d/0x160 [ 218.098029] irq_domain_activate_irq+0x95/0x130 [ 218.098029] ? __init_waitqueue_head+0x3a/0x90 [ 218.098030] __setup_irq+0x861/0x1e60 [ 218.098031] ? kmem_cache_alloc_trace+0x169/0x360 [ 218.098032] ? 
request_threaded_irq+0xf9/0x2c0 [ 218.098033] request_threaded_irq+0x1e7/0x2c0 [ 218.098033] univ8250_setup_irq+0x6ef/0x8b0 [ 218.50_do_startup+0x7d4/0x1ed0 [ 218.098035] uart_startup.part.7+0x16f/0x4b0 [ 218.098036] tty_port_open+0x112/0x190 [ 218.098037] ? _raw_spin_unlock+0x1f/0x30 [ 218.098037] uart_open+0xe0/0x150 [ 218.098038] tty_open+0x229/0x9a0 [ 218.098039] ? lock_acquire+0x14f/0x3b0 [ 218.098040] ? tty_init_dev+0x440/0x440 [ 218.098041] ? selinux_file_open+0x2cf/0x400 [ 218.098041] chrdev_open+0x1e0/0x4e0 [ 218.098042] ? cdev_put.part.0+0x50/0x50 [ 218.098043] do_dentry_open+0x3d9/0xea0 [ 218.098044] ? cdev_put.part.0+0x50/0x50 [ 218.098045] ? devcgroup_check_permission+0x17c/0x2f0 [ 218.098046] ? __x64_sys_fchdir+0x180/0x180 [ 218.098046] ? security_inode_permission+0x79/0xc0 [ 218.098047] path_openat+0xb73/0x2690 [ 218.098048] ? save_stack+0x81/0xb0 [ 218.098049] ? path_lookupat+0x820/0x820 [ 218.098049] ? sched_clock+0x5/0x10 [ 218.098050] ? sched_clock_cpu+0x18/0x1e0 [ 218.098051] ? do_syscall_64+0xa5/0x4d0 [ 218.098052] ? entry_SYSCALL_64_after_hwframe+0x6a/0xdf [ 218.098053] ? __lock_acquire+0x646/0x7300 [ 218.098054] ? sched_clock_cpu+0x18/0x1e0 [ 218.098054] ? find_held_lock+0x3a/0x1c0 filp_open+0x17c/0x250 [ 218.098056] ? may_open_dev+0xc0/0xc0 [ 218.098057] ? do_raw_spin_unlock+0x54/0x230 [ 218.098058] ? _raw_spin_unlock+0x1f/0x30 [ 218.098058] ? _raw_spin_unlock_irq+0x24/0x40 [ 218.098059] do_sys_open+0x1d9/0x360 [ 218.098060] ? filp_open+0x50/0x50 [ 218.098061] ? entry_SYSCALL_64_after_hwframe+0x7a/0xdf [ 218.098062] ? do_syscall_64+0x22/0x4d0 [ 218.098062] do_syscall_64+0xa5/0x4d0 [ 218.098063] entry_SYSCALL_64_after_hwframe+0x6a/0xdf [ 218.098064] RIP: 0033:0x7ff4ee7620d6 This lockdep warning was caused by the printing of the warning message: [ 218.095152] irq 3: Affinity broken due to vector space exhaustion. 
To avoid this potential deadlock scenario, this patch moves all the pr_warn() calls in the vector.c file out of the vector_lock critical sections. Signed-off-by: Waiman Long --- arch/x86/kernel/apic/vector.c | 35 +-- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 3c9c7492252f..79f1ce3a9539 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -385,7 +385,7 @@ static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd) raw_spin_unlock_irqrestore(_lock, flags); } -static int activate_reserved(struct irq_data *irqd) +static int activate_reserved(struct irq_data *irqd, char *wbuf, size_t wsize) { struct apic_chip_data *apicd = apic_chip_data(irqd); int ret; @@ -410,8 +410,9 @@ static int activate_reserved(struct irq_data *irqd) */
Re: [PATCH] x86/apic/vector: Move pr_warn() outside of vector_lock
On 3/28/21 6:04 PM, Thomas Gleixner wrote: Waiman, On Sun, Mar 28 2021 at 15:58, Waiman Long wrote: It was found that the following circular locking dependency warning could happen in some systems: [ 218.097878] == [ 218.097879] WARNING: possible circular locking dependency detected [ 218.097880] 4.18.0-228.el8.x86_64+debug #1 Not tainted [ 218.097881] -- [ 218.097882] systemd/1 is trying to acquire lock: [ 218.097883] 84c27920 (console_owner){-.-.}, at: console_unlock+0x3fb/0x9f0 [ 218.097886] [ 218.097887] but task is already holding lock: [ 218.097888] 84afca78 (vector_lock){-.-.}, at: x86_vector_activate+0xca/0xab0 [ 218.097891] [ 218.097892] which lock already depends on the new lock. : [ 218.097966] other info that might help us debug this: [ 218.097967] [ 218.097967] Chain exists of: [ 218.097968] console_oc_lock_class --> vector_lock [ 218.097972] [ 218.097973] Possible unsafe locking scenario: [ 218.097973] [ 218.097974]CPU0CPU1 [ 218.097975] [ 218.097975] lock(vector_lock); [ 218.097977]lock(_desc_lock_class); [ 218.097980]lock(vector_lock); [ 218.097981] lock(console_owner); [ 218.097983] [ 218.097984] *** DEADLOCK *** can you please post the full lockdep output? Will do. This lockdep warning was caused by the printing of the warning message: [ 218.095152] irq 3: Affinity broken due to vector space exhaustion. It looks like this warning message is relatively more common than the other warnings in arch/x86/kernel/apic/vector.c. To avoid this potential deadlock scenario, this patch moves all the pr_warn() calls in the vector.c file outside of the vector_lock critical sections. Definitely not. 
-static int activate_reserved(struct irq_data *irqd) +static int activate_reserved(struct irq_data *irqd, unsigned long flags, +bool *unlocked) { struct apic_chip_data *apicd = apic_chip_data(irqd); int ret; @@ -410,6 +411,8 @@ static int activate_reserved(struct irq_data *irqd) */ if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd), irq_data_get_affinity_mask(irqd))) { + raw_spin_unlock_irqrestore(&vector_lock, flags); + *unlocked = true; What? pr_warn("irq %u: Affinity broken due to vector space exhaustion.\n", irqd->irq); } @@ -446,6 +449,7 @@ static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd, { struct apic_chip_data *apicd = apic_chip_data(irqd); unsigned long flags; + bool unlocked = false; int ret = 0; trace_vector_activate(irqd->irq, apicd->is_managed, @@ -459,8 +463,9 @@ static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd, else if (apicd->is_managed) ret = activate_managed(irqd); else if (apicd->has_reserved) - ret = activate_reserved(irqd); - raw_spin_unlock_irqrestore(&vector_lock, flags); + ret = activate_reserved(irqd, flags, &unlocked); + if (!unlocked) + raw_spin_unlock_irqrestore(&vector_lock, flags); Even moar what? return ret; } This turns that code into complete unreadable gunk. No way. I am sorry that this part of the patch is sloppy. I will revise it to make it better. Cheers, Longman
[PATCH] x86/apic/vector: Move pr_warn() outside of vector_lock
It was found that the following circular locking dependency warning could happen in some systems: [ 218.097878] == [ 218.097879] WARNING: possible circular locking dependency detected [ 218.097880] 4.18.0-228.el8.x86_64+debug #1 Not tainted [ 218.097881] -- [ 218.097882] systemd/1 is trying to acquire lock: [ 218.097883] 84c27920 (console_owner){-.-.}, at: console_unlock+0x3fb/0x9f0 [ 218.097886] [ 218.097887] but task is already holding lock: [ 218.097888] 84afca78 (vector_lock){-.-.}, at: x86_vector_activate+0xca/0xab0 [ 218.097891] [ 218.097892] which lock already depends on the new lock. : [ 218.097966] other info that might help us debug this: [ 218.097967] [ 218.097967] Chain exists of: [ 218.097968] console_oc_lock_class --> vector_lock [ 218.097972] [ 218.097973] Possible unsafe locking scenario: [ 218.097973] [ 218.097974]CPU0CPU1 [ 218.097975] [ 218.097975] lock(vector_lock); [ 218.097977]lock(_desc_lock_class); [ 218.097980]lock(vector_lock); [ 218.097981] lock(console_owner); [ 218.097983] [ 218.097984] *** DEADLOCK *** This lockdep warning was caused by the printing of the warning message: [ 218.095152] irq 3: Affinity broken due to vector space exhaustion. It looks like this warning message is relatively more common than the other warnings in arch/x86/kernel/apic/vector.c. To avoid this potential deadlock scenario, this patch moves all the pr_warn() calls in the vector.c file outside of the vector_lock critical sections. 
Signed-off-by: Waiman Long --- arch/x86/kernel/apic/vector.c | 33 - 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 3c9c7492252f..948ec592ef61 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -385,7 +385,8 @@ static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd) raw_spin_unlock_irqrestore(&vector_lock, flags); } -static int activate_reserved(struct irq_data *irqd) +static int activate_reserved(struct irq_data *irqd, unsigned long flags, +bool *unlocked) { struct apic_chip_data *apicd = apic_chip_data(irqd); int ret; @@ -410,6 +411,8 @@ static int activate_reserved(struct irq_data *irqd) */ if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd), irq_data_get_affinity_mask(irqd))) { + raw_spin_unlock_irqrestore(&vector_lock, flags); + *unlocked = true; pr_warn("irq %u: Affinity broken due to vector space exhaustion.\n", irqd->irq); } @@ -446,6 +449,7 @@ static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd, { struct apic_chip_data *apicd = apic_chip_data(irqd); unsigned long flags; + bool unlocked = false; int ret = 0; trace_vector_activate(irqd->irq, apicd->is_managed, @@ -459,8 +463,9 @@ static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd, else if (apicd->is_managed) ret = activate_managed(irqd); else if (apicd->has_reserved) - ret = activate_reserved(irqd); - raw_spin_unlock_irqrestore(&vector_lock, flags); + ret = activate_reserved(irqd, flags, &unlocked); + if (!unlocked) + raw_spin_unlock_irqrestore(&vector_lock, flags); return ret; } @@ -989,6 +994,7 @@ void irq_force_complete_move(struct irq_desc *desc) struct apic_chip_data *apicd; struct irq_data *irqd; unsigned int vector; + bool warn_move_in_progress = false; /* * The function is called for all descriptors regardless of which @@ -1064,12 +1070,14 @@ void irq_force_complete_move(struct irq_desc *desc) * so we have the necessary information when a 
problem in that * area arises. */ - pr_warn("IRQ fixup: irq %d move in progress, old vector %d\n", - irqd->irq, vector); + warn_move_in_progress = true; } free_moved_vector(apicd); unlock: raw_spin_unlock(&vector_lock); + if (warn_move_in_progress) + pr_warn("IRQ fixup: irq %d move in progress, old vector %d\n", + irqd->irq, vector); } #ifdef CONFIG_HOTPLUG_CPU @@ -1079,25 +1087,24 @@ void irq_force_complete_move(struct irq_desc *desc) */ int lapic_can_unplug_cpu(void) { - unsigned int rsvd, avl, tomove, cpu = smp_processor_id(); + unsigned int rsvd = 0, avl, tomove, cpu = smp_processor_id
[PATCH 1/2] sched/debug: Don't disable IRQ when acquiring sched_debug_lock
The sched_debug_lock was used only in print_cpu(). The print_cpu() function has two callers - sched_debug_show() and sysrq_sched_debug_show(). Both of them are invoked by user action (sched_debug file and sysrq-t). As print_cpu() won't be called from interrupt context at all, there is no point in disabling IRQ when acquiring sched_debug_lock. Besides, if the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic, like [ 7809.796262] Kernel panic - not syncing: Hard LOCKUP [ 7809.796264] CPU: 13 PID: 79867 Comm: reproducer.sh Kdump: loaded Tainted: G I - - - 4.18.0-301.el8.x86_64 #1 [ 7809.796264] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 06/29/2018 [ 7809.796265] Call Trace: [ 7809.796265] [ 7809.796266] dump_stack+0x5c/0x80 [ 7809.796266] panic+0xe7/0x2a9 [ 7809.796267] nmi_panic.cold.9+0xc/0xc [ 7809.796267] watchdog_overflow_callback.cold.7+0x5c/0x70 [ 7809.796268] __perf_event_overflow+0x52/0xf0 [ 7809.796268] handle_pmi_common+0x204/0x2a0 [ 7809.796269] ? __set_pte_vaddr+0x32/0x50 [ 7809.796269] ? __native_set_fixmap+0x24/0x30 [ 7809.796270] ? 
ghes_copy_tofrom_phys+0xd3/0x1c0 [ 7809.796271] intel_pmu_handle_irq+0xbf/0x160 [ 7809.796271] perf_event_nmi_handler+0x2d/0x50 [ 7809.796272] nmi_handle+0x63/0x110 [ 7809.796272] default_do_nmi+0x49/0x100 [ 7809.796273] do_nmi+0x17e/0x1e0 [ 7809.796273] end_repeat_nmi+0x16/0x6f [ 7809.796274] RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796275] Code: 6d f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 [ 7809.796276] RSP: 0018:aa54cd887df8 EFLAGS: 0002 [ 7809.796277] RAX: 0101 RBX: 0246 RCX: [ 7809.796278] RDX: RSI: RDI: 936b66d0 [ 7809.796278] RBP: 9301fb40 R08: 0004 R09: 004f [ 7809.796279] R10: R11: aa54cd887cc0 R12: 907fd0a29ec0 [ 7809.796280] R13: R14: 926ab7c0 R15: [ 7809.796280] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] ? native_queued_spin_lock_slowpath+0x5b/0x1d0 [ 7809.796281] [ 7809.796282] _raw_spin_lock_irqsave+0x32/0x40 [ 7809.796283] print_cpu+0x261/0x7c0 [ 7809.796283] sysrq_sched_debug_show+0x34/0x50 [ 7809.796284] sysrq_handle_showstate+0xc/0x20 [ 7809.796284] __handle_sysrq.cold.11+0x48/0xfb [ 7809.796285] write_sysrq_trigger+0x2b/0x30 [ 7809.796285] proc_reg_write+0x39/0x60 [ 7809.796286] vfs_write+0xa5/0x1a0 [ 7809.796286] ksys_write+0x4f/0xb0 [ 7809.796287] do_syscall_64+0x5b/0x1a0 [ 7809.796287] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 7809.796288] RIP: 0033:0x7fabe4ceb648 It is not a good idea to disable IRQ for so long that a hard lockup panic is triggered. Fix that by not disabling IRQ when taking sched_debug_lock. 
Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug")
Signed-off-by: Waiman Long
---
 kernel/sched/debug.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 486f403a778b..c4ae8a0853a1 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -666,7 +666,6 @@ void print_dl_rq(struct seq_file *m, int cpu, struct dl_rq *dl_rq)
 static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long flags;
 
 #ifdef CONFIG_X86
 	{
@@ -717,13 +716,13 @@ do {								\
 	}
 #undef P
 
-	spin_lock_irqsave(&sched_debug_lock, flags);
+	spin_lock(&sched_debug_lock);
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
 	print_dl_stats(m, cpu);
 	print_rq(m, rq, cpu);
-	spin_unlock_irqrestore(&sched_debug_lock, flags);
+	spin_unlock(&sched_debug_lock);
 	SEQ_printf(m, "\n");
 }
-- 
2.18.1
[PATCH 2/2] sched/debug: Use sched_debug_lock to serialize use of cgroup_path[] only
The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printf calls don't need serialization from sched_debug_lock. Printing the sched debug data to the console can take quite a while; taking sched_debug_lock at the print_cpu() level unnecessarily blocks the progress of other print_cpu() users. Fix that by holding sched_debug_lock only when using cgroup_path[] via task_group_path().

Signed-off-by: Waiman Long
---
 kernel/sched/debug.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c4ae8a0853a1..87168d13714f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -8,8 +8,6 @@
  */
 #include "sched.h"
 
-static DEFINE_SPINLOCK(sched_debug_lock);
-
 /*
  * This allows printing both to /proc/sched_debug and
  * to the console
@@ -470,6 +468,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #endif
 
 #ifdef CONFIG_CGROUP_SCHED
+static DEFINE_SPINLOCK(sched_debug_lock);
 static char group_path[PATH_MAX];
 
 static char *task_group_path(struct task_group *tg)
@@ -506,7 +505,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
 #endif
 #ifdef CONFIG_CGROUP_SCHED
+	spin_lock(&sched_debug_lock);
 	SEQ_printf(m, " %s", task_group_path(task_group(p)));
+	spin_unlock(&sched_debug_lock);
 #endif
 
 	SEQ_printf(m, "\n");
@@ -543,7 +544,9 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "\n");
+	spin_lock(&sched_debug_lock);
 	SEQ_printf(m, "cfs_rq[%d]:%s\n", cpu, task_group_path(cfs_rq->tg));
+	spin_unlock(&sched_debug_lock);
 #else
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, "cfs_rq[%d]:\n", cpu);
@@ -614,7 +617,9 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	SEQ_printf(m, "\n");
+	spin_lock(&sched_debug_lock);
 	SEQ_printf(m, "rt_rq[%d]:%s\n", cpu, task_group_path(rt_rq->tg));
+	spin_unlock(&sched_debug_lock);
 #else
 	SEQ_printf(m, "\n");
 	SEQ_printf(m, "rt_rq[%d]:\n", cpu);
@@ -716,13 +721,11 @@ do {								\
 	}
 #undef P
 
-	spin_lock(&sched_debug_lock);
 	print_cfs_stats(m, cpu);
 	print_rt_stats(m, cpu);
 	print_dl_stats(m, cpu);
 	print_rq(m, rq, cpu);
-	spin_unlock(&sched_debug_lock);
 	SEQ_printf(m, "\n");
 }
-- 
2.18.1
Re: [PATCH v4 3/4] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32
On 3/27/21 2:06 PM, guo...@kernel.org wrote:
From: Guo Ren

Some architectures don't have a sub-word swap atomic instruction; they only have the full-word one. The sub-word swap only improves performance when NR_CPUS < 16K:

 *  0- 7: locked byte
 *     8: pending
 *  9-15: not used
 * 16-17: tail index
 * 18-31: tail cpu (+1)

The 9-15 bits are wasted when using xchg16 in xchg_tail. Please let the architecture select xchg16/xchg32 to implement xchg_tail.

Signed-off-by: Guo Ren
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: Ingo Molnar
Cc: Waiman Long
Cc: Arnd Bergmann
Cc: Anup Patel
---
 kernel/Kconfig.locks | 3 +++
 kernel/locking/qspinlock.c | 44 +-
 2 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 3de8fd11873b..d02f1261f73f 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -239,6 +239,9 @@ config LOCK_SPIN_ON_OWNER
 config ARCH_USE_QUEUED_SPINLOCKS
 	bool
 
+config ARCH_USE_QUEUED_SPINLOCKS_XCHG32
+	bool
+
 config QUEUED_SPINLOCKS
 	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
 	depends on SMP
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..54de0632c6a8 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -163,26 +163,6 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 	WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
 }
 
-/*
- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail), which heads an address dependency
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
-	/*
-	 * We can use relaxed semantics since the caller ensures that the
-	 * MCS node is properly initialized before updating the tail.
-	 */
-	return (u32)xchg_relaxed(&lock->tail,
-				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
-}
-
 #else /* _Q_PENDING_BITS == 8 */
 
 /**
@@ -206,6 +186,30 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 {
 	atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
 }
+#endif
+
+#if _Q_PENDING_BITS == 8 && !defined(CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32)
+/*
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queued spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail), which heads an address dependency
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+	/*
+	 * We can use relaxed semantics since the caller ensures that the
+	 * MCS node is properly initialized before updating the tail.
+	 */
+	return (u32)xchg_relaxed(&lock->tail,
+				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else
 
 /**
  * xchg_tail - Put in the new queue tail code word & retrieve previous one

I don't have any problem adding a CONFIG_ARCH_USE_QUEUED_SPINLOCKS_XCHG32 config option to control that. One minor nit:

  #endif /* _Q_PENDING_BITS == 8 */

You should probably remove the comment at the trailing end of the corresponding "#endif" as it is now wrong.

Cheers,
Longman
Re: [PATCH] locking/mutex: Remove repeated declaration
On 3/22/21 9:09 PM, Shaokun Zhang wrote:
Commit 0cd39f4600ed ("locking/seqlock, headers: Untangle the spaghetti monster") introduces 'struct ww_acquire_ctx' again; remove the repeated declaration.

Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Signed-off-by: Shaokun Zhang
---
 include/linux/mutex.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 0cd631a19727..d80c0e22c822 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -20,8 +20,6 @@
 #include
 #include
 
-struct ww_acquire_ctx;
-
 /*
  * Simple, straightforward mutexes with strict semantics:
  *

Yes, it is duplicated.

Acked-by: Waiman Long
Re: [PATCH] lockdep: add a hint for "INFO: trying to register non-static key." message
On 3/21/21 2:49 AM, Tetsuo Handa wrote:
Since this message is printed when dynamically allocated spinlocks (e.g. via kzalloc()) are used without initialization (e.g. spin_lock_init()), suggest that developers check whether the initialization functions for their objects are called, before they start wondering what annotation is missing.

Signed-off-by: Tetsuo Handa
---
 kernel/locking/lockdep.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index c6d0c1dc6253..44c549f5c061 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -931,6 +931,7 @@ static bool assign_lock_key(struct lockdep_map *lock)
 		debug_locks_off();
 		pr_err("INFO: trying to register non-static key.\n");
 		pr_err("the code is fine but needs lockdep annotation.\n");
+		pr_err("maybe you didn't initialize this object before you use?\n");
 		pr_err("turning off the locking correctness validator.\n");
 		dump_stack();
 		return false;

The only way this message is written is when the appropriate lock init function isn't called for locks in dynamically allocated objects. I think you can just say so without the word "maybe", like "the code is fine but needs lockdep annotation by calling the appropriate lock initialization function.".

Cheers,
Longman