[tip: sched/core] sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825
Gitweb:        https://git.kernel.org/tip/ce29ddc47b91f97e7f69a0fb7cbb5845f52a9825
Author:        Mathieu Desnoyers
AuthorDate:    Wed, 17 Feb 2021 11:56:51 -05:00
Committer:     Ingo Molnar
CommitterDate: Sat, 06 Mar 2021 12:40:21 +01:00

sched/membarrier: fix missing local execution of ipi_sync_rq_state()

The function sync_runqueues_membarrier_state() should copy the
membarrier state from the @mm received as parameter to each runqueue
currently running tasks using that mm.

However, the use of smp_call_function_many() skips the current runqueue,
which is unintended. Replace by a call to on_each_cpu_mask().

Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Reported-by: Nadav Amit
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Cc: sta...@vger.kernel.org # 5.4.x+
Link: https://lore.kernel.org/r/74f1e842-4a84-47bf-b6c2-5407dfdd4...@gmail.com
---
 kernel/sched/membarrier.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index acdae62..b5add64 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
         }
         rcu_read_unlock();

-        preempt_disable();
-        smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
-        preempt_enable();
+        on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);

         free_cpumask_var(tmpmask);
         cpus_read_unlock();
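The behavioural difference this fix relies on can be illustrated with a small userspace model. This is only a sketch: the `model_*` names, the four-CPU bitmask, and the per-runqueue state array are hypothetical stand-ins, not kernel code. It assumes only what the commit message states, namely that smp_call_function_many() skips the calling CPU while on_each_cpu_mask() also runs the function locally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, one state word per runqueue. */
#define MODEL_NR_CPUS 4

struct model_rq {
    int membarrier_state;
};

struct model_rq model_rqs[MODEL_NR_CPUS];

/* Stand-in for ipi_sync_rq_state(): copy the mm state into one runqueue. */
void model_sync_rq_state(int cpu, int mm_state)
{
    model_rqs[cpu].membarrier_state = mm_state;
}

/* Models smp_call_function_many(): every CPU in the mask EXCEPT the caller. */
void model_call_function_many(uint32_t mask, int this_cpu, int mm_state)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if ((mask & (1u << cpu)) && cpu != this_cpu)
            model_sync_rq_state(cpu, mm_state);
}

/* Models on_each_cpu_mask(): every CPU in the mask, caller included. */
void model_on_each_cpu_mask(uint32_t mask, int mm_state)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if (mask & (1u << cpu))
            model_sync_rq_state(cpu, mm_state);
}

/* Clear all modelled runqueue states. */
void model_reset(void)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        model_rqs[cpu].membarrier_state = 0;
}
```

With the caller on CPU 0 and a mask covering CPUs 0 and 2, the first helper leaves the caller's own runqueue stale, which is the missing local execution the patch addresses; the second helper updates both.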
[tip: sched/urgent] sched/membarrier: fix missing local execution of ipi_sync_rq_state()
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     fba111913e51a934eaad85734254eab801343836
Gitweb:        https://git.kernel.org/tip/fba111913e51a934eaad85734254eab801343836
Author:        Mathieu Desnoyers
AuthorDate:    Wed, 17 Feb 2021 11:56:51 -05:00
Committer:     Peter Zijlstra
CommitterDate: Mon, 01 Mar 2021 11:02:15 +01:00

sched/membarrier: fix missing local execution of ipi_sync_rq_state()

The function sync_runqueues_membarrier_state() should copy the
membarrier state from the @mm received as parameter to each runqueue
currently running tasks using that mm.

However, the use of smp_call_function_many() skips the current runqueue,
which is unintended. Replace by a call to on_each_cpu_mask().

Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Reported-by: Nadav Amit
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: sta...@vger.kernel.org # 5.4.x+
Link: https://lore.kernel.org/r/74f1e842-4a84-47bf-b6c2-5407dfdd4...@gmail.com
---
 kernel/sched/membarrier.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index acdae62..b5add64 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -471,9 +471,7 @@ static int sync_runqueues_membarrier_state(struct mm_struct *mm)
         }
         rcu_read_unlock();

-        preempt_disable();
-        smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, 1);
-        preempt_enable();
+        on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);

         free_cpumask_var(tmpmask);
         cpus_read_unlock();
[tip: sched/core] sched: fix exit_mm vs membarrier (v4)
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5bc78502322a5e4eef3f1b2a2813751dc6434143
Gitweb:        https://git.kernel.org/tip/5bc78502322a5e4eef3f1b2a2813751dc6434143
Author:        Mathieu Desnoyers
AuthorDate:    Tue, 20 Oct 2020 09:47:13 -04:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 29 Oct 2020 11:00:30 +01:00

sched: fix exit_mm vs membarrier (v4)

exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoy...@efficios.com
---
 include/linux/sched/mm.h  |  5 +
 kernel/exit.c             | 16 +++-
 kernel/sched/membarrier.c | 12
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index d5ece7a..a91fb3a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -347,6 +347,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)

 extern void membarrier_exec_mmap(struct mm_struct *mm);

+extern void membarrier_update_current_mm(struct mm_struct *next_mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS
 static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
@@ -361,6 +363,9 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm)
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
 }
+static inline void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+}
 #endif

 #endif /* _LINUX_SCHED_MM_H */
diff --git a/kernel/exit.c b/kernel/exit.c
index 87a2d51..a3dd6b3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -475,10 +475,24 @@ static void exit_mm(void)
         BUG_ON(mm != current->active_mm);
         /* more a memory barrier than a real lock */
         task_lock(current);
+        /*
+         * When a thread stops operating on an address space, the loop
+         * in membarrier_private_expedited() may not observe that
+         * tsk->mm, and the loop in membarrier_global_expedited() may
+         * not observe a MEMBARRIER_STATE_GLOBAL_EXPEDITED
+         * rq->membarrier_state, so those would not issue an IPI.
+         * Membarrier requires a memory barrier after accessing
+         * user-space memory, before clearing tsk->mm or the
+         * rq->membarrier_state.
+         */
+        smp_mb__after_spinlock();
+        local_irq_disable();
         current->mm = NULL;
-        mmap_read_unlock(mm);
+        membarrier_update_current_mm(NULL);
         enter_lazy_tlb(mm, current);
+        local_irq_enable();
         task_unlock(current);
+        mmap_read_unlock(mm);
         mm_update_next_owner(mm);
         mmput(mm);
         if (test_thread_flag(TIF_MEMDIE))
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index e23e74d..aac3292 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -76,6 +76,18 @@ void membarrier_exec_mmap(struct mm_struct *mm)
         this_cpu_write(runqueues.membarrier_state, 0);
 }

+void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+        struct rq *rq = this_rq();
+        int membarrier_state = 0;
+
+        if (next_mm)
+                membarrier_state = atomic_read(&next_mm->membarrier_state);
+        if (READ_ONCE(rq->membarrier_state) == membarrier_state)
+                return;
+        WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+
 static int membarrier_global_expedited(void)
 {
         int cpu;
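The state-copy logic of membarrier_update_current_mm() can be exercised outside the kernel with a minimal model. The sketch below is an assumption-laden illustration: `sketch_mm`, `sketch_rq`, and `sketch_update_current_mm()` are hypothetical names, C11 atomics stand in for atomic_read(), and the runqueue is passed explicitly rather than obtained with this_rq(). It does not model the memory barriers or the interrupt disabling that the patch adds around the call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the membarrier_state field of mm_struct. */
struct sketch_mm {
    atomic_int membarrier_state;
};

/* Hypothetical stand-in for the membarrier_state field of struct rq. */
struct sketch_rq {
    int membarrier_state;
};

/*
 * Mirror of the patch's logic: take the state from next_mm, or 0 when
 * the task is dropping its mm, and write it only if it differs, so an
 * unchanged runqueue is not dirtied.
 */
void sketch_update_current_mm(struct sketch_rq *rq, struct sketch_mm *next_mm)
{
    int state = 0;

    if (next_mm)
        state = atomic_load(&next_mm->membarrier_state);
    if (rq->membarrier_state == state)
        return;
    rq->membarrier_state = state;
}
```

Calling it with a NULL mm, as exit_mm() does in the patch, resets the modelled runqueue state to 0, which is what keeps global expedited IPIs from being sent to a CPU that no longer needs them.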
[tip: sched/core] sched: membarrier: cover kthread_use_mm (v4)
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     618758ed3a4f7d790414d020b362111748ebbf9f
Gitweb:        https://git.kernel.org/tip/618758ed3a4f7d790414d020b362111748ebbf9f
Author:        Mathieu Desnoyers
AuthorDate:    Tue, 20 Oct 2020 09:47:14 -04:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 29 Oct 2020 11:00:31 +01:00

sched: membarrier: cover kthread_use_mm (v4)

Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread use this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoy...@efficios.com
---
 kernel/kthread.c          | 21 +
 kernel/sched/idle.c       |  1 +
 kernel/sched/membarrier.c |  7 +++
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index e29773c..481428f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1248,6 +1248,7 @@ void kthread_use_mm(struct mm_struct *mm)
                 tsk->active_mm = mm;
         }
         tsk->mm = mm;
+        membarrier_update_current_mm(mm);
         switch_mm_irqs_off(active_mm, mm, tsk);
         local_irq_enable();
         task_unlock(tsk);
@@ -1255,8 +1256,19 @@ void kthread_use_mm(struct mm_struct *mm)
         finish_arch_post_lock_switch();
 #endif

+        /*
+         * When a kthread starts operating on an address space, the loop
+         * in membarrier_{private,global}_expedited() may not observe
+         * that tsk->mm, and not issue an IPI. Membarrier requires a
+         * memory barrier after storing to tsk->mm, before accessing
+         * user-space memory. A full memory barrier for membarrier
+         * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+         * mmdrop(), or explicitly with smp_mb().
+         */
         if (active_mm != mm)
                 mmdrop(active_mm);
+        else
+                smp_mb();

         to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1276,9 +1288,18 @@ void kthread_unuse_mm(struct mm_struct *mm)
         force_uaccess_end(to_kthread(tsk)->oldfs);

         task_lock(tsk);
+        /*
+         * When a kthread stops operating on an address space, the loop
+         * in membarrier_{private,global}_expedited() may not observe
+         * that tsk->mm, and not issue an IPI. Membarrier requires a
+         * memory barrier after accessing user-space memory, before
+         * clearing tsk->mm.
+         */
+        smp_mb__after_spinlock();
         sync_mm_rss(mm);
         local_irq_disable();
         tsk->mm = NULL;
+        membarrier_update_current_mm(NULL);
         /* active_mm is still 'mm' */
         enter_lazy_tlb(mm, tsk);
         local_irq_enable();
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 24d0ee2..846743e 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -338,6 +338,7 @@ void play_idle_precise(u64 duration_ns, u64 latency_ns)
         WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
         WARN_ON_ONCE(!(current->flags & PF_NO_SETAFFINITY));
         WARN_ON_ONCE(!duration_ns);
+        WARN_ON_ONCE(current->mm);

         rcu_sleep_check();
         preempt_disable();
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aac3292..f223f35 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -126,12 +126,11 @@ static int membarrier_global_expedited(void)
                         continue;

                 /*
-                 * Skip the CPU if it runs a kernel thread. The scheduler
-                 * leaves the prior task mm in place as an optimization when
-                 * scheduling a kthread.
+                 * Skip the CPU if it runs a kernel thread which is not using
+                 * a task mm.
                  */
                 p = rcu_dereference(cpu_rq(cpu)->curr);
-                if (p->flags & PF_KTHREAD)
+                if (!p->mm)
                         continue;

                 __cpumask_set_cpu(cpu, tmpmask);
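The effect of replacing the PF_KTHREAD test with the !p->mm test can be shown with a small truth-table model. This is a sketch under stated assumptions: `sketch_task`, `SKETCH_PF_KTHREAD`, and the two predicate functions are hypothetical, and the flag value is arbitrary rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value, used by this model only. */
#define SKETCH_PF_KTHREAD 0x1u

/* Hypothetical reduction of task_struct to the two fields the check reads. */
struct sketch_task {
    unsigned int flags;
    const void *mm;     /* NULL when the task is not using a user mm */
};

/* Old check: skip the CPU whenever its current task is a kthread. */
bool sketch_old_skips_cpu(const struct sketch_task *p)
{
    return (p->flags & SKETCH_PF_KTHREAD) != 0;
}

/* New check: skip the CPU only when its current task has no mm. */
bool sketch_new_skips_cpu(const struct sketch_task *p)
{
    return p->mm == NULL;
}
```

The two predicates disagree only for a kthread that has borrowed a user address space through kthread_use_mm(): the old check would skip it, while the new check includes it in the IPI mask.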
[tip: sched/core] sched: membarrier: document memory ordering scenarios
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     25595eb6aaa9fbb31330f1e0b400642694bc6574
Gitweb:        https://git.kernel.org/tip/25595eb6aaa9fbb31330f1e0b400642694bc6574
Author:        Mathieu Desnoyers
AuthorDate:    Tue, 20 Oct 2020 09:47:15 -04:00
Committer:     Peter Zijlstra
CommitterDate: Thu, 29 Oct 2020 11:00:31 +01:00

sched: membarrier: document memory ordering scenarios

Document membarrier ordering scenarios in membarrier.c. Thanks to Alan
Stern for refreshing my memory. Now that I have those in mind, it seems
appropriate to serialize them to comments for posterity.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20201020134715.13909-4-mathieu.desnoy...@efficios.com
---
 kernel/sched/membarrier.c | 128 +-
 1 file changed, 128 insertions(+)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index f223f35..5a40b38 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
 #include "sched.h"

 /*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *    barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ *         CPU0                              CPU1
+ *
+ *         x = 1
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r2 = y
+ *                                           y = 1
+ *                                           barrier()
+ *                                           r1 = x
+ *
+ *                     BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b). If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0. This violates the guarantee that membarrier() is
+ * supposed to provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *    barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ *         CPU0                              CPU1
+ *
+ *                                           x = 1
+ *                                           barrier()
+ *                                           y = 1
+ *         r2 = y
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                       IPI-induced mb
+ *           c: smp_mb()
+ *         r1 = x
+ *         BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b). If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0". This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           d: switch to kthread (includes mb)
+ *           b: read rq->curr->mm == NULL
+ *                                           e: switch to user (includes mb)
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B. Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B). Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *
[tip: sched/urgent] sched: Fix unreliable rseq cpu_id for new tasks
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db
Gitweb:        https://git.kernel.org/tip/ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db
Author:        Mathieu Desnoyers
AuthorDate:    Mon, 06 Jul 2020 16:49:10 -04:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 08 Jul 2020 11:38:50 +02:00

sched: Fix unreliable rseq cpu_id for new tasks

While integrating rseq into glibc and replacing glibc's sched_getcpu
implementation with rseq, glibc's tests discovered an issue with
incorrect __rseq_abi.cpu_id field value right after the first time
a newly created process issues sched_setaffinity.

For the records, it triggers after building glibc and running tests, and
then issuing:

  for x in {1..2000} ; do posix/tst-affinity-static & done

and shows up as:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

This is caused by the scheduler invoking __set_task_cpu() directly from
sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
is done by set_task_cpu().

Add the missing rseq_migrate() to both functions. The only other direct
use of __set_task_cpu() is done by init_idle(), which does not involve a
user-space task.

Based on my testing with the glibc test-case, just adding rseq_migrate()
to wake_up_new_task() is sufficient to fix the observed issue. Also add
it to sched_fork() to keep things consistent.

The reason why this never triggered so far with the rseq/basic_test
selftest is unclear.

The current use of sched_getcpu(3) does not typically require it to be
always accurate. However, use of the __rseq_abi.cpu_id field within rseq
critical sections requires it to be accurate. If it is not accurate, it
can cause corruption in the per-cpu data targeted by rseq critical
sections in user-space.

Reported-By: Florian Weimer
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Tested-By: Florian Weimer
Cc: sta...@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoy...@efficios.com
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 950ac45..e15543c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2965,6 +2965,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
          * Silence PROVE_RCU.
          */
         raw_spin_lock_irqsave(&p->pi_lock, flags);
+        rseq_migrate(p);
         /*
          * We're setting the CPU for the first time, we don't migrate,
          * so use __set_task_cpu().
@@ -3029,6 +3030,7 @@ void wake_up_new_task(struct task_struct *p)
          * as we're not fully set-up yet.
          */
         p->recent_used_cpu = task_cpu(p);
+        rseq_migrate(p);
         __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
         rq = __task_rq_lock(p, &rf);
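The user-visible symptom is a mismatch between the CPU a task has just pinned itself to and the CPU that sched_getcpu() reports. A rough userspace check in the spirit of the glibc test can be sketched as follows. The `sketch_*` helper names are hypothetical, and this is not the glibc tst-affinity-static test itself; it only uses the standard sched_getaffinity(), sched_setaffinity(), and sched_getcpu() calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the first CPU in the caller's affinity mask, or -1 on error. */
int sketch_first_allowed_cpu(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/*
 * Pin the calling thread to `cpu`, then compare against sched_getcpu().
 * Returns 1 on match, 0 on mismatch (the symptom described in the
 * commit message), -1 if the affinity call itself failed.
 */
int sketch_pin_and_check(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return sched_getcpu() == cpu;
}
```

A single call is unlikely to reproduce the race, which the commit message only observed when launching the test roughly 2000 times in parallel; the sketch shows what is being compared, not a reliable reproducer.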
[tip: sched/urgent] sched/membarrier: Fix p->mm->membarrier_state racy load
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     227a4aadc75ba22fcb6c4e1c078817b8cbaae4ce
Gitweb:        https://git.kernel.org/tip/227a4aadc75ba22fcb6c4e1c078817b8cbaae4ce
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:37:02 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:30 +02:00

sched/membarrier: Fix p->mm->membarrier_state racy load

The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 fs/exec.c                 |   2 +-
 include/linux/mm_types.h  |  14 ++-
 include/linux/sched/mm.h  |   8 +--
 kernel/sched/core.c       |   4 +-
 kernel/sched/membarrier.c | 175 +++--
 kernel/sched/sched.h      |  34 +++-
 6 files changed, 183 insertions(+), 54 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index f7f6a14..555e93c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1033,6 +1033,7 @@ static int exec_mmap(struct mm_struct *mm)
         }
         task_lock(tsk);
         active_mm = tsk->active_mm;
+        membarrier_exec_mmap(mm);
         tsk->mm = mm;
         tsk->active_mm = mm;
         activate_mm(active_mm, mm);
@@ -1825,7 +1826,6 @@ static int __do_execve_file(int fd, struct filename *filename,
         /* execve succeeded */
         current->fs->in_exec = 0;
         current->in_execve = 0;
-        membarrier_execve(current);
         rseq_execve(current);
         acct_update_integrals(current);
         task_numa_free(current, false);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a108..ec9bd3a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -383,6 +383,16 @@ struct mm_struct {
                 unsigned long highest_vm_end; /* highest vma end address */
                 pgd_t * pgd;

+#ifdef CONFIG_MEMBARRIER
+                /**
+                 * @membarrier_state: Flags controlling membarrier behavior.
+                 *
+                 * This field is close to @pgd to hopefully fit in the same
+                 * cache-line, which needs to be touched by switch_mm().
+                 */
+                atomic_t membarrier_state;
+#endif
+
                 /**
                  * @mm_users: The number of users including userspace.
                  *
@@ -452,9 +462,7 @@ struct mm_struct {
                 unsigned long flags; /* Must use atomic bitops to access */

                 struct core_state *core_state; /* coredumping support */
-#ifdef CONFIG_MEMBARRIER
-                atomic_t membarrier_state;
-#endif
+
 #ifdef CONFIG_AIO
                 spinlock_t ioctx_lock;
                 struct kioctx_table __rcu *ioctx_table;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 8557ec6..e677001 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -370,10 +370,8 @@ static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
         sync_core_before_usermode();
 }

-static inline void membarrier_execve(struct task_struct *t)
-{
-        atomic_set(&t->mm->membarrier_state, 0);
-}
+extern void membarrier_exec_mmap(struct mm_struct *mm);
+
 #else
 #ifdef CONFIG_ARCH_HAS_M
[tip: sched/urgent] selftests, sched/membarrier: Add multi-threaded test
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     19a4ff534bb09686f53800564cb977bad2177c00
Gitweb:        https://git.kernel.org/tip/19a4ff534bb09686f53800564cb977bad2177c00
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:37:03 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:31 +02:00

selftests, sched/membarrier: Add multi-threaded test

membarrier commands cover very different code paths if they are in
a single-threaded vs multi-threaded process. Therefore, exercise both
scenarios in the kernel selftests to increase coverage of this selftest.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Shuah Khan
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-6-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 tools/testing/selftests/membarrier/.gitignore                      |   3 +-
 tools/testing/selftests/membarrier/Makefile                        |   5 +-
 tools/testing/selftests/membarrier/membarrier_test.c               | 313 +-
 tools/testing/selftests/membarrier/membarrier_test_impl.h          | 317 ++-
 tools/testing/selftests/membarrier/membarrier_test_multi_thread.c  |  73 -
 tools/testing/selftests/membarrier/membarrier_test_single_thread.c |  24 +-
 6 files changed, 419 insertions(+), 316 deletions(-)
 delete mode 100644 tools/testing/selftests/membarrier/membarrier_test.c
 create mode 100644 tools/testing/selftests/membarrier/membarrier_test_impl.h
 create mode 100644 tools/testing/selftests/membarrier/membarrier_test_multi_thread.c
 create mode 100644 tools/testing/selftests/membarrier/membarrier_test_single_thread.c

diff --git a/tools/testing/selftests/membarrier/.gitignore b/tools/testing/selftests/membarrier/.gitignore
index 020c44f..f2f7ec0 100644
--- a/tools/testing/selftests/membarrier/.gitignore
+++ b/tools/testing/selftests/membarrier/.gitignore
@@ -1 +1,2 @@
-membarrier_test
+membarrier_test_multi_thread
+membarrier_test_single_thread
diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
index 97e3bdf..34d1c81 100644
--- a/tools/testing/selftests/membarrier/Makefile
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0-only
 CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lpthread

-TEST_GEN_PROGS := membarrier_test
+TEST_GEN_PROGS := membarrier_test_single_thread \
+        membarrier_test_multi_thread

 include ../lib.mk
-
diff --git a/tools/testing/selftests/membarrier/membarrier_test.c b/tools/testing/selftests/membarrier/membarrier_test.c
deleted file mode 100644
index 70b4ddb..000
--- a/tools/testing/selftests/membarrier/membarrier_test.c
+++ /dev/null
@@ -1,313 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#define _GNU_SOURCE
-#include
-#include
-#include
-#include
-#include
-
-#include "../kselftest.h"
-
-static int sys_membarrier(int cmd, int flags)
-{
-        return syscall(__NR_membarrier, cmd, flags);
-}
-
-static int test_membarrier_cmd_fail(void)
-{
-        int cmd = -1, flags = 0;
-        const char *test_name = "sys membarrier invalid command";
-
-        if (sys_membarrier(cmd, flags) != -1) {
-                ksft_exit_fail_msg(
-                        "%s test: command = %d, flags = %d. Should fail, but passed\n",
-                        test_name, cmd, flags);
-        }
-        if (errno != EINVAL) {
-                ksft_exit_fail_msg(
-                        "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
-                        test_name, flags, EINVAL, strerror(EINVAL),
-                        errno, strerror(errno));
-        }
-
-        ksft_test_result_pass(
-                "%s test: command = %d, flags = %d, errno = %d. Failed as expected\n",
-                test_name, cmd, flags, errno);
-        return 0;
-}
-
-static int test_membarrier_flags_fail(void)
-{
-        int cmd = MEMBARRIER_CMD_QUERY, flags = 1;
-        const char *test_name = "sys membarrier MEMBARRIER_CMD_QUERY invalid flags";
-
-        if (sys_membarrier(cmd, flags) != -1) {
-                ksft_exit_fail_msg(
-                        "%s test: flags = %d. Should fail, but passed\n",
-                        test_name, flags);
-        }
-        if (errno != EINVAL) {
-                ksft_exit_fail_msg(
-                        "%s test: flags = %d. Should return (%d: \"%s\"), but returned (%d: \"%s\").\n",
-                        test_name, flags, EINVAL, strerror(EINVAL),
-                        errno, strerror(errno)
[tip: sched/urgent] sched/membarrier: Call sync_core only before usermode for same mm
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     2840cf02fae627860156737e83326df354ee4ec6
Gitweb:        https://git.kernel.org/tip/2840cf02fae627860156737e83326df354ee4ec6
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:37:01 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:30 +02:00

sched/membarrier: Call sync_core only before usermode for same mm

When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

Suggested-by: Oleg Nesterov
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 include/linux/sched/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 4a79440..8557ec6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -362,6 +362,8 @@ enum {

 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+        if (current->mm != mm)
+                return;
         if (likely(!(atomic_read(&mm->membarrier_state) &
                      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
                 return;
[tip: sched/urgent] sched/membarrier: Skip IPIs when mm->mm_users == 1
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     c6d68c1c4a4d6611fc0f8145d764226571d737ca
Gitweb:        https://git.kernel.org/tip/c6d68c1c4a4d6611fc0f8145d764226571d737ca
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:37:04 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:31 +02:00

sched/membarrier: Skip IPIs when mm->mm_users == 1

If there is only a single mm_user for the mm, the private expedited
membarrier command can skip the IPIs, because only a single thread
is using the mm.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-7-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/membarrier.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 070cf43..fced54a 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -145,20 +145,21 @@ static int membarrier_private_expedited(int flags)
         int cpu;
         bool fallback = false;
         cpumask_var_t tmpmask;
+        struct mm_struct *mm = current->mm;

         if (flags & MEMBARRIER_FLAG_SYNC_CORE) {
                 if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
                         return -EINVAL;
-                if (!(atomic_read(&current->mm->membarrier_state) &
+                if (!(atomic_read(&mm->membarrier_state) &
                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
                         return -EPERM;
         } else {
-                if (!(atomic_read(&current->mm->membarrier_state) &
+                if (!(atomic_read(&mm->membarrier_state) &
                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
                         return -EPERM;
         }

-        if (num_online_cpus() == 1)
+        if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)
                 return 0;

         /*
@@ -194,7 +195,7 @@ static int membarrier_private_expedited(int flags)
                         continue;
                 rcu_read_lock();
                 p = rcu_dereference(cpu_rq(cpu)->curr);
-                if (p && p->mm == current->mm) {
+                if (p && p->mm == mm) {
                         if (!fallback)
                                 __cpumask_set_cpu(cpu, tmpmask);
                         else
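The early-return condition introduced by this patch reduces to a two-input predicate, sketched below. The function name `sketch_private_expedited_needs_ipis()` is hypothetical, and plain integers stand in for atomic_read(&mm->mm_users) and num_online_cpus().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns false when the private expedited command can return early
 * without sending IPIs: either the caller is the only user of the mm,
 * or there is only one online CPU.
 */
bool sketch_private_expedited_needs_ipis(int mm_users, int online_cpus)
{
    if (mm_users == 1 || online_cpus == 1)
        return false;
    return true;
}
```

For example, a process with a single mm user on an eight-CPU machine takes the early return, while a process with two mm users on the same machine proceeds to the IPI loop.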
[tip: sched/urgent] sched/membarrier: Remove redundant check
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     09554009c0cad4cb2223dd943c813c9257c6883a
Gitweb:        https://git.kernel.org/tip/09554009c0cad4cb2223dd943c813c9257c6883a
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:37:00 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:30 +02:00

sched/membarrier: Remove redundant check

Checking that the number of threads is 1 is redundant with checking
mm_users == 1.

No change in functionality intended.

Suggested-by: Oleg Nesterov
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-3-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/membarrier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index d48b95f..7ccbd0e 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -186,7 +186,7 @@ static int membarrier_register_global_expedited(void)
             MEMBARRIER_STATE_GLOBAL_EXPEDITED_READY)
                 return 0;
         atomic_or(MEMBARRIER_STATE_GLOBAL_EXPEDITED, &mm->membarrier_state);
-        if (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1) {
+        if (atomic_read(&mm->mm_users) == 1) {
                 /*
                  * For single mm user, single threaded process, we can
                  * simply issue a memory barrier after setting
@@ -232,7 +232,7 @@ static int membarrier_register_private_expedited(int flags)
         if (flags & MEMBARRIER_FLAG_SYNC_CORE)
                 atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE,
                           &mm->membarrier_state);
-        if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+        if (atomic_read(&mm->mm_users) != 1) {
                 /*
                  * Ensure all future scheduler executions will observe the
                  * new thread flag state for this process.
[tip: sched/urgent] sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     c172e0a3e8e65a4c6fffec5bc4d6de08d6f894f7
Gitweb:        https://git.kernel.org/tip/c172e0a3e8e65a4c6fffec5bc4d6de08d6f894f7
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:37:05 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:31 +02:00

sched/membarrier: Return -ENOMEM to userspace on memory allocation failure

Remove the IPI fallback code from membarrier to deal with very
infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
than GFP_NOWAIT, and relax the blocking guarantees for the expedited
membarrier system call commands, allowing it to block if waiting for
memory to be made available.

In addition, now -ENOMEM can be returned to user-space if the cpumask
memory allocation fails.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-8-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/membarrier.c | 63 --
 1 file changed, 20 insertions(+), 43 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index fced54a..a39bed2 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -66,7 +66,6 @@ void membarrier_exec_mmap(struct mm_struct *mm)
 static int membarrier_global_expedited(void)
 {
 	int cpu;
-	bool fallback = false;
 	cpumask_var_t tmpmask;
 
 	if (num_online_cpus() == 1)
@@ -78,15 +77,8 @@ static int membarrier_global_expedited(void)
 	 */
 	smp_mb();	/* system call entry is not a mb. */
 
-	/*
-	 * Expedited membarrier commands guarantee that they won't
-	 * block, hence the GFP_NOWAIT allocation flag and fallback
-	 * implementation.
-	 */
-	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
-		/* Fallback for OOM. */
-		fallback = true;
-	}
+	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+		return -ENOMEM;
 
 	cpus_read_lock();
 	rcu_read_lock();
@@ -117,18 +109,15 @@ static int membarrier_global_expedited(void)
 		if (p->flags & PF_KTHREAD)
 			continue;
 
-		if (!fallback)
-			__cpumask_set_cpu(cpu, tmpmask);
-		else
-			smp_call_function_single(cpu, ipi_mb, NULL, 1);
+		__cpumask_set_cpu(cpu, tmpmask);
 	}
 	rcu_read_unlock();
-	if (!fallback) {
-		preempt_disable();
-		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
-		preempt_enable();
-		free_cpumask_var(tmpmask);
-	}
+
+	preempt_disable();
+	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+	preempt_enable();
+
+	free_cpumask_var(tmpmask);
 	cpus_read_unlock();
 
 	/*
@@ -143,7 +132,6 @@ static int membarrier_global_expedited(void)
 static int membarrier_private_expedited(int flags)
 {
 	int cpu;
-	bool fallback = false;
 	cpumask_var_t tmpmask;
 	struct mm_struct *mm = current->mm;
 
@@ -168,15 +156,8 @@ static int membarrier_private_expedited(int flags)
 	 */
 	smp_mb();	/* system call entry is not a mb. */
 
-	/*
-	 * Expedited membarrier commands guarantee that they won't
-	 * block, hence the GFP_NOWAIT allocation flag and fallback
-	 * implementation.
-	 */
-	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
-		/* Fallback for OOM. */
-		fallback = true;
-	}
+	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+		return -ENOMEM;
 
 	cpus_read_lock();
 	rcu_read_lock();
@@ -195,20 +176,16 @@ static int membarrier_private_expedited(int flags)
 			continue;
 		rcu_read_lock();
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p && p->mm == mm) {
-			if (!fallback)
-				__cpumask_set_cpu(cpu, tmpmask);
-			else
-				smp_call_function_single(cpu, ipi_mb, NULL, 1);
-		}
+		if (p && p->mm == mm)
+			__cpumask_set_cpu(cpu, tmpmask);
 	}
 	rcu_read_unlock();
-	if (!fallback) {
-		preempt_disable();
-		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
-		preempt_enable();
-		free_cpumask_var(tmpmask);
-	}
+
+	preempt_disable();
+	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
+	preempt_enable();
+
+
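Since this change lets the expedited commands fail with -ENOMEM, a user-space caller may want to handle that error explicitly. The following is only a sketch of one possible policy, not something the patch prescribes: it assumes that falling back to the non-expedited MEMBARRIER_CMD_GLOBAL is acceptable for the caller. The names membarrier_fn, sys_membarrier, stub_enomem and global_barrier are hypothetical; stub_enomem exists only to exercise the fallback path without real memory pressure.

```c
#include <errno.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical signature shared by the real wrapper and the stub. */
typedef int (*membarrier_fn)(int cmd, unsigned int flags);

/* Thin wrapper that issues the membarrier system call via syscall(2). */
static inline int sys_membarrier(int cmd, unsigned int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags, 0);
}

/* Hypothetical stub: every expedited request fails with ENOMEM. */
static inline int stub_enomem(int cmd, unsigned int flags)
{
	(void)flags;
	if (cmd == MEMBARRIER_CMD_GLOBAL_EXPEDITED) {
		errno = ENOMEM;
		return -1;
	}
	return 0;
}

/*
 * Try the expedited global barrier first. If it fails with ENOMEM,
 * retry with the non-expedited command (an assumption of this sketch).
 * Returns 0 on success, -1 with errno set otherwise.
 */
static inline int global_barrier(membarrier_fn call)
{
	if (call(MEMBARRIER_CMD_GLOBAL_EXPEDITED, 0) == 0)
		return 0;
	if (errno != ENOMEM)
		return -1;
	return call(MEMBARRIER_CMD_GLOBAL, 0);
}
```

Real code would call global_barrier(sys_membarrier); passing stub_enomem instead shows the fallback being taken. Note that the expedited global command only targets processes that registered for it, so the fallback here is a broader and slower barrier.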
[tip: sched/urgent] sched/membarrier: Fix private expedited registration check
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     fc0d77387cb5ae883fd774fc559e056a8dde024c
Gitweb:        https://git.kernel.org/tip/fc0d77387cb5ae883fd774fc559e056a8dde024c
Author:        Mathieu Desnoyers
AuthorDate:    Thu, 19 Sep 2019 13:36:59 -04:00
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:30 +02:00

sched/membarrier: Fix private expedited registration check

Fix a logic flaw in the way membarrier_register_private_expedited()
handles ready state checks for private expedited sync core and private
expedited registrations.

If a private expedited membarrier registration is first performed, and
then a private expedited sync_core registration is performed, the ready
state check will skip the second registration when it really should not.

Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoy...@efficios.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/membarrier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b14250a..d48b95f 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -226,7 +226,7 @@ static int membarrier_register_private_expedited(int flags)
 	 * groups, which use the same mm. (CLONE_VM but not
 	 * CLONE_THREAD).
 	 */
-	if (atomic_read(&mm->membarrier_state) & state)
+	if ((atomic_read(&mm->membarrier_state) & state) == state)
 		return 0;
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
 	if (flags & MEMBARRIER_FLAG_SYNC_CORE)
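
The difference between the old and new test can be shown outside the kernel. The sketch below assumes that the sync_core request checks a mask with more than one ready bit set, which is what the changelog's scenario implies; the bit values and the names ready_any and ready_all are hypothetical and are not taken from the kernel headers.

```c
#include <stdbool.h>

/* Illustrative bit values only; assumptions for this sketch. */
enum {
	PRIVATE_EXPEDITED_READY           = 1U << 0,
	PRIVATE_EXPEDITED_SYNC_CORE_READY = 1U << 1,
};

/* Old form of the check: true as soon as ANY requested bit is set. */
static inline bool ready_any(unsigned int cur, unsigned int wanted)
{
	return (cur & wanted) != 0;
}

/* New form of the check: true only when ALL requested bits are set. */
static inline bool ready_all(unsigned int cur, unsigned int wanted)
{
	return (cur & wanted) == wanted;
}
```

With cur holding only PRIVATE_EXPEDITED_READY (a plain registration was done first) and wanted holding both bits, ready_any() reports true and the second registration would be skipped, while ready_all() reports false and lets it proceed.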