On Wed, Aug 06, 2014 at 05:29:48AM +0800, Fengguang Wu wrote:
> Hi Peter,
>
> FYI, we noticed the below changes on
>
> git://internal_merge_and_test_tree devel-athens-alpha-201408042000
> commit 36932a9e2b1043506d06414fa988cd5b9592841f ("sched: Fix finish_task_switch vs prev_state")
>
> test case: vm-vp-1G/boot/1
>
> 11b4855f1d9f1cc  36932a9e2b1043506d06414fa
> ---------------  -------------------------
>       8524 ± 1%    +121.2%      18858 ± 0%  TOTAL boot-meminfo.SUnreclaim
>       1321 ± 9%   +3196.6%      43568 ± 0%  TOTAL boot-meminfo.KernelStack
>       8756 ± 0%     +29.8%      11363 ± 0%  TOTAL boot-slabinfo.num_pages
>      34982 ± 0%     +29.5%      45311 ± 0%  TOTAL boot-meminfo.Slab
>     535601 ± 0%      -9.9%     482375 ± 0%  TOTAL boot-meminfo.MemFree
>     116621 ± 0%      +9.4%     127548 ± 0%  TOTAL boot-slabinfo.num_objs
*blink*.. how does that happen?

For reference, that's the below patch (I've since given it an actual
Changelog, but the patch is the same).

---
Subject: sched: Fix finish_task_switch vs prev_state
From: Peter Zijlstra <pet...@infradead.org>
Date: Tue, 29 Jul 2014 11:22:37 +0200

Oleg wondered about the prev_state comment in finish_task_switch().
Aside from it being confusingly worded -- neither of us initially
understood the actual problem being described -- we found that we'd
broken the scenario it describes.

Ever since commit e4a52bcb9a18 ("sched: Remove rq->lock from the first
half of ttwu()") we don't actually acquire rq->lock on wakeups anymore,
and therefore holding rq->lock after the switch is no good. Even if we
did, __ARCH_WANT_UNLOCKED_CTXSW was already broken, because it
explicitly didn't hold the rq->lock anymore.

We could fix things by placing a full barrier between the prev->state
read and the next->on_cpu = 0 store, seeing how the remote wakeup code
waits for ->on_cpu to become false before proceeding with the wakeup
(so as to avoid having the task running on two CPUs at the same time);
however, full barriers are expensive.

Instead we read prev->state before the context switch and propagate it
into finish_task_switch(). This trivially solves the problem without
adding extra (and costly) memory barriers.

I'm not sure why we've never seen crashes due to this; it appears
entirely possible.
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Cc: Ingo Molnar <mi...@kernel.org>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Dave Jones <da...@redhat.com>
Cc: Andrey Ryabinin <a.ryabi...@samsung.com>
Cc: Sasha Levin <sasha.le...@oracle.com>
Cc: Manfred Spraul <manf...@colorfullife.com>
Reported-by: Oleg Nesterov <o...@redhat.com>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
Link: http://lkml.kernel.org/r/20140729092237.gu12...@laptop.lan
---
 kernel/sched/core.c |   36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2190,6 +2190,7 @@ prepare_task_switch(struct rq *rq, struc
  * finish_task_switch - clean up after a task-switch
  * @rq: runqueue associated with task-switch
  * @prev: the thread we just switched away from.
+ * @prev_state: the state of @prev before we switched away from it.
  *
  * finish_task_switch must be called after the context switch, paired
  * with a prepare_task_switch call before the context switch.
@@ -2201,26 +2202,14 @@ prepare_task_switch(struct rq *rq, struc
  * with the lock held can cause deadlocks; see schedule() for
  * details.)
  */
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static void
+finish_task_switch(struct rq *rq, struct task_struct *prev, long prev_state)
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
 
 	rq->prev_mm = NULL;
 
-	/*
-	 * A task struct has one reference for the use as "current".
-	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 * Manfred Spraul <manf...@colorfullife.com>
-	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2279,7 +2268,7 @@ asmlinkage __visible void schedule_tail(
 {
 	struct rq *rq = this_rq();
 
-	finish_task_switch(rq, prev);
+	finish_task_switch(rq, prev, 0);
 
 	/*
 	 * FIXME: do we need to worry about rq being invalidated by the
@@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	/*
+	 * A task struct has one reference for the use as "current".
+	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
+	 * schedule one last time. The schedule call will never return, and
+	 * the scheduled task must drop that reference.
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, and then the reference would be dropped twice.
+	 *
+	 * We avoid the race by observing prev->state while it is still
+	 * current.
+	 */
+	long prev_state = prev->state;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2347,7 +2351,7 @@ context_switch(struct rq *rq, struct tas
 	 * CPUs since it called schedule(), thus the 'rq' on its stack
 	 * frame will be invalid.
 	 */
-	finish_task_switch(this_rq(), prev);
+	finish_task_switch(this_rq(), prev, prev_state);
 }
 
 /*