On Thu, May 30, 2019 at 05:58:09AM -0700, Paul E. McKenney wrote:
> It turns out that tick_broadcast_offline() was an innocent bystander.
> After all, interrupts are supposed to be disabled throughout
> take_cpu_down(), and therefore should have been disabled upon entry to
> tick_offline_cpu() and thus to tick_broadcast_offline(). This suggests
> that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
> and leaving them enabled on return.
>
> Some debugging code showed that the culprit was sched_cpu_dying().
> It had irqs enabled after return from sched_tick_stop(). Which in turn
> had irqs enabled after return from cancel_delayed_work_sync(). Which is a
> wrapper around __cancel_work_timer(). Which can sleep in the case where
> something else is concurrently trying to cancel the same delayed work,
> and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
> idea when you are invoked from take_cpu_down(), regardless of the state
> you leave interrupts in upon return.

Nice catch! Sorry for leaving that puzzle behind.
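For the record, the debugging code in question is presumably nothing fancier than an irqs_disabled() assertion after each suspect call; a hypothetical sketch (not the actual debug patch) would look like:

#include <linux/bug.h>
#include <linux/irqflags.h>

/*
 * Hypothetical debug aid, not the code actually used: check that a step
 * in the take_cpu_down() path returns with interrupts still disabled.
 */
static inline void assert_irqs_still_disabled(const char *what)
{
        WARN_ONCE(!irqs_disabled(), "%s returned with IRQs enabled\n", what);
}

Sprinkled after sched_tick_stop() and the other steps of sched_cpu_dying(), that points straight at the offender.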
>
> Code inspection located no reason why the delayed work absolutely
> needed to be canceled from sched_tick_stop(): The work is not
> bound to the outgoing CPU by design, given that the whole point is
> to collect statistics without disturbing the outgoing CPU.
>
> This commit therefore simply drops the cancel_delayed_work_sync() from
> sched_tick_stop(). Instead, a new ->state field is added to the tick_work
> structure so that the delayed-work handler function sched_tick_remote()
> can avoid reposting itself. A cpu_is_offline() check is also added to
> sched_tick_remote() to avoid mucking with the state of an offlined CPU
> (though it does appear safe to do so).

I can't guarantee myself that it is safe to run the tick for an offline CPU. Better to have that check indeed.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..9a10ee9afcbf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3050,14 +3050,44 @@ void scheduler_tick(void)
>
>  struct tick_work {
>          int cpu;
> +        int state;
>          struct delayed_work work;
>  };
> +// Values for ->state, see diagram below.
> +#define TICK_SCHED_REMOTE_IDLE 0

So it took me some time to understand that the IDLE state means the tick work has acknowledged OFFLINING and finally completed the offline process. Perhaps we can rename it to TICK_SCHED_REMOTE_OFFLINE so that the state machine scenario is instantly clear.

> +#define TICK_SCHED_REMOTE_RUNNING 1
> +#define TICK_SCHED_REMOTE_OFFLINING 2
> +
> +// State diagram for ->state:
> +//
> +//
> +// +----->IDLE-----------------------------+
> +// |                                       |
> +// |                                       |
> +// |                                       | sched_tick_start()
> +// |          sched_tick_remote()          |
> +// |                                       |
> +// |                                       V
> +// |                    +---------->RUNNING
> +// |                    |                 |
> +// |                    |                 |
> +// |                    |                 |
> +// | sched_tick_start() |                 | sched_tick_stop()
> +// |                    |                 |
> +// |                    |                 |
> +// |                    |                 |
> +// +--------------------OFFLINING<--------+
> +//
> +//
> +// Other transitions get WARN_ON_ONCE(), except that sched_tick_remote()
> +// and sched_tick_start() are happy to leave the state in RUNNING.
>
>  static struct tick_work __percpu *tick_work_cpu;
>
>  static void sched_tick_remote(struct work_struct *work)
>  {
>          struct delayed_work *dwork = to_delayed_work(work);
> +        int os;
>          struct tick_work *twork = container_of(dwork, struct tick_work, work);
>          int cpu = twork->cpu;
>          struct rq *rq = cpu_rq(cpu);
> @@ -3077,7 +3107,7 @@ static void sched_tick_remote(struct work_struct *work)
>
>          rq_lock_irq(rq, &rf);
>          curr = rq->curr;
> -        if (is_idle_task(curr))
> +        if (is_idle_task(curr) || cpu_is_offline(cpu))

Or we could simply check rq->online, while we have rq locked.
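That is, roughly this on top of the patch (untested sketch; sched_tick_remote() is NO_HZ_FULL-only code and NO_HZ_FULL depends on SMP, so the ->online field of struct rq should always be there):

-        if (is_idle_task(curr) || cpu_is_offline(cpu))
+        if (is_idle_task(curr) || !rq->online)
                 goto out_unlock;

rq->online is flipped by set_rq_online()/set_rq_offline(), which as far as I can see are always called with the rq lock held, and we already hold it at this point.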
>                  goto out_unlock;
>
>          update_rq_clock(rq);
> @@ -3097,13 +3127,22 @@ static void sched_tick_remote(struct work_struct *work)
>          /*
>           * Run the remote tick once per second (1Hz). This arbitrary
>           * frequency is large enough to avoid overload but short enough
> -         * to keep scheduler internal stats reasonably up to date.
> +         * to keep scheduler internal stats reasonably up to date. But
> +         * first update state to reflect hotplug activity if required.
>           */
> -        queue_delayed_work(system_unbound_wq, dwork, HZ);
> +        do {
> +                os = READ_ONCE(twork->state);
> +                WARN_ON_ONCE(os == TICK_SCHED_REMOTE_IDLE);
> +                if (os == TICK_SCHED_REMOTE_RUNNING)
> +                        break;
> +        } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_IDLE) != os);
> +        if (os == TICK_SCHED_REMOTE_RUNNING)
> +                queue_delayed_work(system_unbound_wq, dwork, HZ);
>  }
>
>  static void sched_tick_start(int cpu)
>  {
> +        int os;
>          struct tick_work *twork;
>
>          if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3112,14 +3151,23 @@ static void sched_tick_start(int cpu)
>          WARN_ON_ONCE(!tick_work_cpu);
>
>          twork = per_cpu_ptr(tick_work_cpu, cpu);
> -        twork->cpu = cpu;
> -        INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> -        queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +        do {
> +                os = READ_ONCE(twork->state);
> +                if (os == TICK_SCHED_REMOTE_RUNNING)

Is it possible to have RUNNING at this stage? sched_tick_start() shouldn't be called twice without a sched_tick_stop() in the middle. In that case we should even warn if we meet that value here.

> +                        break;
> +                // Either idle or offline for a short period
> +        } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
> +        if (os == TICK_SCHED_REMOTE_IDLE) {
> +                twork->cpu = cpu;
> +                INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
> +                queue_delayed_work(system_unbound_wq, &twork->work, HZ);
> +        }
>  }
>
>  #ifdef CONFIG_HOTPLUG_CPU
>  static void sched_tick_stop(int cpu)
>  {
> +        int os;
>          struct tick_work *twork;
>
>          if (housekeeping_cpu(cpu, HK_FLAG_TICK))
> @@ -3128,7 +3176,13 @@ static void sched_tick_stop(int cpu)
>          WARN_ON_ONCE(!tick_work_cpu);
>
>          twork = per_cpu_ptr(tick_work_cpu, cpu);
> -        cancel_delayed_work_sync(&twork->work);
> +        // There cannot be competing actions, but don't rely on stop_machine.
> +        do {
> +                os = READ_ONCE(twork->state);
> +                WARN_ON_ONCE(os != TICK_SCHED_REMOTE_RUNNING);
> +                // Either idle or offline for a short period
> +        } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_OFFLINING) != os);
> +        // Don't cancel, as this would mess up the state machine.
>  }
>  #endif /* CONFIG_HOTPLUG_CPU */

I ran the state machine for a few hours inside my head through FWEISBEC_TORTURE and saw no error message. Which is of course not a guarantee of anything, but:

Reviewed-by: Frederic Weisbecker <frede...@kernel.org>

Also, I believe that /* */ is the preferred comment style, FWIW ;-)
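To make the two suggestions above concrete, here is roughly what I have in mind for the rename plus the warning in sched_tick_start(), on top of this patch (sketch only, untested):

/* Values for ->state; OFFLINE is the renamed IDLE and stays 0, the initial value. */
#define TICK_SCHED_REMOTE_OFFLINE   0
#define TICK_SCHED_REMOTE_RUNNING   1
#define TICK_SCHED_REMOTE_OFFLINING 2

static void sched_tick_start(int cpu)
{
        int os;
        struct tick_work *twork;

        if (housekeeping_cpu(cpu, HK_FLAG_TICK))
                return;

        WARN_ON_ONCE(!tick_work_cpu);

        twork = per_cpu_ptr(tick_work_cpu, cpu);
        do {
                os = READ_ONCE(twork->state);
                /* Two starts without a stop in between would be a bug. */
                if (WARN_ON_ONCE(os == TICK_SCHED_REMOTE_RUNNING))
                        break;
                /* Either offline, or offlining for a short period. */
        } while (cmpxchg(&twork->state, os, TICK_SCHED_REMOTE_RUNNING) != os);
        if (os == TICK_SCHED_REMOTE_OFFLINE) {
                twork->cpu = cpu;
                INIT_DELAYED_WORK(&twork->work, sched_tick_remote);
                queue_delayed_work(system_unbound_wq, &twork->work, HZ);
        }
}

The rest of the state machine only needs the matching s/IDLE/OFFLINE/ rename.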