When looking to test SCHED_DEADLINE, I triggered a lockup. The lockup appears to be caused by WARN_ON() done inside the scheduling path, and I'm guessing it tried to grab the rq lock and caused a deadlock (all I would get would be the "--- cut here ---" line, and then nothing after that. But a bit of playing with the printks() I figured out it was happening at:
assert_clock_updated() { SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP); } I bisected it down to commit e0367b126 ("sched/deadline: Move CPU frequency selection triggering points"). Reverting it indeed makes the deadlock go away. I commented out the WARN_ON that was being triggered, and that let the system continue to run. I did the following change: --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -914,7 +914,12 @@ static inline void assert_clock_updated(struct rq *rq) * The only reason for not seeing a clock update since the * last rq_pin_lock() is if we're currently skipping updates. */ - SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP); + if (rq->clock_update_flags < RQCF_ACT_SKIP) + trace_printk("WARN_ON: [%d] rq->clock_update_flags (%d) < %d)\n", + rq->cpu, rq->clock_update_flags, RQCF_ACT_SKIP); + else + trace_printk("GOOD: [%d] rq->clock_update_flags (%d) >= %d)\n", + rq->cpu, rq->clock_update_flags, RQCF_ACT_SKIP); } To see what was happening. I also, added trace_printk()s to all the updates to clock_update_flags, and ran my test again. Here's what I got: deadline_test-1393 [002] 162.127132: bprint: push_dl_task.part.40: WARN_ON: [1] rq->clock_update_flags (0) < 2) deadline_test-1393 [002] 162.127133: bprint: update_rq_clock: [1] clock was 0 now 4 deadline_test-1393 [002] 162.127134: bprint: rq_clock: GOOD: [1] rq->clock_update_flags (4) >= 2) It appears that we hit this in this path: push_dl_task { add_running_bw() { __add_running_bw() { cpufreq_update_util() { data->func(data, rq_clock(rq), flags); rq_clock() { assert_clock_updated() And here the clock isn't updated and we get the splat. Reverting the stated patch works because it added the call to cpufreq_update_util() that does the rq_clock() at an inappropriate time, which causes the splat. I'm not sure what the right answer to this is. Reverting obviously works, but I'm also guessing a proper placement of update_rq_clock() may also work. I just don't know where that placement is, as I don't understand the rq_clock() updates enough. Help? -- Steve