>On Fri 2021-03-19 16:00:36, Wang Qing wrote: >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu >> updated, while the unbound worker_pool running on its core uses >> wq_watchdog_touched to determine whether locked up. This may be mischecked. > >By other words, unbound workqueues are not aware of the more common >touch_softlockup_watchdog() because it updates only >wq_watchdog_touched_cpu for the affected CPU. As a result, >the workqueue watchdog might report lockup in unbound workqueue >even though it is blocked by a known slow code.
Yes, this is the problem I'm talking about. > >> My suggestion is to update both when touch_softlockup_watchdog() is called, >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched >> to check unbound worker_pool. >> >> Signed-off-by: Wang Qing <wangq...@vivo.com> >> --- >> kernel/watchdog.c | 5 +++-- >> kernel/workqueue.c | 17 ++++++----------- >> 2 files changed, 9 insertions(+), 13 deletions(-) >> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c >> index 7110906..107bc38 >> --- a/kernel/watchdog.c >> +++ b/kernel/watchdog.c >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void) >> * update as well, the only side effect might be a cycle delay for >> * the softlockup check. >> */ >> - for_each_cpu(cpu, &watchdog_allowed_mask) >> + for_each_cpu(cpu, &watchdog_allowed_mask) { >> per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET; >> - wq_watchdog_touch(-1); >> + wq_watchdog_touch(cpu); > >Note that wq_watchdog_touch(cpu) newly always updates >wq_watchdog_touched. This cycle will set the same jiffies >value cpu-times to the same variable. > Although there is a bit of redundancy here, but the most concise way of implementation, and it is certain that it will not affect performance. >> + } >> } >> >> void touch_softlockup_watchdog_sync(void) >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c >> index 0d150da..be08295 >> --- a/kernel/workqueue.c >> +++ b/kernel/workqueue.c >> @@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list >> *unused) >> continue; >> >> /* get the latest of pool and touched timestamps */ >> + if (pool->cpu >= 0) >> + touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, >> pool->cpu)); >> + else >> + touched = READ_ONCE(wq_watchdog_touched); >> pool_ts = READ_ONCE(pool->watchdog_ts); >> - touched = READ_ONCE(wq_watchdog_touched); >> >> if (time_after(pool_ts, touched)) >> ts = pool_ts; >> else >> ts = touched; >> >> - if (pool->cpu >= 0) { >> - unsigned long cpu_touched = >> - READ_ONCE(per_cpu(wq_watchdog_touched_cpu, >> - pool->cpu)); >> - if (time_after(cpu_touched, ts)) >> - ts = cpu_touched; >> - } >> - >> /* did we stall? */ >> if (time_after(jiffies, ts + thresh)) { >> lockup_detected = true; >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu) >> { >> if (cpu >= 0) >> per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies; >> - else >> - wq_watchdog_touched = jiffies; >> + >> + wq_watchdog_touched = jiffies; >> } >> >> static void wq_watchdog_set_thresh(unsigned long thresh) > >This last hunk is enough to fix the problem. wq_watchdog_touched will >get updated also from cpu-specific touch_softlockup_watchdog(). Not enough in fact, because wq_watchdog_touched was updated in wq_watchdog_touch(), so wq_watchdog_touched can no longer be used to judge the bound pool, we must update every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound judgment. Thanks, Qing Wang > >The original patch simplified the logic of wq_watchdog_timer_fn(). >But it added un-necessary assignments into >touch_all_softlockup_watchdogs(void). > >I do not have strong opinion what solution is better. I slightly >prefer to keep only this last hunk. > >Best Regards, >Petr