>On Tue 2021-03-23 20:37:35, 王擎 wrote: >> >> >On Fri 2021-03-19 16:00:36, Wang Qing wrote: >> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu >> >> updated, while the unbound worker_pool running on its core uses >> >> wq_watchdog_touched to determine whether locked up. This may be >> >> mischecked. >> > >> >By other words, unbound workqueues are not aware of the more common >> >touch_softlockup_watchdog() because it updates only >> >wq_watchdog_touched_cpu for the affected CPU. As a result, >> >the workqueue watchdog might report lockup in unbound workqueue >> >even though it is blocked by a known slow code. >> >> Yes, this is the problem I'm talking about. > >I thought more about it. This patch prevents a false positive. >Could it bring an opposite problem and hide real problems? > >I mean that an unbound workqueue might get blocked on CPU A >because of a real softlockup. But we might not notice it because >CPU B is touched. Well, there are other ways how to detect >this situation, e.g. the softlockup watchdog. > > >> >> My suggestion is to update both when touch_softlockup_watchdog() is >> >> called, >> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched >> >> to check unbound worker_pool. >> >> >> >> Signed-off-by: Wang Qing <wangq...@vivo.com> >> >> --- >> >> kernel/watchdog.c | 5 +++-- >> >> kernel/workqueue.c | 17 ++++++----------- >> >> 2 files changed, 9 insertions(+), 13 deletions(-) >> >> >> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c >> >> index 7110906..107bc38 >> >> --- a/kernel/watchdog.c >> >> +++ b/kernel/watchdog.c >> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void) >> >> * update as well, the only side effect might be a cycle delay for >> >> * the softlockup check. >> >> */ >> >> - for_each_cpu(cpu, &watchdog_allowed_mask) >> >> + for_each_cpu(cpu, &watchdog_allowed_mask) { >> >> per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET; >> >> - wq_watchdog_touch(-1); >> >> + wq_watchdog_touch(cpu); >> > >> >Note that wq_watchdog_touch(cpu) newly always updates >> >wq_watchdog_touched. This cycle will set the same jiffies >> >value cpu-times to the same variable. >> > >> Although there is a bit of redundancy here, but the most concise way of >> implementation, and it is certain that it will not affect performance. >> Another way to implement is wq_watchdog_touch() remain unchanged, but need to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs(): notrace void touch_softlockup_watchdog(void) { touch_softlockup_watchdog_sched(); wq_watchdog_touch(raw_smp_processor_id()); + wq_watchdog_touch(-1); } void touch_all_softlockup_watchdogs(void) * update as well, the only side effect might be a cycle delay for * the softlockup check. */ - for_each_cpu(cpu, &watchdog_allowed_mask) + for_each_cpu(cpu, &watchdog_allowed_mask) { per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET; + wq_watchdog_touch(cpu); + } wq_watchdog_touch(-1); } So wq_watchdog_touched will not get updated many times, which do you think is better, Petr? >> >> >> >> void touch_softlockup_watchdog_sync(void) >> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c >> >> index 0d150da..be08295 >> >> --- a/kernel/workqueue.c >> >> +++ b/kernel/workqueue.c >> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu) >> >> { >> >> if (cpu >= 0) >> >> per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies; >> >> - else >> >> - wq_watchdog_touched = jiffies; >> >> + >> >> + wq_watchdog_touched = jiffies; >> >> } >> >> >> >> static void wq_watchdog_set_thresh(unsigned long thresh) >> > >> >This last hunk is enough to fix the problem. wq_watchdog_touched will >> >get updated also from cpu-specific touch_softlockup_watchdog(). >> >> Not enough in fact, because wq_watchdog_touched was updated in >> wq_watchdog_touch(), >> so wq_watchdog_touched can no longer be used to judge the bound pool, we >> must update >> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound >> judgment. > >I see. Your patch makes sense. > >I would just improve the commit message. It should better explain >why all the twists are needed. I would write something like: > ><proposal> >Subject: workqueue/watchdog: Make unbound workqueues aware of >touch_softlockup_watchdog() > >There are two workqueue-specific watchdog timestamps: > > + @wq_watchdog_touched_cpu (per-CPU) updated by > touch_softlockup_watchdog() > > + @wq_watchdog_touched (global) updated by > touch_all_softlockup_watchdogs() > >watchdog_timer_fn() checks only the global @wq_watchdog_touched for >unbound workqueues. As a result, unbound workqueues are not aware >of touch_softlockup_watchdog(). The watchdog might report a stall >even when the unbound workqueues are blocked by a known slow code. > >Solution: > >touch_softlockup_watchdog() must touch also the global >@wq_watchdog_touched timestamp. > >The global timestamp can not longer be used for bound workqueues >because it is updated on all CPUs. Instead, bound workqueues >have to check only @wq_watchdog_touched_cpu and these timestamp >has to be updated for all CPUs in touch_all_softlockup_watchdogs(). > >Beware: > >The change might cause the opposite problem. An unbound workqueue >might get blocked on CPU A because of a real softlockup. The workqueue >watchdog would miss it when the timestamp got touched on CPU B. > >It is acceptable because softlockups are detected by softlockup >watchdog. The workqueue watchdog is there to detect stalls where >a work never finishes, for example, because of dependencies of works >queued into the same workqueue. ></proposal> > >How does that sound?
It's fine to me, what's your opinion, Tejun? Thanks, Qing > >Best Regards, >Petr