Re:Re: Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-24 Thread 王擎

>On Wed 2021-03-24 10:16:46, 王擎 wrote:
>> 
>> >On Tue 2021-03-23 20:37:35, 王擎 wrote:
>> >> 
>> >> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
>> >> >> When touch_softlockup_watchdog() is called, only 
>> >> >> wq_watchdog_touched_cpu 
>> >> >> updated, while the unbound worker_pool running on its core uses 
>> >> >> wq_watchdog_touched to determine whether locked up. This may be 
>> >> >> mischecked.
>> >> >
>> >> >By other words, unbound workqueues are not aware of the more common
>> >> >touch_softlockup_watchdog() because it updates only
>> >> >wq_watchdog_touched_cpu for the affected CPU. As a result,
>> >> >the workqueue watchdog might report lockup in unbound workqueue
>> >> >even though it is blocked by a known slow code.
>> >> 
>> >> Yes, this is the problem I'm talking about.
>> >
>> >I thought more about it. This patch prevents a false positive.
>> >Could it bring an opposite problem and hide real problems?
>> >
>> >I mean that an unbound workqueue might get blocked on CPU A
>> >because of a real softlockup. But we might not notice it because
>> >CPU B is touched. Well, there are other ways how to detect
>> >this situation, e.g. the softlockup watchdog.
>> >
>> >
>> >> >> My suggestion is to update both when touch_softlockup_watchdog() is 
>> >> >> called, 
>> >> >> use wq_watchdog_touched_cpu to check bound, and use 
>> >> >> wq_watchdog_touched 
>> >> >> to check unbound worker_pool.
>> >> >> 
>> >> >> Signed-off-by: Wang Qing 
>> >> >> ---
>> >> >>  kernel/watchdog.c  |  5 +++--
>> >> >>  kernel/workqueue.c | 17 ++---
>> >> >>  2 files changed, 9 insertions(+), 13 deletions(-)
>> >> >> 
>> >> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> >> >> index 7110906..107bc38
>> >> >> --- a/kernel/watchdog.c
>> >> >> +++ b/kernel/watchdog.c
>> >> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>> >> >> * update as well, the only side effect might be a cycle delay 
>> >> >> for
>> >> >> * the softlockup check.
>> >> >> */
>> >> >> -  for_each_cpu(cpu, _allowed_mask)
>> >> >> +  for_each_cpu(cpu, _allowed_mask) {
>> >> >>per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>> >> >> -  wq_watchdog_touch(-1);
>> >> >> +  wq_watchdog_touch(cpu);
>> >> >
>> >> >Note that wq_watchdog_touch(cpu) newly always updates
>> >> >wq_watchdog_touched. This cycle will set the same jiffies
>> >> >value cpu-times to the same variable.
>> >> >
>> >> Although there is a bit of redundancy here, but the most concise way of 
>> >> implementation, and it is certain that it will not affect performance.
>> >> 
>> Another way to implement is wq_watchdog_touch() remain unchanged, but need 
>> to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():
>> notrace void touch_softlockup_watchdog(void)
>> {
>>  touch_softlockup_watchdog_sched();
>>  wq_watchdog_touch(raw_smp_processor_id());
>> + wq_watchdog_touch(-1);
>> }
>> void touch_all_softlockup_watchdogs(void)
>>  * update as well, the only side effect might be a cycle delay for
>>  * the softlockup check.
>> */
>>  -   for_each_cpu(cpu, _allowed_mask)
>> +for_each_cpu(cpu, _allowed_mask) {
>>  per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>>  +   wq_watchdog_touch(cpu);
>>  +   }
>>  wq_watchdog_touch(-1);
>>   }
>> So wq_watchdog_touched will not get updated many times,  
>> which do you think is better, Petr?
>
>I actually prefer the original patch. It makes wq_watchdog_touch()
>easy to use. The complexity is hidden in wq-specific code.
>
>The alternative way updates each timestamp only once but the use
>is more complicated. IMHO, it is more error prone.

I agree, so I will just modify the commit log based on V2 and resend.

Thanks,
Qing
>
>Best Regards,
>Petr




Re: Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-24 Thread Petr Mladek
On Wed 2021-03-24 10:16:46, 王擎 wrote:
> 
> >On Tue 2021-03-23 20:37:35, 王擎 wrote:
> >> 
> >> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> >> >> When touch_softlockup_watchdog() is called, only 
> >> >> wq_watchdog_touched_cpu 
> >> >> updated, while the unbound worker_pool running on its core uses 
> >> >> wq_watchdog_touched to determine whether locked up. This may be 
> >> >> mischecked.
> >> >
> >> >By other words, unbound workqueues are not aware of the more common
> >> >touch_softlockup_watchdog() because it updates only
> >> >wq_watchdog_touched_cpu for the affected CPU. As a result,
> >> >the workqueue watchdog might report lockup in unbound workqueue
> >> >even though it is blocked by a known slow code.
> >> 
> >> Yes, this is the problem I'm talking about.
> >
> >I thought more about it. This patch prevents a false positive.
> >Could it bring an opposite problem and hide real problems?
> >
> >I mean that an unbound workqueue might get blocked on CPU A
> >because of a real softlockup. But we might not notice it because
> >CPU B is touched. Well, there are other ways how to detect
> >this situation, e.g. the softlockup watchdog.
> >
> >
> >> >> My suggestion is to update both when touch_softlockup_watchdog() is 
> >> >> called, 
> >> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
> >> >> to check unbound worker_pool.
> >> >> 
> >> >> Signed-off-by: Wang Qing 
> >> >> ---
> >> >>  kernel/watchdog.c  |  5 +++--
> >> >>  kernel/workqueue.c | 17 ++---
> >> >>  2 files changed, 9 insertions(+), 13 deletions(-)
> >> >> 
> >> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >> >> index 7110906..107bc38
> >> >> --- a/kernel/watchdog.c
> >> >> +++ b/kernel/watchdog.c
> >> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
> >> >>  * update as well, the only side effect might be a cycle delay 
> >> >> for
> >> >>  * the softlockup check.
> >> >>  */
> >> >> -   for_each_cpu(cpu, _allowed_mask)
> >> >> +   for_each_cpu(cpu, _allowed_mask) {
> >> >> per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> >> >> -   wq_watchdog_touch(-1);
> >> >> +   wq_watchdog_touch(cpu);
> >> >
> >> >Note that wq_watchdog_touch(cpu) newly always updates
> >> >wq_watchdog_touched. This cycle will set the same jiffies
> >> >value cpu-times to the same variable.
> >> >
> >> Although there is a bit of redundancy here, but the most concise way of 
> >> implementation, and it is certain that it will not affect performance.
> >> 
> Another way to implement is wq_watchdog_touch() remain unchanged, but need 
> to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():
> notrace void touch_softlockup_watchdog(void)
> {
>   touch_softlockup_watchdog_sched();
>   wq_watchdog_touch(raw_smp_processor_id());
> + wq_watchdog_touch(-1);
> }
> void touch_all_softlockup_watchdogs(void)
>  * update as well, the only side effect might be a cycle delay for
>  * the softlockup check.
> */
>  -for_each_cpu(cpu, _allowed_mask)
> + for_each_cpu(cpu, _allowed_mask) {
>   per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>  +wq_watchdog_touch(cpu);
>  +}
>   wq_watchdog_touch(-1);
>   }
> So wq_watchdog_touched will not get updated many times,  
> which do you think is better, Petr?

I actually prefer the original patch. It makes wq_watchdog_touch()
easy to use. The complexity is hidden in wq-specific code.

The alternative way updates each timestamp only once but the use
is more complicated. IMHO, it is more error prone.

Best Regards,
Petr


Re:Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-23 Thread 王擎

>On Tue 2021-03-23 20:37:35, 王擎 wrote:
>> 
>> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
>> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
>> >> updated, while the unbound worker_pool running on its core uses 
>> >> wq_watchdog_touched to determine whether locked up. This may be 
>> >> mischecked.
>> >
>> >By other words, unbound workqueues are not aware of the more common
>> >touch_softlockup_watchdog() because it updates only
>> >wq_watchdog_touched_cpu for the affected CPU. As a result,
>> >the workqueue watchdog might report lockup in unbound workqueue
>> >even though it is blocked by a known slow code.
>> 
>> Yes, this is the problem I'm talking about.
>
>I thought more about it. This patch prevents a false positive.
>Could it bring an opposite problem and hide real problems?
>
>I mean that an unbound workqueue might get blocked on CPU A
>because of a real softlockup. But we might not notice it because
>CPU B is touched. Well, there are other ways how to detect
>this situation, e.g. the softlockup watchdog.
>
>
>> >> My suggestion is to update both when touch_softlockup_watchdog() is 
>> >> called, 
>> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
>> >> to check unbound worker_pool.
>> >> 
>> >> Signed-off-by: Wang Qing 
>> >> ---
>> >>  kernel/watchdog.c  |  5 +++--
>> >>  kernel/workqueue.c | 17 ++---
>> >>  2 files changed, 9 insertions(+), 13 deletions(-)
>> >> 
>> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> >> index 7110906..107bc38
>> >> --- a/kernel/watchdog.c
>> >> +++ b/kernel/watchdog.c
>> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>> >>* update as well, the only side effect might be a cycle delay for
>> >>* the softlockup check.
>> >>*/
>> >> - for_each_cpu(cpu, _allowed_mask)
>> >> + for_each_cpu(cpu, _allowed_mask) {
>> >>   per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
>> >> - wq_watchdog_touch(-1);
>> >> + wq_watchdog_touch(cpu);
>> >
>> >Note that wq_watchdog_touch(cpu) newly always updates
>> >wq_watchdog_touched. This cycle will set the same jiffies
>> >value cpu-times to the same variable.
>> >
>> Although there is a bit of redundancy here, but the most concise way of 
>> implementation, and it is certain that it will not affect performance.
>> 
Another way to implement is wq_watchdog_touch() remain unchanged, but need 
to modify touch_softlockup_watchdog() and touch_all_softlockup_watchdogs():
notrace void touch_softlockup_watchdog(void)
{
touch_softlockup_watchdog_sched();
wq_watchdog_touch(raw_smp_processor_id());
+ wq_watchdog_touch(-1);
}
void touch_all_softlockup_watchdogs(void)
 * update as well, the only side effect might be a cycle delay for
 * the softlockup check.
*/
 -  for_each_cpu(cpu, _allowed_mask)
+   for_each_cpu(cpu, _allowed_mask) {
per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
 +  wq_watchdog_touch(cpu);
 +  }
wq_watchdog_touch(-1);
  }
So wq_watchdog_touched will not get updated many times,  
which do you think is better, Petr?
>> >>  
>> >>  void touch_softlockup_watchdog_sync(void)
>> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> >> index 0d150da..be08295
>> >> --- a/kernel/workqueue.c
>> >> +++ b/kernel/workqueue.c
>> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>> >>  {
>> >>   if (cpu >= 0)
>> >>   per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
>> >> - else
>> >> - wq_watchdog_touched = jiffies;
>> >> +
>> >> + wq_watchdog_touched = jiffies;
>> >>  }
>> >>  
>> >>  static void wq_watchdog_set_thresh(unsigned long thresh)
>> >
>> >This last hunk is enough to fix the problem. wq_watchdog_touched will
>> >get updated also from cpu-specific touch_softlockup_watchdog().
>> 
>> Not enough in fact, because wq_watchdog_touched was updated in 
>> wq_watchdog_touch(), 
>> so wq_watchdog_touched can no longer be used to judge the bound pool, we 
>> must update 
>> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound 
>> judgment.
>
>I see. Your patch makes sense.
>
>I would just improve the commit message. It should better explain
>why all the twists are needed. I would write something like:
>
>
>Subject: workqueue/watchdog: Make unbound workqueues aware of
>touch_softlockup_watchdog()
>
>There are two workqueue-specific watchdog timestamps:
>
>+ @wq_watchdog_touched_cpu (per-CPU) updated by
>  touch_softlockup_watchdog()
>
>+ @wq_watchdog_touched (global) updated by
>  touch_all_softlockup_watchdogs()
>
>watchdog_timer_fn() checks only the global @wq_watchdog_touched for
>unbound workqueues. As a result, unbound workqueues are not aware
>of touch_softlockup_watchdog(). The watchdog might report a stall
>even when the unbound workqueues are blocked by a known slow code.
>
>Solution:
>
>touch_softlockup_watchdog() must touch also the global
>@wq_watchdog_touched 

Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-23 Thread Petr Mladek
On Tue 2021-03-23 20:37:35, 王擎 wrote:
> 
> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> >> updated, while the unbound worker_pool running on its core uses 
> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
> >
> >By other words, unbound workqueues are not aware of the more common
> >touch_softlockup_watchdog() because it updates only
> >wq_watchdog_touched_cpu for the affected CPU. As a result,
> >the workqueue watchdog might report lockup in unbound workqueue
> >even though it is blocked by a known slow code.
> 
> Yes, this is the problem I'm talking about.

I thought more about it. This patch prevents a false positive.
Could it bring an opposite problem and hide real problems?

I mean that an unbound workqueue might get blocked on CPU A
because of a real softlockup. But we might not notice it because
CPU B is touched. Well, there are other ways how to detect
this situation, e.g. the softlockup watchdog.


> >> My suggestion is to update both when touch_softlockup_watchdog() is 
> >> called, 
> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
> >> to check unbound worker_pool.
> >> 
> >> Signed-off-by: Wang Qing 
> >> ---
> >>  kernel/watchdog.c  |  5 +++--
> >>  kernel/workqueue.c | 17 ++---
> >>  2 files changed, 9 insertions(+), 13 deletions(-)
> >> 
> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >> index 7110906..107bc38
> >> --- a/kernel/watchdog.c
> >> +++ b/kernel/watchdog.c
> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
> >> * update as well, the only side effect might be a cycle delay for
> >> * the softlockup check.
> >> */
> >> -  for_each_cpu(cpu, _allowed_mask)
> >> +  for_each_cpu(cpu, _allowed_mask) {
> >>per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> >> -  wq_watchdog_touch(-1);
> >> +  wq_watchdog_touch(cpu);
> >
> >Note that wq_watchdog_touch(cpu) newly always updates
> >wq_watchdog_touched. This cycle will set the same jiffies
> >value cpu-times to the same variable.
> >
> Although there is a bit of redundancy here, but the most concise way of 
> implementation, and it is certain that it will not affect performance.
> 
> >> +  }
> >>  }
> >>  
> >>  void touch_softlockup_watchdog_sync(void)
> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> >> index 0d150da..be08295
> >> --- a/kernel/workqueue.c
> >> +++ b/kernel/workqueue.c
> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
> >>  {
> >>if (cpu >= 0)
> >>per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> >> -  else
> >> -  wq_watchdog_touched = jiffies;
> >> +
> >> +  wq_watchdog_touched = jiffies;
> >>  }
> >>  
> >>  static void wq_watchdog_set_thresh(unsigned long thresh)
> >
> >This last hunk is enough to fix the problem. wq_watchdog_touched will
> >get updated also from cpu-specific touch_softlockup_watchdog().
> 
> Not enough in fact, because wq_watchdog_touched was updated in 
> wq_watchdog_touch(), 
> so wq_watchdog_touched can no longer be used to judge the bound pool, we must 
> update 
> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound 
> judgment.

I see. Your patch makes sense.

I would just improve the commit message. It should better explain
why all the twists are needed. I would write something like:


Subject: workqueue/watchdog: Make unbound workqueues aware of
touch_softlockup_watchdog()

There are two workqueue-specific watchdog timestamps:

+ @wq_watchdog_touched_cpu (per-CPU) updated by
  touch_softlockup_watchdog()

+ @wq_watchdog_touched (global) updated by
  touch_all_softlockup_watchdogs()

watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by a known slow code.

Solution:

touch_softlockup_watchdog() must touch also the global
@wq_watchdog_touched timestamp.

The global timestamp can not longer be used for bound workqueues
because it is updated on all CPUs. Instead, bound workqueues
have to check only @wq_watchdog_touched_cpu and these timestamp
has to be updated for all CPUs in touch_all_softlockup_watchdogs().

Beware:

The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.

It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.


How does that sound?

Best Regards,
Petr


Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-22 Thread Petr Mladek
On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> updated, while the unbound worker_pool running on its core uses 
> wq_watchdog_touched to determine whether locked up. This may be mischecked.

By other words, unbound workqueues are not aware of the more common
touch_softlockup_watchdog() because it updates only
wq_watchdog_touched_cpu for the affected CPU. As a result,
the workqueue watchdog might report lockup in unbound workqueue
even though it is blocked by a known slow code.

> My suggestion is to update both when touch_softlockup_watchdog() is called, 
> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched 
> to check unbound worker_pool.
> 
> Signed-off-by: Wang Qing 
> ---
>  kernel/watchdog.c  |  5 +++--
>  kernel/workqueue.c | 17 ++---
>  2 files changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 7110906..107bc38
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>* update as well, the only side effect might be a cycle delay for
>* the softlockup check.
>*/
> - for_each_cpu(cpu, _allowed_mask)
> + for_each_cpu(cpu, _allowed_mask) {
>   per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> - wq_watchdog_touch(-1);
> + wq_watchdog_touch(cpu);

Note that wq_watchdog_touch(cpu) newly always updates
wq_watchdog_touched. This cycle will set the same jiffies
value cpu-times to the same variable.

> + }
>  }
>  
>  void touch_softlockup_watchdog_sync(void)
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 0d150da..be08295
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list 
> *unused)
>   continue;
>  
>   /* get the latest of pool and touched timestamps */
> + if (pool->cpu >= 0)
> + touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, 
> pool->cpu));
> + else
> + touched = READ_ONCE(wq_watchdog_touched);
>   pool_ts = READ_ONCE(pool->watchdog_ts);
> - touched = READ_ONCE(wq_watchdog_touched);
>  
>   if (time_after(pool_ts, touched))
>   ts = pool_ts;
>   else
>   ts = touched;
>  
> - if (pool->cpu >= 0) {
> - unsigned long cpu_touched =
> - READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
> -   pool->cpu));
> - if (time_after(cpu_touched, ts))
> - ts = cpu_touched;
> - }
> -
>   /* did we stall? */
>   if (time_after(jiffies, ts + thresh)) {
>   lockup_detected = true;
> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>  {
>   if (cpu >= 0)
>   per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> - else
> - wq_watchdog_touched = jiffies;
> +
> + wq_watchdog_touched = jiffies;
>  }
>  
>  static void wq_watchdog_set_thresh(unsigned long thresh)

This last hunk is enough to fix the problem. wq_watchdog_touched will
get updated also from cpu-specific touch_softlockup_watchdog().

The original patch simplified the logic of wq_watchdog_timer_fn().
But it added un-necessary assignments into
touch_all_softlockup_watchdogs(void).

I do not have strong opinion what solution is better. I slightly
prefer to keep only this last hunk.

Best Regards,
Petr


Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-20 Thread Tejun Heo
On Fri, Mar 19, 2021 at 04:00:36PM +0800, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu 
> updated, while the unbound worker_pool running on its core uses 
> wq_watchdog_touched to determine whether locked up. This may be mischecked.

Can you please elaborate here, preferably with a concrete scenario where the
new code is better?

Thanks.

-- 
tejun