Re: heartbeat panic by heavy traffic

2023-09-15 Thread Manuel Bouyer
On Fri, Sep 15, 2023 at 02:00:31PM -, Michael van Elst wrote:
> bou...@antioche.eu.org (Manuel Bouyer) writes:
> 
> >But the clock softint shouldn't be locked out for 16s, ever.
> 
> Then the clock softint must have a higher priority than
> everything else including hard interrupts.
> 
> Obviously that's not how the system is designed, there
> are no limits on how long specific events may take and
> thus no guarantee for lower priority tasks to actually
> execute within a certain time. That would be some kind
> of real-time system.

But obviously such events are not expected to take a long time, or
they would have been deferred to lower priority, preemptible tasks.
Letting such events run for a long time wedges the system.

I still maintain that the bug here is the network soft interrupt running
for such a long time, without giving other tasks a chance to run.

> 
> Such systems also rarely panic if they detect a violation
> of their rules.
> 
> In any case, locking out lower priority tasks by an
> overwhelmed network layer probably isn't the bug that
> we look for.

I disagree. And the heartbeat panic is here to help locate such bugs.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: heartbeat panic by heavy traffic

2023-09-15 Thread Michael van Elst
bou...@antioche.eu.org (Manuel Bouyer) writes:

>But the clock softint shouldn't be locked out for 16s, ever.

Then the clock softint must have a higher priority than
everything else including hard interrupts.

Obviously that's not how the system is designed, there
are no limits on how long specific events may take and
thus no guarantee for lower priority tasks to actually
execute within a certain time. That would be some kind
of real-time system.

Such systems also rarely panic if they detect a violation
of their rules.

In any case, locking out lower priority tasks by an
overwhelmed network layer probably isn't the bug that
we look for.



Re: heartbeat panic by heavy traffic

2023-09-15 Thread Manuel Bouyer
On Fri, Sep 15, 2023 at 09:19:04AM -, Michael van Elst wrote:
> mar...@duskware.de (Martin Husemann) writes:
> 
> >On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
> >> I think it would be good to change the default behavior from
> >> panic to something else, because the GENERIC kernel enables HEARTBEAT
> >> by default. One idea is to print a warning message at sufficient
> >> intervals.
> 
> >I disagree. It is very important that we fix the underlying problem
> >instead. Without heartbeat, this behaviour is still visible (but
> >undiagnosable).
> 
> The crash here comes from how the network stack operates. Running at
> a higher priority, it locks out the lower priority clock softint
> and heartbeat detects that and crashes the system intentionally.

But the clock softint shouldn't be locked out for 16s, ever.
It means that userland processes are stuck too, as well as kernel threads.

This is a real bug, the network stack should be fixed to relax at
periodic intervals.
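
To make "relax at periodic intervals" concrete, here is a minimal userland
sketch of a bounded work loop. It is not NetBSD code: struct pktq,
pktq_run_bounded() and the budget value are hypothetical names chosen for
illustration. The idea is only that the handler processes a limited number
of packets per invocation and then returns, so that lower priority work
such as the clock softint can run before the handler is rescheduled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet queue, reduced to two counters for illustration. */
struct pktq {
	size_t pending;		/* packets waiting to be processed */
	size_t processed;	/* packets handled so far */
};

/*
 * Process at most `budget` packets, then return instead of draining the
 * whole queue.  The return value is the number of packets still pending;
 * a non-zero result means the caller should reschedule the handler
 * rather than keep looping at high priority.
 */
static size_t
pktq_run_bounded(struct pktq *q, size_t budget)
{
	assert(budget > 0);

	while (q->pending > 0 && budget > 0) {
		q->pending--;		/* stand-in for real packet processing */
		q->processed++;
		budget--;
	}
	return q->pending;
}
```

With 1000 packets queued and a budget of 256, the first call handles 256
packets and reports 744 still pending. How a real handler would reschedule
itself is left open here.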

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: heartbeat panic by heavy traffic

2023-09-15 Thread Michael van Elst
mar...@duskware.de (Martin Husemann) writes:

>On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
>> I think it would be good to change the default behavior from
>> panic to something else, because the GENERIC kernel enables HEARTBEAT
>> by default. One idea is to print a warning message at sufficient intervals.

>I disagree. It is very important that we fix the underlying problem
>instead. Without heartbeat, this behaviour is still visible (but undiagnosable).

The crash here comes from how the network stack operates. Running at
a higher priority, it locks out the lower priority clock softint
and heartbeat detects that and crashes the system intentionally.

I don't consider that useful even in a test environment.
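
For readers unfamiliar with the mechanism described above, the detection
can be sketched roughly as follows. This is a simplified userland model and
not the actual heartbeat implementation; struct cpu_heart, heart_beat() and
heart_check() are hypothetical names. The monitored CPU bumps a counter
from its low priority softint, and a watcher checks once per second whether
the counter has moved. The 16 second limit is the value seen in the panic
message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU heartbeat state. */
struct cpu_heart {
	uint64_t beats;		/* bumped by the CPU's own low priority softint */
	uint64_t last_seen;	/* value of beats when progress was last observed */
	unsigned stale_secs;	/* consecutive checks without progress */
};

/* Called from the monitored CPU whenever its softint gets to run. */
static void
heart_beat(struct cpu_heart *h)
{
	h->beats++;
}

/*
 * Called once per second by a watcher.  Returns true once the counter
 * has not advanced for `limit_secs` consecutive checks, which is the
 * point where a real system would warn or panic.
 */
static bool
heart_check(struct cpu_heart *h, unsigned limit_secs)
{
	if (h->beats != h->last_seen) {
		h->last_seen = h->beats;	/* progress: reset the timer */
		h->stale_secs = 0;
		return false;
	}
	h->stale_secs++;
	return h->stale_secs >= limit_secs;
}
```

If a higher priority handler never returns, heart_beat() is never called
and heart_check() starts returning true after the limit is reached.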



Re: heartbeat panic by heavy traffic

2023-09-15 Thread Martin Husemann
On Fri, Sep 15, 2023 at 12:17:58PM +0900, Masanobu SAITOH wrote:
> I think it would be good to change the default behavior from
> panic to something else, because the GENERIC kernel enables HEARTBEAT
> by default. One idea is to print a warning message at sufficient intervals.

I disagree. It is very important that we fix the underlying problem
instead. Without heartbeat, this behaviour is still visible (but undiagnosable).

Martin


heartbeat panic by heavy traffic

2023-09-14 Thread Masanobu SAITOH
Hi.

I can see the following heartbeat panic when a machine is forwarding
a heavy load of short packets:

[ 745.0068385] cpu14: found cpu15 heart stopped beating after 16 seconds
[ 745.0068385] panic: cpu15: softints stuck for 16 seconds
[ 745.0168386] cpu15: Begin traceback...
[ 745.0168386] cpu14: found cpu15 heart stopped beating after 16 seconds
[ 745.0268387] vpanic() at cpu14: found cpu15 heart stopped beating after 16 seconds
[ 745.0268387] netbsd:vpanic+0x173
[ 745.0368390] cpu14: found cpu15 heart stopped beating after 16 seconds
[ 745.0368390] panic() at cpu14: found cpu15 heart stopped beating after 16 seconds
[ 745.0468390] netbsd:panic+0x3c
[ 745.0468390] heartbeat() at netbsd:heartbeat+0x353
[ 745.0568392] hardclock() at netbsd:hardclock+0x8b
[ 745.0668393] Xresume_lapic_ltimer() at netbsd:Xresume_lapic_ltimer+0x1e
[ 745.0668393] --- interrupt ---
[ 745.0768393] psref_release() at netbsd:psref_release+0x83
[ 745.0768393] ipintr() at netbsd:ipintr+0xef
[ 745.0868396] softint_dispatch() at netbsd:softint_dispatch+0x103
[ 745.0868396] DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0x8288589fc0f0
[ 745.0968395] Xsoftintr() at netbsd:Xsoftintr+0x4c
[ 745.0968395] --- interrupt ---
[ 745.1068397] f9faeac0f5baeac4:
[ 745.1068397] cpu15: End traceback...
[ 745.1068397] fatal breakpoint trap in supervisor mode
[ 745.1068397] trap type 1 code 0 rip 0x80235425 cs 0x8 rflags 0x202 cr2 0 ilevel 0x7 rsp 0x8288589fbc68
[ 745.1268401] curlwp 0xd8070facf6c0 pid 0.175 lowest kstack 0x8288589f72c0
Stopped in pid 0.175 (system) at netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x173
panic() at netbsd:panic+0x3c
heartbeat() at netbsd:heartbeat+0x353
hardclock() at netbsd:hardclock+0x8b
Xresume_lapic_ltimer() at netbsd:Xresume_lapic_ltimer+0x1e
--- interrupt ---
psref_release() at netbsd:psref_release+0x83
ipintr() at netbsd:ipintr+0xef
softint_dispatch() at netbsd:softint_dispatch+0x103
DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0x8288589fc0f0
Xsoftintr() at netbsd:Xsoftintr+0x4c
(snip)

The wm and ixg drivers have a hw.{wm,ixg}N.txrx_workqueue sysctl.
If we set it from 0 to 1, we can avoid the panic. Many other drivers
have no way to avoid the problem.

I think it would be good to change the default behavior from
panic to something else, because the GENERIC kernel enables HEARTBEAT
by default. One idea is to print a warning message at sufficient intervals.
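
As a rough illustration of the "warning at sufficient intervals" idea, a
rate limiter could look like the sketch below. This is only a userland
model with hypothetical names (struct warn_limit, warn_allowed()) and a
caller-supplied clock in seconds; it is not a patch against the heartbeat
code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rate-limit state for a repeated warning. */
struct warn_limit {
	uint64_t last_warn;	/* time of the last warning, in seconds */
	uint64_t interval;	/* minimum seconds between two warnings */
	bool warned;		/* false until the first warning was issued */
};

/*
 * Returns true if a warning may be printed at time `now`, and records
 * that time.  Returns false while less than `interval` seconds have
 * passed since the previous warning.
 */
static bool
warn_allowed(struct warn_limit *w, uint64_t now)
{
	if (w->warned && now - w->last_warn < w->interval)
		return false;
	w->warned = true;
	w->last_warn = now;
	return true;
}
```

With an interval of 60 seconds, a first stall detected at second 745 would
be reported immediately, and repeated detections would stay silent until
second 805.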

 Regards.

-- 
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)