Re: Kernel (9.99.44) responsiveness issues

2020-02-04 Thread Andrew Doran
On Sun, Feb 02, 2020 at 01:02:58PM +0100, Kamil Rytarowski wrote:

> I keep observing responsiveness issues on NetBSD-current. This started
> happening in the last 2 months.
> 
> Whenever I start building something with -j${CORES}, I have significant
> delays of responsiveness in other applications.

What does your disk I/O look like during this time?  Are you using ffs
logging?  Do you have a tmpfs /tmp or a custom TMPDIR set?  You have enough
CPUs there to hammer the disk with a compile job given the right
circumstances.

At the end of the crash(8) manual page someone has added a nice recipe to
get backtraces from all of the kernel stacks.  If you add a "grep tstile"
to that pipeline, you can see where these particular threads are blocking.
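
For illustration only, the filtering idea can be sketched in Python. This is
not the crash(8) recipe itself (that recipe is not reproduced here); it is a
minimal sketch that picks the rows in the "tstile" state out of top(1)-style
text. The sample rows are taken from the top(1) output quoted further down in
this message, with column spacing normalised; the helper name tstile_rows is
hypothetical.

```python
import re

# Sample rows taken from the top(1) output quoted in this thread
# (column spacing normalised for readability).
SAMPLE = """\
15823 root      85    0    16M 2508K poll/1     0:01  0.00%  0.00% nbmake
25446 kamil     43    0    28M 2452K CPU/0      0:00  0.00%  0.00% top
14117 root     114    0    27M 3356K tstile/0   0:00  0.00%  0.00% ld
29088 root     114    0    27M 3280K tstile/3   0:00  0.00%  0.00% ld
"""

# Matches a STATE column of the form "tstile/<cpu>".
TSTILE = re.compile(r"\btstile/(\d+)\b")


def tstile_rows(text):
    """Return (pid, cpu, command) for every row waiting on a turnstile."""
    rows = []
    for line in text.splitlines():
        match = TSTILE.search(line)
        if match is None:
            continue  # not blocked on a turnstile
        fields = line.split()
        rows.append((int(fields[0]), int(match.group(1)), fields[-1]))
    return rows


if __name__ == "__main__":
    for pid, cpu, command in tstile_rows(SAMPLE):
        print(pid, cpu, command)
```

This only shows which processes are waiting on a turnstile. Finding out where
they block still needs the kernel backtraces from the crash(8) recipe.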

Andrew

> load averages:  2.69,  5.56,  6.22;   up 0+01:32:42 12:12:34
> 71 processes: 69 sleeping, 2 on CPU
> CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.1% interrupt, 99.8% idle
> Memory: 19G Act, 9639M Inact, 416K Wired, 34M Exec, 19G File, 43M Free
> Swap: 64G Total, 64G Free
> 
>   PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
>     0 root       0    0     0K   87M CPU/7      0:40  0.00%  0.49% [system]
> 15823 root      85    0    16M 2508K poll/1     0:01  0.00%  0.00% nbmake
> 25446 kamil     43    0    28M 2452K CPU/0      0:00  0.00%  0.00% top
> 14117 root     114    0    27M 3356K tstile/0   0:00  0.00%  0.00% ld
> 29088 root     114    0    27M 3280K tstile/3   0:00  0.00%  0.00% ld
> 20839 root     114    0    27M 3208K tstile/1   0:00  0.00%  0.00% ld
> 19550 root     114    0    26M 3184K tstile/6   0:00  0.00%  0.00% ld
> 13716 root     114    0    26M 3104K tstile/2   0:00  0.00%  0.00% ld
>  8758 root     114    0    26M 3048K tstile/7   0:00  0.00%  0.00% ld
>   240 root     114    0    26M 2580K tstile/0   0:00  0.00%  0.00% ld
> 
> I can see in top(1) that processes are locked in turnstiles and load
> goes down.
> 
> $ uname -a
> NetBSD chieftec 9.99.44 NetBSD 9.99.44 (GENERIC) #0: Fri Jan 31 19:26:07
> CET 2020
> root@chieftec:/public/netbsd-root/sys/arch/amd64/compile/GENERIC amd64
> 
> 135 kamil@chieftec /home/kamil $ cpuctl list
> 
> Num  HwId Unbound LWPs Interrupts Last change              #Intr
> ---- ---- ------------ ---------- ------------------------ -----
> 0    0    online       intr       Sun Feb  2 10:40:26 2020 13
> 1    2    online       intr       Sun Feb  2 10:40:26 2020 0
> 2    4    online       intr       Sun Feb  2 10:40:26 2020 0
> 3    6    online       intr       Sun Feb  2 10:40:26 2020 0
> 4    1    online       intr       Sun Feb  2 10:40:26 2020 0
> 5    3    online       intr       Sun Feb  2 10:40:26 2020 0
> 6    5    online       intr       Sun Feb  2 10:40:26 2020 0
> 7    7    online       intr       Sun Feb  2 10:40:26 2020 0
> 136 kamil@chieftec /home/kamil $ cpuctl identify 0
> Cannot bind to target CPU.  Output may not accurately describe the target.
> Run as root to allow binding.
> 
> cpu0: highest basic info 000d
> cpu0: highest extended info 8008
> cpu0: "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz"
> cpu0: Intel Xeon E3-1200v2 and 3rd gen core, Ivy Bridge (686-class),
> 3392.48 MHz
> cpu0: family 0x6 model 0x3a stepping 0x9 (id 0x306a9)
> cpu0: features 0xbfebfbff
> cpu0: features1 0x7fbae3ff
> cpu0: features2 0x28100800
> cpu0: features3 0x1
> cpu0: features5 0x281
> cpu0: xsave features 0x7
> cpu0: xsave instructions 0x1
> cpu0: xsave area size: current 832, maximum 832, xgetbv enabled
> cpu0: enabled xsave 0x7
> cpu0: I-cache 32KB 64B/line 8-way, D-cache 32KB 64B/line 8-way
> cpu0: L2 cache 256KB 64B/line 8-way
> cpu0: L3 cache 8MB 64B/line 16-way
> cpu0: 64B prefetching
> cpu0: ITLB 64 4KB entries 4-way, 2M/4M: 8 entries
> cpu0: DTLB 64 4KB entries 4-way, 2M/4M: 32 entries (L0)
> cpu0: L2 STLB 512 4KB entries 4-way
> cpu0: Initial APIC ID 1
> cpu0: Cluster/Package ID 0
> cpu0: Core ID 0
> cpu0: SMT ID 1
> cpu0: MONITOR/MWAIT extensions 0x3
> cpu0: monitor-line size 64
> cpu0: C1 substates 2
> cpu0: C2 substates 1
> cpu0: C3 substates 1
> cpu0: DSPM-eax 0x77
> cpu0: DSPM-ecx 0x9
> cpu0: SEF highest subleaf 
> cpu0: Perfmon-eax 0x7300403
> cpu0: Perfmon-edx 0x603
> cpu0: microcode version 0x15, platform ID 1
> 





Re: panic: softint screwup

2020-02-04 Thread Andrew Doran
On Tue, Feb 04, 2020 at 07:03:28AM -0400, Jared McNeill wrote:

> First time seeing this one.. an arm64 board sitting idle at the login prompt
> rebooted itself with this panic. Unfortunately the default ddb.onpanic=0
> strikes again and I can't get any more information than this:

I added this recently to replace a vague KASSERT.  Thanks for grabbing the
output.

> [ 364.3342263] curcpu=0, spl=4 curspl=7
> [ 364.3342263] onproc=0x00237f743080 => l_stat=7 l_flag=2201 l_cpu=0
> [ 364.3342263] curlwp=0x00237f71e580 => l_stat=1 l_flag=0200 l_cpu=0
> [ 364.3342263] pinned=0x00237f71e100 => l_stat=7 l_flag=0200 l_cpu=0
> [ 364.3342263] panic: softint screwup
> [ 364.3342263] cpu0: Begin traceback...
> [ 364.3342263] trace fp ffc101da7be0
> [ 364.3342263] fp ffc101da7c00 vpanic() at ffc0004ad728 netbsd:vpanic+0x160
> [ 364.3342263] fp ffc101da7c70 panic() at ffc0004ad81c netbsd:panic+0x44
> [ 364.3342263] fp ffc101da7d40 softint_dispatch() at ffc00047bda4 netbsd:softint_dispatch+0x5c4
> [ 364.3342263] fp ffc101d9fc30 cpu_switchto_softint() at ffc85198 netbsd:cpu_switchto_softint+0x68
> [ 364.3342263] fp ffc101d9fc80 splx() at ffc040d4 netbsd:splx+0xbc
> [ 364.3342263] fp ffc101d9fcb0 callout_softclock() at ffc000489e04 netbsd:callout_softclock+0x36c
> [ 364.3342263] fp ffc101d9fd40 softint_dispatch() at ffc00047b8dc netbsd:softint_dispatch+0xfc
> [ 364.3342263] fp ffc101d3fcc0 cpu_switchto_softint() at ffc85198 netbsd:cpu_switchto_softint+0x68
> [ 364.3342263] fp ffc101d3fdf8 cpu_idle() at ffc86128 netbsd:cpu_idle+0x58
> [ 364.3342263] fp ffc101d3fe40 idle_loop() at ffc0004546a4 netbsd:idle_loop+0x174

Something has cleared the LW_RUNNING flag on softclk/0 between where it is
set (unlocked) at line 884 of kern_softint.c and callout_softclock(). 
Should not be possible.  Hmm.

Andrew


Automatic dump possibility detection [was: panic: softint screwup]

2020-02-04 Thread Martin Husemann
On Tue, Feb 04, 2020 at 07:03:28AM -0400, Jared McNeill wrote:
> [...] Unfortunately the default ddb.onpanic=0
> strikes again and I can't get any more information than this:
[..]
> [ 364.3342263] dump to dev 92,33 not possible
> [ 364.3342263] rebooting...

I wonder if we should try to find out if a dump would be possible and, if
not, override the default for ddb.onpanic.

I'm not sure how to do that in an rc.d script with no further kernel help
right now though. Any good ideas? Maybe swapctl could gain some new
op to query this?
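
As a rough illustration of the policy being proposed (not an existing
mechanism), the decision could be sketched in Python as below. The
dump_possible argument is a placeholder for whatever query interface might
eventually exist; nothing in this thread says such an interface exists today.
The function names are hypothetical. The sketch assumes that ddb.onpanic=1
means "enter ddb on panic" and that the value would be applied with
sysctl -w.

```python
def choose_ddb_onpanic(dump_possible, default=0):
    """Pick the ddb.onpanic value to apply at boot.

    dump_possible is a hypothetical input: True if the kernel could
    write a crash dump, False otherwise. How to obtain it is the open
    question in this thread.
    """
    # Keep the default when a dump can be taken; otherwise drop into
    # ddb so that some diagnostics survive the panic.
    return default if dump_possible else 1


def sysctl_argv(value):
    """Build the sysctl(8) command line that would apply the setting."""
    return ["sysctl", "-w", "ddb.onpanic=%d" % value]


if __name__ == "__main__":
    # Example: the machine in this thread reported that a dump was
    # not possible, so the sketch would switch ddb.onpanic to 1.
    print(" ".join(sysctl_argv(choose_ddb_onpanic(dump_possible=False))))
```

The command is only built, not executed, so the sketch is safe to run on any
system.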

Martin


panic: softint screwup

2020-02-04 Thread Jared McNeill
First time seeing this one.. an arm64 board sitting idle at the login 
prompt rebooted itself with this panic. Unfortunately the default 
ddb.onpanic=0 strikes again and I can't get any more information than 
this:


[ 364.3342263] curcpu=0, spl=4 curspl=7
[ 364.3342263] onproc=0x00237f743080 => l_stat=7 l_flag=2201 l_cpu=0
[ 364.3342263] curlwp=0x00237f71e580 => l_stat=1 l_flag=0200 l_cpu=0
[ 364.3342263] pinned=0x00237f71e100 => l_stat=7 l_flag=0200 l_cpu=0
[ 364.3342263] panic: softint screwup
[ 364.3342263] cpu0: Begin traceback...
[ 364.3342263] trace fp ffc101da7be0
[ 364.3342263] fp ffc101da7c00 vpanic() at ffc0004ad728 netbsd:vpanic+0x160
[ 364.3342263] fp ffc101da7c70 panic() at ffc0004ad81c netbsd:panic+0x44
[ 364.3342263] fp ffc101da7d40 softint_dispatch() at ffc00047bda4 netbsd:softint_dispatch+0x5c4
[ 364.3342263] fp ffc101d9fc30 cpu_switchto_softint() at ffc85198 netbsd:cpu_switchto_softint+0x68
[ 364.3342263] fp ffc101d9fc80 splx() at ffc040d4 netbsd:splx+0xbc
[ 364.3342263] fp ffc101d9fcb0 callout_softclock() at ffc000489e04 netbsd:callout_softclock+0x36c
[ 364.3342263] fp ffc101d9fd40 softint_dispatch() at ffc00047b8dc netbsd:softint_dispatch+0xfc
[ 364.3342263] fp ffc101d3fcc0 cpu_switchto_softint() at ffc85198 netbsd:cpu_switchto_softint+0x68
[ 364.3342263] fp ffc101d3fdf8 cpu_idle() at ffc86128 netbsd:cpu_idle+0x58
[ 364.3342263] fp ffc101d3fe40 idle_loop() at ffc0004546a4 netbsd:idle_loop+0x174
address 0x100 is invalid
address 0xe8 is invalid
[ 364.3342263] cpu0: End traceback...

[ 364.3342263] dump to dev 92,33 not possible
[ 364.3342263] rebooting...