HP DL380p Gen8 interrupt storm

2021-09-03 Thread Havard Eidnes
Hi,

One machine I'm testing NetBSD on feels sort of sluggish, which
is strange because it's got lots of RAM (128 GB) and a pair of
Xeon(R) E5-2650 CPUs, for a total of 16 physical cores and 32
hardware threads with hyperthreading.

It looks like one of the CPUs is spending most of its time on
interrupt processing: "systat vm" often shows * in "Intr", and
I have a constant buzz of 6.3% System CPU:

Proc:r  d  s  Csw  Traps SysCal  Intr   Soft  Fault PAGING   SWAPPING
 1 7557281355 * 64277 in  out   in  out
ops
   6.3% Sy   0.0% Us   0.0% Ni   0.1% In  93.6% Id pages
|||||||||||
=== 2 forks
2 fkppw

Checking further:

stest: {8} vmstat -i
interrupt                total    rate
TLB shootdown          4677209       0
cpu0 timer          1046424629      99
msix2 vec 0           62702425       5
msix6 vec 0            3294854       0
ioapic0 pin 21              84       0
ioapic0 pin 20        21074226       2
ioapic0 pin 17   3344590700017  319462
ioapic0 pin 4            12722       0
Total            3345728886166  319570
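
The table above can also be ranked mechanically. What follows is a minimal sketch, not part of the original post: it assumes `vmstat -i` rows always end in two integer columns (total, rate), and the helper names `parse_vmstat_i` and `dominant_source` are made up for illustration.

```python
# Sample data copied from the `vmstat -i` output in the post.
SAMPLE = """\
interrupt                total    rate
TLB shootdown          4677209       0
cpu0 timer          1046424629      99
msix2 vec 0           62702425       5
msix6 vec 0            3294854       0
ioapic0 pin 21              84       0
ioapic0 pin 20        21074226       2
ioapic0 pin 17   3344590700017  319462
ioapic0 pin 4            12722       0
Total            3345728886166  319570
"""


def parse_vmstat_i(text):
    """Return {source name: (total, rate)} for every data row.

    Assumption: a data row ends in two integers; anything else
    (such as the header line) is skipped.
    """
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[-2].isdigit() and parts[-1].isdigit():
            rows[" ".join(parts[:-2])] = (int(parts[-2]), int(parts[-1]))
    return rows


def dominant_source(rows):
    """Return (name, share of the Total rate) for the busiest source."""
    total_rate = rows["Total"][1]
    candidates = [(name, vals) for name, vals in rows.items() if name != "Total"]
    name, (_, rate) = max(candidates, key=lambda item: item[1][1])
    return name, rate / total_rate


if __name__ == "__main__":
    name, share = dominant_source(parse_vmstat_i(SAMPLE))
    print(f"{name}: {share:.2%} of the total interrupt rate")
```

With the numbers from the post, this singles out "ioapic0 pin 17", which accounts for nearly the entire interrupt rate.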

stest: {9} grep 'ioapic0 pin 17' /var/run/dmesg.boot
pciide0: using ioapic0 pin 17 for native-PCI interrupt
stest: {10}

As far as I can tell, the only device attached to pciide0 is the built-in CD drive.
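
As a rough cross-check of the counters, the sketch below estimates uptime from the "cpu0 timer" total and derives the average pin 17 rate from it. It is not from the original post. It assumes the cpu0 timer interrupt fires at 100 Hz, which the dmesg line "Timecounters tick every 10.000 msec" suggests but does not prove. The function names are made up for illustration.

```python
# Counter values copied from the `vmstat -i` output in the post.
TIMER_TOTAL = 1046424629      # "cpu0 timer" total
PIN17_TOTAL = 3344590700017   # "ioapic0 pin 17" total
REPORTED_PIN17_RATE = 319462  # rate column for "ioapic0 pin 17"

# Assumption: the timer interrupt runs at 100 Hz (10 ms tick).
HZ = 100


def uptime_seconds(timer_total, hz=HZ):
    """Estimate uptime from the number of timer interrupts taken."""
    return timer_total / hz


def average_rate(total, uptime_s):
    """Average interrupts per second over the whole uptime."""
    return total / uptime_s


if __name__ == "__main__":
    up = uptime_seconds(TIMER_TOTAL)
    rate = average_rate(PIN17_TOTAL, up)
    print(f"estimated uptime: {up / 86400:.1f} days")
    print(f"average pin 17 rate: {rate:.0f}/s (vmstat reports {REPORTED_PIN17_RATE}/s)")
```

Under that assumption the machine has been up for roughly four months, and the average pin 17 rate over that whole period is within a percent of the reported rate. So the storm appears to have been running since boot rather than having started recently.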

Full dmesg attached below.

Any hints about what's going on and how to further diagnose and
eventually cure it?

Regards,

- Håvard

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
2018, 2019, 2020, 2021 The NetBSD Foundation, Inc.  All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California.  All rights reserved.

NetBSD 9.99.81 (GENERIC) #2: Wed May  5 11:59:20 UTC 2021
h...@stest.urc.uninett.no:/usr/obj/sys/arch/amd64/compile/GENERIC
total memory = 127 GB
avail memory = 123 GB
entropy: entering seed from bootloader with 256 bits of entropy
entropy: ready
timecounter: Timecounters tick every 10.000 msec
Kernelized RAIDframe activated
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
HP ProLiant DL380p Gen8
mainbus0 (root)
ACPI: RSDP 0x000F4F00 24 (v02 HP)
ACPI: XSDT 0xBDDAED00 EC (v01 HP ProLiant 0002 ?? 162E)
ACPI: FACP 0xBDDAEE40 F4 (v03 HP ProLiant 0002 ?? 162E)
Firmware Warning (ACPI): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20210331/tbfadt-742)
Firmware Warning (ACPI): Invalid length for FADT/Pm2ControlBlock: 32, using default 8 (20210331/tbfadt-742)
ACPI: DSDT 0xBDDAEF40 0026DC (v01 HP DSDT 0001 INTL 20030228)
ACPI: FACS 0xBDDAC140 40
ACPI: SPCR 0xBDDAC180 50 (v01 HP SPCRRBSU 0001 ?? 162E)
ACPI: MCFG 0xBDDAC200 3C (v01 HP ProLiant 0001)
ACPI: HPET 0xBDDAC240 38 (v01 HP ProLiant 0002 ?? 162E)
ACPI:  0xBDDAC280 64 (v02 HP ProLiant 0002 ?? 162E)
ACPI: SPMI 0xBDDAC300 40 (v05 HP ProLiant 0001 ?? 162E)
ACPI:  0xBDDAC340 000230 (v01 HP ProLiant 0001 ?? 162E)
ACPI: APIC 0xBDDAC580 00026A (v01 HP ProLiant 0002)
ACPI: SRAT 0xBDDAC800 000750 (v01 HP Proliant 0001 ?? 162E)
ACPI:  0xBDDACF80 000176 (v01 HP ProLiant 0001 ?? 162E)
ACPI:  0xBDDAD100 30 (v01 HP ProLiant 0001 ?? 162E)
ACPI:  0xBDDAD140 BC (v01 HP ProLiant 0001 ?? 162E)
ACPI: DMAR 0xBDDAD200 000558 (v01 HP ProLiant 0001 ?? 162E)
ACPI:  0xBDDAEC40 30 (v01 HP ProLiant 0001)
ACPI: PCCT 0xBDDAEC80 6E (v01 HP Proliant 0001 PH 504D)
ACPI: SSDT 0xBDDB1640 0007EA (v01 HP DEV_PCI1 0001 INTL 20120503)
ACPI: SSDT 0xBDDB1E40 000103 (v03 HP CRSPCI0 0002 HP 0001)
ACPI: SSDT 0xBDDB1F80 98 (v03 HP CRSPCI1 0002 HP 0001)
ACPI: SSDT 0xBDDB2040 00038A (v02 HP riser0 0002 INTL 20030228)
ACPI: SSDT 0xBDDB2400 000536 (v03 HP riser1a 0002 INTL 20030228)
ACPI: SSDT 0xBDDB2940 000537 (v03 HP riser2a 0002 INTL 20030228)
ACPI: SSDT 0xBDDB2E80 000BB9 (v01 HP pcc 0001 INTL 20120503)
ACPI: SSDT 0xBDDB3A40 000377 (v01 HP pmab 0001 INTL 20120503)
ACPI: SSDT 0xBDDB3DC0 005524 (v01 HP pcc2 0001 INTL 20120503)
ACPI: SSDT 0xBDDB9300 004604 (v01 INTEL PPM RCM 0001 INTL 20061109)
ACPI: 11 ACPI AML tables successfully acquired and loaded
io

Re: HP DL380p Gen8 interrupt storm

2021-09-03 Thread Paul Goyette

Try booting with the CD drive disabled (via userconf)






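+--------------------+--------------------------+----------------------+
| Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:    |
| (Retired)          | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com    |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org  |
|                    |                          | pgoyett...@gmail.com |
+--------------------+--------------------------+----------------------+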

Re: HP DL380p Gen8 interrupt storm

2021-09-03 Thread Havard Eidnes
> Try booting with the CD drive disabled (via userconf)

With "disable cd*", basically no change:

stest: {5} vmstat -i
interrupt           total    rate
TLB shootdown        1693      28
cpu0 timer           5716      95
msix2 vec 0          1465      24
msix6 vec 0          2438      40
ioapic0 pin 21         45       0
ioapic0 pin 20         52       0
ioapic0 pin 17   10888895  181481
ioapic0 pin 4        1136      18
Total            10901440  181690

stest: {6} 

With "disable pciide*" as well, the interrupt storm is gone:

stest: {1} vmstat -i
interrupt           total    rate
TLB shootdown        1736      27
cpu0 timer           6120      95
msix2 vec 0          2168      33
msix6 vec 0          2559      39
ioapic0 pin 21         44       0
ioapic0 pin 20         83       1
ioapic0 pin 4        1153      18
Total               13863     216

stest: {2} 
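
The two boots can be compared directly on their per-source rates. This is a small sketch, not part of the original message: the dictionaries hold only the rate columns read from the two `vmstat -i` listings in this message, and the helper names are hypothetical.

```python
# Interrupts per second with only "disable cd*" in effect.
BEFORE = {
    "TLB shootdown": 28,
    "cpu0 timer": 95,
    "msix2 vec 0": 24,
    "msix6 vec 0": 40,
    "ioapic0 pin 21": 0,
    "ioapic0 pin 20": 0,
    "ioapic0 pin 17": 181481,
    "ioapic0 pin 4": 18,
}

# Interrupts per second with "disable pciide*" added.
AFTER = {
    "TLB shootdown": 27,
    "cpu0 timer": 95,
    "msix2 vec 0": 33,
    "msix6 vec 0": 39,
    "ioapic0 pin 21": 0,
    "ioapic0 pin 20": 1,
    "ioapic0 pin 4": 18,
}


def vanished_sources(before, after):
    """Interrupt sources present in the first boot but not the second."""
    return sorted(set(before) - set(after))


def total_rate(rates):
    """Sum of the per-source rates (may differ slightly from vmstat's
    own Total row, presumably because each row is rounded)."""
    return sum(rates.values())


if __name__ == "__main__":
    print("vanished:", vanished_sources(BEFORE, AFTER))
    print("rate before:", total_rate(BEFORE), "after:", total_rate(AFTER))
```

The only source that disappears is "ioapic0 pin 17", and the summed rate drops by roughly three orders of magnitude, which matches the observation that disabling pciide removes the storm.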

Seems disk I/O is quite a bit snappier now.

Looks like a custom kernel is in the cards for this one.

(I had to sneak in "-c" via /boot.cfg, because the second-stage
boot loader didn't want to listen to my serial console input,
while the first-stage boot selector did...)

As for the actual root cause: input sought.

Thanks,

- Håvard