On Wed, Jun 18, 2025 at 7:01 PM Alexander Bluhm <[email protected]> wrote:
>
> On Wed, Jun 18, 2025 at 04:54:34PM -0300, K R wrote:
> > >Synopsis:      server freezes under heavy CPU usage
> > >Category:      kernel
> > >Environment:
> >         System      : OpenBSD 7.7
> >         Details     : OpenBSD 7.7-current (GENERIC.MP) #21: Tue Jun 17 17:40:27 MDT 2025
> >                       [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> >         Architecture: OpenBSD.amd64
> >         Machine     : amd64
> > >Description:
> >
> > This machine is a Dell PowerEdge R440 with 16 CPUs and 128GB of RAM.
> > It freezes under heavy CPU usage, especially with lots of threads.
> > This started with 7.7-release + syspatches but continues with a
> > -current as of today.
> >
> > No panic, nothing, just freezes.  Can't even force into ddb (with
> > ddb.console=1).  During last test, top(1) froze with this last output:
> >
> > load averages: 10.73, 11.12, 10.53    test 16:46:44
> > 125 processes: 93 idle, 32 on processor    up 0 days 00:59:55
> > 16 CPUs: 17.7% user, 51.5% nice, 3.6% sys, 1.1% spin, 1.1% intr, 25.0% idle
> > Memory: Real: 13G/37G act/tot Free: 87G Cache: 22G Swap: 0K/64G
> >
> >   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME     CPU COMMAND
> > 60756 root      64    0   13G   13G onproc/1  -        35:05 100.78% python3.12
> > 27129 root      10   20 9124K 1532K onproc/2  fsleep    8:31  75.98% semaphore
> > 42272 root      10   20 9644K 1540K onproc/4  fsleep    8:36  75.83% semaphore
> > 60257 root      10   20 9644K 1560K onproc/14 fsleep    8:32  74.76% semaphore
> > 27054 root      64    0  384K  328K onproc/7  -         0:26  71.24% rm
> > 58070 root      10    0   15M 4428K sleep/13  fsleep   23:45  36.04% nfdump
> > 11522 root      10   20 9636K 1524K onproc/0  fsleep    8:44  31.93% semaphore
> > 40359 root      10   20 9648K 1556K onproc/2  fsleep    8:44  29.88% semaphore
> > 72237 root      10   20 9632K 1520K onproc/0  fsleep    8:41  27.20% semaphore
> > 42031 root      10   20 9648K 1576K onproc/8  fsleep    8:39  27.10% semaphore
> > 97960 root      10   20 9644K 1536K onproc/8  fsleep    8:39  26.46% semaphore
> > 76525 root      10    0   95M   57M sleep/12  fsleep   10:01  12.84% nfdump
> > 68093 root      10   20   96M   64M sleep/3   fsleep   10:07  12.11% nfdump
> > 94072 root      -5   20   27M   11M sleep/3   biowait   4:42   1.03% pigz
> > 52734 root       2    0 1640K 2740K sleep/4   kqread    0:37   0.98% top
> > 84043 root      10   20   27M   11M sleep/3   inode     2:07   0.34% pigz
> > 95028 root      10   20   27M   11M sleep/4   inode     2:09   0.15% pigz
> > 66823 root      10   20   26M   11M sleep/0   inode     2:07   0.05% pigz
> > 59751 root       2    0 2768K 3244K sleep/0   kqread    0:09   0.05% tmux
> > 58124 root     -22    0    0K    4K sleep/1   -        37:01   0.00% idle1
> > 59513 root     -22    0    0K    4K sleep/2   -        36:27   0.00% idle2
> >
> > Any recommendations on what could help debugging?
>
> Run a witness kernel.  Remove comment '#' in #option WITNESS
> src/sys/arch/amd64/conf/GENERIC.MP and rebuild fresh kernel after
> make clean and make config.  Set sysctl kern.witness.watch=2 to get
> stacktraces.  It might report some false positives or known bugs.

Thanks for the recommendation.  I've just started running a -current
kernel with WITNESS enabled and with kern.witness.watch=2.

> Maybe it finds something.  Best we can expect is a panic instead
> of hang.  Then show all locks in ddb and trace on all CPU would be
> useful.

Stress tests running, let's see if I can send more useful debug info.
I'll keep the list posted.

Thanks again,
--Kor

> > >How-To-Repeat:
> >
> > Start lots of thread-intensive programs, like pigz(1), nfdump(1), etc.
> > I also had a simple C test program using SYSV IPC semaphores running.
> > The problem seems to require a reasonable number of CPUs (16 or more)
> > to manifest itself.
> >
> > >Fix:
> >
> > Unknown.
> >
> > Thanks,
> > --Kor
