On Thu, Jun 19, 2025 at 9:03 AM K R <[email protected]> wrote:
>
> On Wed, Jun 18, 2025 at 7:01 PM Alexander Bluhm <[email protected]>
> wrote:
> >
> > On Wed, Jun 18, 2025 at 04:54:34PM -0300, K R wrote:
> > > >Synopsis:      server freezes under heavy CPU usage
> > > >Category:      kernel
> > > >Environment:
> > >         System      : OpenBSD 7.7
> > >         Details     : OpenBSD 7.7-current (GENERIC.MP) #21: Tue Jun 17 17:40:27 MDT 2025
> > >                       [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > >
> > >         Architecture: OpenBSD.amd64
> > >         Machine     : amd64
> > > >Description:
> > >
> > > This machine is a Dell PowerEdge R440 with 16 CPUs and 128GB of RAM.
> > > It freezes under heavy CPU usage, especially with lots of threads.
> > > This started with 7.7-release + syspatches but continues with a
> > > -current as of today.
> > >
> > > No panic, nothing, just freezes. Can't even force into ddb (with
> > > ddb.console=1). During last test, top(1) froze with this last output:
> > >
> > > load averages: 10.73, 11.12, 10.53                          test 16:46:44
> > > 125 processes: 93 idle, 32 on processor                up 0 days 00:59:55
> > > 16 CPUs: 17.7% user, 51.5% nice, 3.6% sys, 1.1% spin, 1.1% intr, 25.0% idle
> > > Memory: Real: 13G/37G act/tot Free: 87G Cache: 22G Swap: 0K/64G
> > >
> > >   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT     TIME     CPU COMMAND
> > > 60756 root      64    0   13G   13G onproc/1  -       35:05 100.78% python3.12
> > > 27129 root      10   20 9124K 1532K onproc/2  fsleep   8:31  75.98% semaphore
> > > 42272 root      10   20 9644K 1540K onproc/4  fsleep   8:36  75.83% semaphore
> > > 60257 root      10   20 9644K 1560K onproc/14 fsleep   8:32  74.76% semaphore
> > > 27054 root      64    0  384K  328K onproc/7  -        0:26  71.24% rm
> > > 58070 root      10    0   15M 4428K sleep/13  fsleep  23:45  36.04% nfdump
> > > 11522 root      10   20 9636K 1524K onproc/0  fsleep   8:44  31.93% semaphore
> > > 40359 root      10   20 9648K 1556K onproc/2  fsleep   8:44  29.88% semaphore
> > > 72237 root      10   20 9632K 1520K onproc/0  fsleep   8:41  27.20% semaphore
> > > 42031 root      10   20 9648K 1576K onproc/8  fsleep   8:39  27.10% semaphore
> > > 97960 root      10   20 9644K 1536K onproc/8  fsleep   8:39  26.46% semaphore
> > > 76525 root      10    0   95M   57M sleep/12  fsleep  10:01  12.84% nfdump
> > > 68093 root      10   20   96M   64M sleep/3   fsleep  10:07  12.11% nfdump
> > > 94072 root      -5   20   27M   11M sleep/3   biowait  4:42   1.03% pigz
> > > 52734 root       2    0 1640K 2740K sleep/4   kqread   0:37   0.98% top
> > > 84043 root      10   20   27M   11M sleep/3   inode    2:07   0.34% pigz
> > > 95028 root      10   20   27M   11M sleep/4   inode    2:09   0.15% pigz
> > > 66823 root      10   20   26M   11M sleep/0   inode    2:07   0.05% pigz
> > > 59751 root       2    0 2768K 3244K sleep/0   kqread   0:09   0.05% tmux
> > > 58124 root     -22    0    0K    4K sleep/1   -       37:01   0.00% idle1
> > > 59513 root     -22    0    0K    4K sleep/2   -       36:27   0.00% idle2
> > >
> > > Any recommendations on what could help debugging?
> >
> > Run a witness kernel. Remove comment '#' in #option WITNESS
> > src/sys/arch/amd64/conf/GENERIC.MP and rebuild fresh kernel after
> > make clean and make config. Set sysctl kern.witness.watch=2 to get
> > stacktraces. It might report some false positives or known bugs.
>
> Thanks for the recommendation. I've just started running a -current
> kernel with WITNESS enabled and with kern.witness.watch=2.
>
> > Maybe it finds something. Best we can expect is a panic instead
> > of hang. Then show all locks in ddb and trace on all CPU would be
> > useful.
>
> Stress tests running, let's see if I can send more useful debug info.
> I'll keep the list posted.

Hi Alexander,

The good news: I can consistently reproduce the hang.

The bad news: even with a WITNESS kernel and kern.witness.watch=2 (or
even 3), I don't see any message or kernel panic.

Any additional suggestions for getting more debug information, or for
forcing the machine into ddb?

Thanks,
--Kor

>
> Thanks again,
> --Kor
>
> > > >How-To-Repeat:
> > >
> > > Start lots of thread-intensive programs, like pigz(1), nfdump(1), etc.
> > > I also had a simple C test program using SYSV IPC semaphores running.
> > > The problem seems to require a reasonable number of CPUs (16 or more)
> > > to manifest itself.
> > >
> > > >Fix:
> > >
> > > Unknown.
> > >
> > > Thanks,
> > > --Kor
