On Wed, Jun 18, 2025 at 7:01 PM Alexander Bluhm <[email protected]> wrote:
>
> On Wed, Jun 18, 2025 at 04:54:34PM -0300, K R wrote:
> > >Synopsis: server freezes under heavy CPU usage
> > >Category:      kernel
> > >Environment:
> >         System      : OpenBSD 7.7
> >         Details     : OpenBSD 7.7-current (GENERIC.MP) #21: Tue Jun 17 17:40:27 MDT 2025
> >             [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> >
> >         Architecture: OpenBSD.amd64
> >         Machine     : amd64
> > >Description:
> >
> > This machine is a Dell PowerEdge R440 with 16 CPUs and 128 GB of RAM.
> > It freezes under heavy CPU usage, especially with lots of threads.
> > This started with 7.7-release + syspatches but continues with a
> > -current as of today.
> >
> > No panic, nothing, it just freezes.  I can't even break into ddb (with
> > ddb.console=1).  During the last test, top(1) froze with this last output:
> >
> > load averages: 10.73, 11.12, 10.53                   test 16:46:44
> > 125 processes: 93 idle, 32 on processor         up 0 days 00:59:55
> > 16  CPUs: 17.7% user, 51.5% nice,  3.6% sys,  1.1% spin,  1.1% intr, 25.0% idle
> > Memory: Real: 13G/37G act/tot Free: 87G Cache: 22G Swap: 0K/64G
> >
> >   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> > 60756 root      64    0   13G   13G onproc/1  -        35:05 100.78% python3.12
> > 27129 root      10   20 9124K 1532K onproc/2  fsleep    8:31 75.98% semaphore
> > 42272 root      10   20 9644K 1540K onproc/4  fsleep    8:36 75.83% semaphore
> > 60257 root      10   20 9644K 1560K onproc/14 fsleep    8:32 74.76% semaphore
> > 27054 root      64    0  384K  328K onproc/7  -         0:26 71.24% rm
> > 58070 root      10    0   15M 4428K sleep/13  fsleep   23:45 36.04% nfdump
> > 11522 root      10   20 9636K 1524K onproc/0  fsleep    8:44 31.93% semaphore
> > 40359 root      10   20 9648K 1556K onproc/2  fsleep    8:44 29.88% semaphore
> > 72237 root      10   20 9632K 1520K onproc/0  fsleep    8:41 27.20% semaphore
> > 42031 root      10   20 9648K 1576K onproc/8  fsleep    8:39 27.10% semaphore
> > 97960 root      10   20 9644K 1536K onproc/8  fsleep    8:39 26.46% semaphore
> > 76525 root      10    0   95M   57M sleep/12  fsleep   10:01 12.84% nfdump
> > 68093 root      10   20   96M   64M sleep/3   fsleep   10:07 12.11% nfdump
> > 94072 root      -5   20   27M   11M sleep/3   biowait   4:42  1.03% pigz
> > 52734 root       2    0 1640K 2740K sleep/4   kqread    0:37  0.98% top
> > 84043 root      10   20   27M   11M sleep/3   inode     2:07  0.34% pigz
> > 95028 root      10   20   27M   11M sleep/4   inode     2:09  0.15% pigz
> > 66823 root      10   20   26M   11M sleep/0   inode     2:07  0.05% pigz
> > 59751 root       2    0 2768K 3244K sleep/0   kqread    0:09  0.05% tmux
> > 58124 root     -22    0    0K    4K sleep/1   -        37:01  0.00% idle1
> > 59513 root     -22    0    0K    4K sleep/2   -        36:27  0.00% idle2
> >
> > Any recommendations on what might help debug this?
>
> Run a WITNESS kernel.  Remove the comment '#' from the '#option WITNESS'
> line in src/sys/arch/amd64/conf/GENERIC.MP and rebuild a fresh kernel
> after make clean and make config.  Set sysctl kern.witness.watch=2 to
> get stack traces.  It might report some false positives or known bugs.

Thanks for the recommendation.  I've just started running a -current
kernel with WITNESS enabled and with kern.witness.watch=2.
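
For the record, these are roughly the steps I used (the standard kernel
build procedure; the config file is the one you pointed at):

    # uncomment the 'option WITNESS' line in the kernel config, then:
    cd /usr/src/sys/arch/amd64/compile/GENERIC.MP
    make obj
    make config
    make clean
    make
    make install
    reboot
    # after reboot:
    sysctl kern.witness.watch=2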

> Maybe it finds something.  The best we can expect is a panic instead
> of a hang.  Then 'show all locks' in ddb and a trace on all CPUs would
> be useful.
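
Understood.  If it does panic rather than hang, I'll try to capture
roughly the following in ddb (commands per ddb(4); the per-CPU part is
my understanding of how to get a trace from every CPU):

    show panic
    show all locks
    ps
    trace
    machine ddbcpu 0    # then 'trace'; repeat for each CPU number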

Stress tests are running; let's see if I can send more useful debugging
info.  I'll keep the list posted.
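
In case it helps with reproducing this, the simple C test program
mentioned under How-To-Repeat below is roughly the following.  This is a
simplified sketch from memory; the worker and iteration counts here are
guesses, not the exact values I used:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/wait.h>

    #include <err.h>
    #include <unistd.h>

    #define NWORKERS 16       /* guess: roughly one worker per CPU */
    #define NLOOPS   1000000  /* guess: keep each worker busy for a while */

    int
    main(void)
    {
        struct sembuf up   = { 0,  1, 0 };  /* post semaphore 0 */
        struct sembuf down = { 0, -1, 0 };  /* wait on semaphore 0 */
        int semid, i, n;

        /*
         * One private semaphore.  Each worker posts before it waits,
         * so no semctl(SETVAL) initialization step is needed.
         */
        semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid == -1)
            err(1, "semget");

        for (i = 0; i < NWORKERS; i++) {
            switch (fork()) {
            case -1:
                err(1, "fork");
            case 0:
                /* child: hammer the semaphore with post/wait pairs */
                for (n = 0; n < NLOOPS; n++) {
                    if (semop(semid, &up, 1) == -1)
                        err(1, "semop up");
                    if (semop(semid, &down, 1) == -1)
                        err(1, "semop down");
                }
                _exit(0);
            }
        }

        /* parent: wait for all workers, then remove the semaphore set */
        while (wait(NULL) > 0)
            ;
        if (semctl(semid, 0, IPC_RMID) == -1)
            err(1, "semctl IPC_RMID");
        return 0;
    }

Something like that, running alongside the pigz and nfdump jobs, is what
was going on when the box froze.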

Thanks again,
--Kor

> > >How-To-Repeat:
> >
> > Start lots of thread-intensive programs, like pigz(1), nfdump(1), etc.
> > I also had a simple C test program running that uses SYSV IPC semaphores.
> > The problem seems to require a fairly large number of CPUs (16 or more)
> > to manifest itself.
> >
> > >Fix:
> >
> > Unknown.
> >
> > Thanks,
> > --Kor
