On Thu, Jun 19, 2025 at 9:03 AM K R <[email protected]> wrote:
>
> On Wed, Jun 18, 2025 at 7:01 PM Alexander Bluhm <[email protected]> wrote:
> >
> > On Wed, Jun 18, 2025 at 04:54:34PM -0300, K R wrote:
> > > >Synopsis: server freezes under heavy CPU usage
> > > >Category:      kernel
> > > >Environment:
> > >         System      : OpenBSD 7.7
> > >         Details     : OpenBSD 7.7-current (GENERIC.MP) #21: Tue Jun 17 17:40:27 MDT 2025
> > >
> > > [email protected]:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > >
> > >         Architecture: OpenBSD.amd64
> > >         Machine     : amd64
> > > >Description:
> > >
> > > > This machine is a Dell PowerEdge R440 with 16 CPUs and 128GB of RAM.
> > > > It freezes under heavy CPU usage, especially with lots of threads.
> > > This started with 7.7-release + syspatches but continues with a
> > > -current as of today.
> > >
> > > No panic, nothing, just freezes.  Can't even force into ddb (with
> > > ddb.console=1).  During last test, top(1) froze with this last output:
> > >
> > > > load averages: 10.73, 11.12, 10.53                                 test 16:46:44
> > > > 125 processes: 93 idle, 32 on processor                       up 0 days 00:59:55
> > > > 16  CPUs: 17.7% user, 51.5% nice,  3.6% sys,  1.1% spin,  1.1% intr, 25.0% idle
> > > > Memory: Real: 13G/37G act/tot Free: 87G Cache: 22G Swap: 0K/64G
> > > >
> > > >   PID USERNAME PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> > > > 60756 root      64    0   13G   13G onproc/1  -        35:05 100.78% python3.12
> > > > 27129 root      10   20 9124K 1532K onproc/2  fsleep    8:31 75.98% semaphore
> > > > 42272 root      10   20 9644K 1540K onproc/4  fsleep    8:36 75.83% semaphore
> > > > 60257 root      10   20 9644K 1560K onproc/14 fsleep    8:32 74.76% semaphore
> > > > 27054 root      64    0  384K  328K onproc/7  -         0:26 71.24% rm
> > > > 58070 root      10    0   15M 4428K sleep/13  fsleep   23:45 36.04% nfdump
> > > > 11522 root      10   20 9636K 1524K onproc/0  fsleep    8:44 31.93% semaphore
> > > > 40359 root      10   20 9648K 1556K onproc/2  fsleep    8:44 29.88% semaphore
> > > > 72237 root      10   20 9632K 1520K onproc/0  fsleep    8:41 27.20% semaphore
> > > > 42031 root      10   20 9648K 1576K onproc/8  fsleep    8:39 27.10% semaphore
> > > > 97960 root      10   20 9644K 1536K onproc/8  fsleep    8:39 26.46% semaphore
> > > 76525 root      10    0   95M   57M sleep/12  fsleep   10:01 12.84% nfdump
> > > 68093 root      10   20   96M   64M sleep/3   fsleep   10:07 12.11% nfdump
> > > 94072 root      -5   20   27M   11M sleep/3   biowait   4:42  1.03% pigz
> > > 52734 root       2    0 1640K 2740K sleep/4   kqread    0:37  0.98% top
> > > 84043 root      10   20   27M   11M sleep/3   inode     2:07  0.34% pigz
> > > 95028 root      10   20   27M   11M sleep/4   inode     2:09  0.15% pigz
> > > 66823 root      10   20   26M   11M sleep/0   inode     2:07  0.05% pigz
> > > 59751 root       2    0 2768K 3244K sleep/0   kqread    0:09  0.05% tmux
> > > 58124 root     -22    0    0K    4K sleep/1   -        37:01  0.00% idle1
> > > 59513 root     -22    0    0K    4K sleep/2   -        36:27  0.00% idle2
> > >
> > > Any recommendations on what could help debugging?
> >
> > Run a witness kernel.  Remove the comment '#' from #option WITNESS
> > in src/sys/arch/amd64/conf/GENERIC.MP and rebuild a fresh kernel
> > after make clean and make config.  Set sysctl kern.witness.watch=2
> > to get stack traces.  It might report some false positives or known
> > bugs.
>
> Thanks for the recommendation.  I've just started running a -current
> kernel with WITNESS enabled and with kern.witness.watch=2.
>
> > Maybe it finds something.  The best we can expect is a panic instead
> > of a hang.  Then the output of show all locks in ddb and a trace on
> > all CPUs would be useful.
>
> Stress tests running, let's see if I can send more useful debug info.
> I'll keep the list posted.

Hi Alexander,

The good news: I can consistently reproduce the hang problem.  But the
bad news is that even with a WITNESS kernel and kern.witness.watch=2
(or even 3) I don't see any message or kernel panic.

Any additional suggestions for getting more debug information or
forcing the machine into ddb?

Thanks,
--Kor

>
> Thanks again,
> --Kor
>
> > > >How-To-Repeat:
> > >
> > > Start lots of thread-intensive programs, like pigz(1), nfdump(1), etc.
> > > I also had a simple C test program using SYSV IPC semaphores running.
> > > The problem seems to require a reasonable number of CPUs (16 or more)
> > > to manifest itself.
> > >
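
The simple C test program mentioned under How-To-Repeat was not posted.
A minimal sketch of a SysV semaphore stress loop of that kind might look
like the following.  Everything in it (the names sem_stress and hammer,
the single binary semaphore, the fork-per-worker layout) is an assumption
for illustration, not the program used in the report:

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#include <assert.h>
#include <unistd.h>

#ifdef __linux__
/* glibc leaves union semun to the caller; OpenBSD declares it in <sys/sem.h>. */
union semun {
	int val;
	struct semid_ds *buf;
	unsigned short *array;
};
#endif

/*
 * Hypothetical worker: lock and unlock semaphore 0 of the set
 * `iterations` times.  Returns 0 on success, -1 if semop(2) fails.
 */
static int
hammer(int semid, long iterations)
{
	struct sembuf lock = { 0, -1, 0 };	/* P: decrement, may block */
	struct sembuf unlock = { 0, 1, 0 };	/* V: increment */
	long i;

	for (i = 0; i < iterations; i++) {
		if (semop(semid, &lock, 1) == -1 ||
		    semop(semid, &unlock, 1) == -1)
			return -1;
	}
	return 0;
}

/*
 * Hypothetical driver: fork `nworkers` children that all contend on one
 * binary semaphore.  Returns the number of children that exited cleanly,
 * or -1 if the semaphore could not be set up.
 */
int
sem_stress(int nworkers, long iterations)
{
	union semun arg;
	int semid, i, status, started = 0, ok = 0;
	pid_t pid;

	assert(nworkers > 0 && iterations > 0);

	semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	if (semid == -1)
		return -1;

	arg.val = 1;			/* start unlocked */
	if (semctl(semid, 0, SETVAL, arg) == -1) {
		semctl(semid, 0, IPC_RMID);
		return -1;
	}

	for (i = 0; i < nworkers; i++) {
		pid = fork();
		if (pid == 0)
			_exit(hammer(semid, iterations) == 0 ? 0 : 1);
		if (pid > 0)
			started++;
	}

	for (i = 0; i < started; i++) {
		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			ok++;
	}

	semctl(semid, 0, IPC_RMID);	/* always remove the semaphore set */
	return ok;
}
```

Calling sem_stress(32, 10000000L) from a small main(), run under nice(1)
next to pigz and nfdump, would roughly approximate the load described
above.  Whether this sketch triggers the same hang is untested.
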
> > > >Fix:
> > >
> > > Unknown.
> > >
> > > Thanks,
> > > --Kor
