Sorry about not having anything solid to go on, but on (amd64) OpenBSD 6.5
I had a problem where the test machines froze after a while during our
daily tests of Erlang/OTP.

I did not manage to find a root cause, but I think it had to do with some
lock on the process table that caused the table to fill.

The problem was fixed in 6.6 (also patched), but it is back again in
-current, at least up to Jan 10.

The frozen machine does not even respond to Ctrl+Alt+Esc on the console to
enter ddb.  I do not know where to begin, except upgrading to a later
-current to see if the problem persists.
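
To at least catch the next hang in the act, a small logger could run
beside my 'ps' log.  An untested sketch (plain sysctl(2), nothing fancy)
that samples kern.nprocs against kern.maxproc once a minute:

  #include <sys/types.h>
  #include <sys/sysctl.h>
  #include <stdio.h>
  #include <time.h>
  #include <unistd.h>

  /* Read one integer-valued kern.* sysctl; -1 on error. */
  static int
  kern_int(int item)
  {
          int mib[2] = { CTL_KERN, item };
          int val = -1;
          size_t len = sizeof(val);

          if (sysctl(mib, 2, &val, &len, NULL, 0) == -1)
                  return -1;
          return val;
  }

  int
  main(void)
  {
          int maxproc = kern_int(KERN_MAXPROC);

          /* One line per minute; redirect stdout to a log file. */
          for (;;) {
                  printf("%lld nprocs=%d maxproc=%d\n",
                      (long long)time(NULL), kern_int(KERN_NPROCS), maxproc);
                  fflush(stdout);
                  sleep(60);
          }
  }

If nprocs creeps up towards maxproc for hours before the "process: table
is full" messages start, that would at least confirm that processes stop
dying long before the table is actually full.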

Best regards
/ Raimo Niskanen


On Mon, Nov 25, 2019 at 09:53:23AM +0100, Raimo Niskanen wrote:
> I have upgraded the machines to 6.6 and the problem seems to be gone.
> The machines have now been up for 10 and 16 days, which is a record.
> 
> Great!
> 
> / Raimo Niskanen
> 
> 
> 
> On Fri, Nov 01, 2019 at 11:50:18AM +0100, Andreas Kusalananda Kähäri wrote:
> > On Fri, Nov 01, 2019 at 11:38:06AM +0100, Raimo Niskanen wrote:
> > > On Mon, Sep 09, 2019 at 05:44:32PM +0200, Raimo Niskanen wrote:
> > > > On Mon, Sep 09, 2019 at 05:42:02PM +0200, Raimo Niskanen wrote:
> > > > > On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> > > > > > On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > > > > > > On 2019-07-29, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > > > > > > > A new hang that I tried to investigate:
> > > > > > > >
> > > > > > > > On July 19 the last log entry from my 'ps' log was from 14:55,
> > > > > > > > which is also the time on the 'systat vmstat' screen when it
> > > > > > > > froze.  Then the machine hums along, but just after midnight,
> > > > > > > > at 00:42:01, the first "/bsd: process: table is full" entry
> > > > > > > > appears.  That message repeats until I rebooted the machine
> > > > > > > > today, July 29, at 10:48.
> > > > > > > >
> > > > > > > > I had a terminal with top running.  It was still updating.  It
> > > > > > > > showed about 98% sys and 2% spin on one of 4 CPUs, the others
> > > > > > > > 100% idle.  Then (after the process table had gotten full) it
> > > > > > > > had 1282 idle processes and 1 on processor, which was 'top'
> > > > > > > > itself.
> > > > > > > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M
> > > > > > > > Swap: 0K/16G.
> > > > > > > >
> > > > > > > > I had 8 shells under tmux ready for debugging.  'ls' worked.
> > > > > > > > 'systat' on one hung.  'top' on another failed with "cannot
> > > > > > > > fork".  'exec ps ajxww' printed two lines, with /sbin/init and
> > > > > > > > /sbin/slaac, and then hung.  'exec reboot' did not succeed.
> > > > > > > > Neither did a short press of the power button; that at least
> > > > > > > > caused a printout "stopping daemon nginx(failed)", but got no
> > > > > > > > further.  I had to do a hard power off.
> > > > > > > >
> > > > > > > > My theory now is that our daily tests right before 14:55 started
> > > > > > > > a process (this is the topmost 'top' process, with 10:14
> > > > > > > > execution time) that triggers a lock or other contention problem
> > > > > > > > in the kernel, which causes one CPU to spin in the system and
> > > > > > > > blocks processes from dying.  About 10 hours later the process
> > > > > > > > table gets full.
> > > > > > > >
> > > > > > > > Any, ANY ideas of how to proceed would be appreciated!
> > > > > > > >
> > > > > > > > Best Regards
> > > > > > > 
> > > > > > > Did you notice any odd wait channels (WAIT in top output)?
> > > > > > > 
> > > > > > > Maybe set ddb.console=1 in sysctl.conf and reboot (if not already
> > > > > > > set), then try to break into DDB during a hang and see how things
> > > > > > > look in ps there.  (Test breaking into DDB before a hang first so
> > > > > > > you know that you can do it .. you can just "c" to continue).
> > > > > > > 
> > > > > > > There might also be clues in things like "sh malloc" or
> > > > > > > "sh all pools".
> > > > > > > 
> > > > > > > Perhaps you could also get clues from running a kernel built with
> > > > > > > 'option WITNESS'; you may get some messages in dmesg, and it adds
> > > > > > > commands to ddb like "show locks", "show all locks" and "show
> > > > > > > witness" (see ddb(4) for details).
> > > > > > 
> > > > > > I have enabled Witness; it went so-so.  We'll see what it catches.
> > > > > > 
> > > > > > I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> > > > > > applied all patches for stable 001-006 and built a kernel with:
> > > > > >   include "arch/amd64/conf/GENERIC"
> > > > > >   option    MULTIPROCESSOR
> > > > > >   option    MP_LOCKDEBUG
> > > > > >   option    WITNESS
> > > > > > 
> > > > > > Then I activated in /etc/sysctl.conf:
> > > > > >   ddb.console=1
> > > > > >   kern.witness.locktrace=1
> > > > > >   kern.witness.watch=3
> > > > > > 
> > > > > > For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb> prompt,
> > > > > > and typed "show witness".  It printed lots of info and I scrolled
> > > > > > down to the end, but during the printout there was a UVM fault:
> > > > > > 
> > > > > >   Spin locks:
> > > > > >   /usr/src/sys/....
> > > > > >   :
> > > > > >   bla bla bla
> > > > > >   :
> > > > > >   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
> > > > > >   kernel: page fault trap, code=0
> > > > > >   Faulted in DDB: continuing...
> > > > > > 
> > > > > > Then I typed "cont" and it panicked.
> > > > > > If anybody wants details, I took a picture.
> > > > > > 
> > > > > > Have I combined too many debugging options, or is this sh*t that 
> > > > > > happens?
> > > > > > 
> > > > > > Nevertheless, now the machine is running again, with Witness...
> > > > > > 
> > > > > > I'll be back.
> > > > > 
> > > > > I have encountered some kind of stop, oddly enough not a panic -
> > > > > it just sat in ddb and I missed it for a week (or more).  Then I
> > > > > did not remember what I had planned to do, so I "improvised" X-| ,
> > > > > but anyway:
> > > > > 
> > > > > ddb{0}> ps
> > > > > shows about 350 processes from cron, half of them in state netlock, 
> > > > > half
> > > > 
> > > > Sorry, that should have been about 1350 processes...
> > > > 
> > > > > in state piperd.  Then I have my test processes beam.smp: 6 in 
> > > > > netlock, 6
> > > > > in piperd, about 70 in fsleep, 3 in poll, 3 in select, 4 in kqread.
> > > > > Then about 100 more ordinary-looking processes...
> > > > > 
> > > > > ddb{0}> trace
> > > > > db_enter()...
> > > > > softclock(0)...
> > > > > softintr_dispatch(0)...
> > > > > Xsoftclock(0,0,1388,....)...
> > > > > acpicpu_idle()...
> > > > > sched_idle(ffffffff81ceff0)...
> > > > > end trace frame: 0x0, count: -6
> > > > > 
> > > > > ddb{0}> show locks
> > > > > exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10) locked @
> > > > > /usr/src/sys/arch/amd64/amd64/softintr.c:87
> > > > > #0  witness_lock+0x41f
> > > > > #1  softintr_dispatch+0x56
> > > > > #2  Xsoftclock+0x1f
> > > > > #3  acpicpu_idle+0x271
> > > > > #4  sched_idle+0x235
> > > > > #5  proc_trampoline+0x1c
> > > 
> > > I have gotten two more hangs, no panic.  Had to go to the debug
> > > console with Alt+Ctrl+Esc.  "show all locks" shows this:
> > 
> > Did you ever investigate what left those 1350 cron jobs running
> > (i.e. actually look at the crontabs to see if they were correct and
> > test-run the jobs to make sure that they run as expected)?  Is that
> > number of cron jobs executing at once normal on your system?
> > 
> > 
> > > 
> > >   Process nnnnn (softnet) thread 0xffff8000ffffe4b0 (461575)
> > >   exclusive rwlock netlock r = 0 (0xffffffff81cfb420) locked @
> > >   /usr/src/sys/net/if.c:950
> > >   #0 witness_lock+0x41f
> > >   #1 if_netisr+0x22
> > >   #2 taskq_thread+0x4d
> > >   #3 proc_trampoline+0x1c
> > > 
> > >   Process nnnnn (idle0) thread 0xffff8000fffff518 (155781)
> > >   exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e177c0) locked @
> > >   /usr/src/sys/arch/amd64/amd64/softintr.c:87
> > >   #0 witness_lock+0x41f
> > >   #1 softintr_dispatch+0x56
> > >   #2 Xsoftclock+0x1f
> > >   #3 acpicpu_idle+0x271
> > >   #4 sched_idle+0x235
> > >   #5 proc_trampoline+0x1c
> > > 
> > > Any clues, anyone?
> > > The machine hangs about once a week.
> > > 
> > > 
> > > > > 
> > > > > ddb{0}> show nfsnode
> > > > > size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152
> > > > > 
> > > > > (I think the size looks strange)
> > > > > 
> > > > > Then I tried "show map" and got a protection fault trap, so I gave
> > > > > up and rebooted.
> > > > > 
> > > > > That was it!  Next time I will try:
> > > > >   trace
> > > > >   ps
> > > > >   show malloc
> > > > >   show all pools
> > > > >   show locks
> > > > >   show all locks
> > > > > unless anyone has got more or better suggestions...
> > > > > 
> > > > > Best Regards
> > > > > -- 
> > > > > 
> > > > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > > > 
> > > > -- 
> > > > 
> > > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > > 
> > > -- 
> > > 
> > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > 
> > -- 
> > Andreas (Kusalananda) Kähäri
> > SciLifeLab, NBIS, ICM
> > Uppsala University, Sweden
> 
> -- 
> 
> / Raimo Niskanen, Erlang/OTP, Ericsson AB

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
