On Tue, Jan 14, 2020 at 11:39:04AM +0100, Raimo Niskanen wrote:
> Sorry about not having anything solid to go on, but in (amd64) OpenBSD 6.5
> I had a problem where, during our daily tests of Erlang/OTP, the test
> machines froze after a while.
> 
> I did not manage to find a root cause, but I think it had to do with some
> lock on the process table that caused the table to fill.
> 
> The problem was fixed in 6.6 (also patched), but is back again in -current,
> at least up to Jan 10.
> 
> The frozen machine does not even respond to Ctrl+Alt+Esc on the console to
> enter ddb.  I do not know where to begin, except to upgrade to a later
> -current and see if it persists.

This Monday (March 23) I upgraded to a new snapshot, and now the machine
survives our daily tests.  The previous snapshot was maybe two weeks older.

There is one section of the daily tests that always fails, though: it
consists of starting many child processes and communicating with them over
TCP on the loopback interface.  I also noticed high CPU utilization both by
the softnet thread and by a helper program we use for name resolution.
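
Next time that section runs I plan to keep an eye on the loopback traffic
and mbuf usage with standard tools, roughly along these lines (exact
invocations from memory, so treat them as a sketch):

  systat ifstat        # per-interface packet rates, lo0 included
  systat mbufs         # mbuf and cluster usage
  netstat -s -p tcp    # TCP counters before and after the failing section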

I will try to investigate more.

Best regards
/ Raimo Niskanen


> 
> Best regards
> / Raimo Niskanen
> 
> 
> On Mon, Nov 25, 2019 at 09:53:23AM +0100, Raimo Niskanen wrote:
> > I have upgraded the machines to 6.6 and the problem seems to be gone.
> > The machines have now been up for 10 and 16 days, which is a record.
> > 
> > Great!
> > 
> > / Raimo Niskanen
> > 
> > 
> > 
> > On Fri, Nov 01, 2019 at 11:50:18AM +0100, Andreas Kusalananda Kähäri wrote:
> > > On Fri, Nov 01, 2019 at 11:38:06AM +0100, Raimo Niskanen wrote:
> > > > On Mon, Sep 09, 2019 at 05:44:32PM +0200, Raimo Niskanen wrote:
> > > > > On Mon, Sep 09, 2019 at 05:42:02PM +0200, Raimo Niskanen wrote:
> > > > > > On Wed, Jul 31, 2019 at 05:46:08PM +0200, Raimo Niskanen wrote:
> > > > > > > On Mon, Jul 29, 2019 at 01:20:58PM +0000, Stuart Henderson wrote:
> > > > > > > > On 2019-07-29, Raimo Niskanen <raimo+open...@erix.ericsson.se> 
> > > > > > > > wrote:
> > > > > > > > > A new hang, which I tried to investigate:
> > > > > > > > >
> > > > > > > > > On July 19 the last entry in my 'ps' log was from 14:55,
> > > > > > > > > which is also the time shown on the 'systat vmstat' screen
> > > > > > > > > when it froze.  Then the machine hums along, but just after
> > > > > > > > > midnight, at 00:42:01, the first "/bsd: process: table is
> > > > > > > > > full" entry appears.  That message repeats until I rebooted
> > > > > > > > > the machine today, July 29, at 10:48.
> > > > > > > > >
> > > > > > > > > I had a terminal with top running.  It was still updating.
> > > > > > > > > It showed about 98% sys and 2% spin on one of 4 CPUs, the
> > > > > > > > > others 100% idle.  Then (after the process table had gotten
> > > > > > > > > full) it had 1282 idle processes and 1 on processor, which
> > > > > > > > > was 'top' itself.
> > > > > > > > > Memory: Real: 456M/1819M act/tot Free: 14G Cache: 676M
> > > > > > > > > Swap: 0K/16G.
> > > > > > > > >
> > > > > > > > > I had 8 shells under tmux ready for debugging.  'ls' worked.
> > > > > > > > > 'systat' on one shell hung.  'top' on another failed with
> > > > > > > > > "cannot fork".  'exec ps ajxww' printed two lines, for
> > > > > > > > > /sbin/init and /sbin/slaac, and then hung.  'exec reboot'
> > > > > > > > > did not succeed.  Neither did a short press of the power
> > > > > > > > > button, which at least caused the printout "stopping daemon
> > > > > > > > > nginx(failed)" but got no further.  I had to do a hard
> > > > > > > > > power off.
> > > > > > > > >
> > > > > > > > > My theory now is that our daily tests, right before 14:55,
> > > > > > > > > started a process (the one at the top of the 'top' listing,
> > > > > > > > > with 10:14 of execution time) that triggers a lock or other
> > > > > > > > > contention problem in the kernel, which causes one CPU to
> > > > > > > > > spin in the system and blocks processes from dying.  About
> > > > > > > > > 10 hours later the process table gets full.
> > > > > > > > >
> > > > > > > > > Any, ANY ideas of how to proceed would be appreciated!
> > > > > > > > >
> > > > > > > > > Best Regards
> > > > > > > > 
> > > > > > > > Did you notice any odd waitchans (WAIT in top output)?
> > > > > > > > 
> > > > > > > > Maybe set ddb.console=1 in sysctl.conf and reboot (if not
> > > > > > > > already set), then try to break into DDB during a hang and
> > > > > > > > see how things look in ps there.  (Test breaking into DDB
> > > > > > > > before a hang first so you know that you can do it .. you
> > > > > > > > can just "c" to continue).
> > > > > > > > 
> > > > > > > > There might also be clues in things like "sh malloc" or
> > > > > > > > "sh all pools".
> > > > > > > > 
> > > > > > > > Perhaps you could also get clues from running a kernel built
> > > > > > > > with 'option WITNESS'; you may get some messages in dmesg,
> > > > > > > > and it adds commands to ddb like "show locks", "show all
> > > > > > > > locks", "show witness" (see ddb(4) for details).
> > > > > > > 
> > > > > > > I have enabled Witness; it went so-so.  We'll see what it catches.
> > > > > > > 
> > > > > > > I downloaded 6.5 amd64 src.tar.gz and sys.tar.gz, unpacked them,
> > > > > > > applied all patches for stable 001-006 and built a kernel with:
> > > > > > >   include "arch/amd64/conf/GENERIC"
> > > > > > >   option  MULTIPROCESSOR
> > > > > > >   option  MP_LOCKDEBUG
> > > > > > >   option  WITNESS
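> > > > > > > The build itself was just the standard custom-kernel procedure,
> > > > > > > roughly (from memory, and with CUSTOM as a placeholder name for
> > > > > > > the config file holding the options above):
> > > > > > >   cd /usr/src/sys/arch/amd64/conf
> > > > > > >   config CUSTOM
> > > > > > >   cd ../compile/CUSTOM
> > > > > > >   make && make install    # then reboot into the new kernel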
> > > > > > > 
> > > > > > > Then I activated in /etc/sysctl.conf:
> > > > > > >   ddb.console=1
> > > > > > >   kern.witness.locktrace=1
> > > > > > >   kern.witness.watch=3
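> > > > > > > (After rebooting into the new kernel, a quick
> > > > > > >   sysctl ddb.console kern.witness.watch kern.witness.locktrace
> > > > > > > should confirm that the settings took effect - just a sanity check.)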
> > > > > > > 
> > > > > > > For fun, I pressed Ctrl+Alt+Esc at the console, got a ddb>
> > > > > > > prompt and typed "show witness".  It printed lots of info and I
> > > > > > > scrolled down to the end, but during the printout there was a
> > > > > > > UVM fault:
> > > > > > > 
> > > > > > >   Spin locks:
> > > > > > >   /usr/src/sys/....
> > > > > > >   :
> > > > > > >   bla bla bla
> > > > > > >   :
> > > > > > >   uvm_fault(0xffffffff81e03b50, 0xffff800022368360, 0, 1) -> e
> > > > > > >   kernel: page fault trap, code=0
> > > > > > >   Faulted in DDB: continuing...
> > > > > > > 
> > > > > > > Then I typed "cont" and it panicked.
> > > > > > > If anybody wants details, I took a picture.
> > > > > > > 
> > > > > > > Have I combined too many debugging options, or is this sh*t
> > > > > > > that happens?
> > > > > > > 
> > > > > > > Nevertheless, now the machine is running again, with Witness...
> > > > > > > 
> > > > > > > I'll be back.
> > > > > > 
> > > > > > I have encountered some kind of stop, oddly enough not a panic -
> > > > > > it just sat in ddb and I missed it for a week (or more).  By then
> > > > > > I did not remember what I had planned to do, so I "improvised"
> > > > > > X-| , but anyway:
> > > > > > 
> > > > > > ddb{0}> ps
> > > > > > shows about 350 processes from cron, half of them in state
> > > > > > netlock, half
> > > > > 
> > > > > Sorry, that should have been about 1350 processes...
> > > > > 
> > > > > > in state piperd.  Then I have my test processes beam.smp: 6 in
> > > > > > netlock, 6 in piperd, about 70 in fsleep, 3 in poll, 3 in select,
> > > > > > 4 in kqread.
> > > > > > Then about 100 more ordinary looking processes...
> > > > > > 
> > > > > > ddb{0}> trace
> > > > > > db_enter()...
> > > > > > softclock(0)...
> > > > > > softintr_dispatch(0)...
> > > > > > Xsoftclock(0,0,1388,....)...
> > > > > > acpicpu_idle()...
> > > > > > sched_idle(ffffffff81ceff0)...
> > > > > > end trace frame: 0x0, count: -6
> > > > > > 
> > > > > > ddb{0}> show locks
> > > > > > exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e37b10)
> > > > > > locked @ /usr/src/sys/arch/amd64/amd64/softintr.c:87
> > > > > > #0  witness_lock+0x41f
> > > > > > #1  softintr_dispatch+0x56
> > > > > > #2  Xsoftclock+0x1f
> > > > > > #3  acpicpu_idle+0x271
> > > > > > #4  sched_idle+0x235
> > > > > > #5  proc_trampoline+0x1c
> > > > 
> > > > I have gotten two more hangs, but no panic.  I had to go to the debug
> > > > console with Ctrl+Alt+Esc.  "show all locks" shows this:
> > > 
> > > Did you ever investigate what left those 1350 cron jobs running (i.e.
> > > actually look at the crontabs to see if they are correct, and test-run
> > > the jobs to make sure that they run as expected)?  Is that number of
> > > cron jobs executing at once normal on your system?
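> > > One quick way to get an overview, assuming the machine is still
> > > responsive at that point, would be something along these lines (a
> > > sketch - adjust the user names as needed):
> > > 
> > >   ls /var/cron/tabs/                    # which users have crontabs at all
> > >   crontab -l -u <user>                  # inspect a given user's entries (as root)
> > >   pgrep -P "$(pgrep -x cron)" | wc -l   # rough count of cron's direct children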
> > > 
> > > 
> > > > 
> > > >   Process nnnnn (softnet) thread 0xffff8000ffffe4b0 (461575)
> > > >   exclusive rwlock netlock r = 0 (0xffffffff81cfb420) locked @
> > > >   /usr/src/sys/net/if.c:950
> > > >   #0 witness_lock+0x41f
> > > >   #1 if_netisr+0x22
> > > >   #2 taskq_thread+0x4d
> > > >   #3 proc_trampoline+0x1c
> > > > 
> > > >   Process nnnnn (idle0) thread 0xffff8000fffff518 (155781)
> > > >   exclusive kernel_lock &kernel_lock r = 0 (0xffffffff81e177c0) locked @
> > > >   /usr/src/sys/arch/amd64/amd64/softintr.c:87
> > > >   #0 witness_lock+0x41f
> > > >   #1 softintr_dispatch+0x56
> > > >   #2 Xsoftclock+0x1f
> > > >   #3 acpicpu_idle+0x271
> > > >   #4 sched_idle+0x235
> > > >   #5 proc_trampoline+0x1c
> > > > 
> > > > Any clues, anyone?
> > > > The machine hangs about once a week.
> > > > 
> > > > 
> > > > > > 
> > > > > > ddb{0}> show nfsnode
> > > > > > size 5476515891122113356 flag 0 vnode 0xd080c7d8 accstamp 1099511681152
> > > > > > 
> > > > > > (I think the size looks strange)
> > > > > > 
> > > > > > Then I tried "show map", got a protection fault trap, gave up,
> > > > > > and rebooted.
> > > > > > 
> > > > > > That was it!  Next time I will try:
> > > > > >   trace
> > > > > >   ps
> > > > > >   show malloc
> > > > > >   show all pools
> > > > > >   show locks
> > > > > >   show all locks
> > > > > > unless anyone has got more or better suggestions...
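> > > > > > To make those ddb sessions easier to capture I am also considering
> > > > > > moving the console to serial via /etc/boot.conf, roughly (assuming
> > > > > > the box has a usable com0 port):
> > > > > >   stty com0 115200
> > > > > >   set tty com0
> > > > > > and then logging the serial output from another machine.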
> > > > > > 
> > > > > > Best Regards
> > > > > > -- 
> > > > > > 
> > > > > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > > > > 
> > > > > -- 
> > > > > 
> > > > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > > > 
> > > > -- 
> > > > 
> > > > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> > > 
> > > -- 
> > > Andreas (Kusalananda) Kähäri
> > > SciLifeLab, NBIS, ICM
> > > Uppsala University, Sweden
> > 
> > -- 
> > 
> > / Raimo Niskanen, Erlang/OTP, Ericsson AB
> 
> -- 
> 
> / Raimo Niskanen, Erlang/OTP, Ericsson AB

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
