Re: Random lockups on an email server - possibly kern/50168

Taylor R Campbell Sun, 03 Apr 2016 07:36:33 -0700

   Date: Sun, 3 Apr 2016 09:51:08 -0400
   From: "D'Arcy J.M. Cain" <da...@netbsd.org>


   This led me to the following PR.

   http://gnats.netbsd.org/39016

   There is a bit of discussion and then it was closed with "This
   particular problem has been fixed. Other problems that lead to "tstile
   syndrome" still exist, because "tstile syndrome" is any generic
   deadlock."  It doesn't say what the fix was.  Could this be some sort
   of code regression?

Every mutex in the kernel is supposed to be held for at most some
constant duration.  When someone tries to acquire a mutex that is
already held, it will wait with wchan `tstile'.  There are hundreds or
thousands of different mutexes in any given system -- a bug with any
one of them could manifest that way.  

Was your system completely locked up and unresponsive, or just the
services that mattered?  Can you get a stack trace from crash(8) for
the processes that are wedged?  If not, can you enter ddb, e.g. by
typing C-A-ESC, and do it there?

From either crash(8) or ddb, you can list the processes with `show
proc' and get a stack trace for any individual one with `bt 0t<pid>'.
(`0t' is the notation for decimal; ddb reads input as hexadecimal by
default, for whatever reason.)
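
To see why the `0t' prefix matters, here is a small sketch of what
happens to a pid typed without it.  The helper name ddb_default_value
is made up for illustration; it only mimics reading the digits as
hexadecimal and is not anything ddb itself provides.

```shell
#!/bin/sh
# Hypothetical helper: show the decimal value you get when the digits
# of a pid are read as hexadecimal, as ddb does by default.
ddb_default_value() {
    # Prefix the digits with 0x so printf(1) parses them as hex.
    printf '%d\n' "0x$1"
}

# Example: pid 1234 typed without the 0t prefix.
ddb_default_value 1234
```

So a bare `1234' refers to a different number than pid 1234, while
`0t1234' is unambiguous.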

   I am copying tech-kern as we seem to be getting deeper into the
   kernel.  Replies set there as well.

   Meanwhile I am running the following script, a modification of one
   suggested by Robert Elz.

   ...
       case "${wchan}" in
         tstile*)  x="`ps -p "${pid}" | grep tstile`"
                   if [ "X$x" = "X" ]; then continue; fi
                   dt=`date`
                   echo "TSTILE: ${dt} $x"
                   ;;

If you can get a stack trace out of crash(8), that would be more
helpful.  Maybe something like:

printf 'bt 0t%d\n' "${pid}" | crash

Usually the culprit is *not* the process or thread that is stuck in
tstile, but that stack trace will help to find what mutex is at issue.
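
To collect those traces for every process currently waiting in tstile,
something along these lines might work.  This is an untested sketch:
the helper name bt_commands is invented here, and it assumes ps(1)
accepts `-o pid= -o wchan=' to print headerless pid and wait channel
columns, and that crash(8) reads commands on stdin as in the one-liner
above.

```shell
#!/bin/sh
# Sketch only: turn "pid wchan" pairs into backtrace commands for
# crash(8) or ddb.  bt_commands is a hypothetical helper name.
bt_commands() {
    # Read "pid wchan" lines on stdin and emit one `bt 0t<pid>' line
    # for each process whose wait channel starts with tstile.
    while read -r pid wchan; do
        case "${wchan}" in
            tstile*) printf 'bt 0t%d\n' "${pid}" ;;
        esac
    done
}

# Intended use (assumption: run with enough privilege for crash(8)):
#   ps -ax -o pid= -o wchan= | bt_commands | crash
```

Keeping the filter separate from the call to crash means you can check
which commands would be sent before running them against the live
kernel.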
