Re: Random lockups on an email server - possibly kern/50168

D'Arcy J.M. Cain Sun, 03 Apr 2016 06:52:06 -0700

On Mon, 28 Mar 2016 12:02:27 -0400
"D'Arcy J.M. Cain" <da...@netbsd.org> wrote:
> As far as I can tell I am seeing a total of 2GB of memory used by all
> processes and resident in memory but the system (top
> and /proc/meminfo) are telling me that 17GB of memory is in use.
> What's using the other 15GB?


Meanwhile, my system crashed again.  I have taken to rebooting every
morning (better a controlled five-minute downtime than a crash that
costs at least half an hour).  Here is what was on the screen when it
locked up.

load averages:  1.74,  2.53,  2.39;               up 4+16:45:53        04:42:39
491 processes: 446 sleeping, 43 zombie, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 18G Act, 9227M Inact, 11M Wired, 86M Exec, 26G File, 12M Free
Swap: 32G Total, 32G Free

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
25248 root      85    0   148M   72M select/1   0:02  0.00%  0.00% perl
  212 root     117    0    87M   52M tstile/2   1:02  0.00%  0.00% auth
 8125 kontakt  117    0    64M   51M tstile/4   0:00  0.00%  0.00% imap
17669 ennis    117    0    59M   47M tstile/8   0:02  0.00%  0.00% imap
 2550 mailman  117    0   134M   46M tstile/0   0:34  0.00%  0.00% python2.7
    0 root       0    0     0K   45M CPU/15    51:28  0.00%  0.00% [system]
 1305 mailman  117    0   136M   44M tstile/0   0:27  0.00%  0.00% python2.7
 1691 mailman  117    0   134M   44M tstile/4   0:28  0.00%  0.00% python2.7
29932 www       85    0   362M   37M semwai/1   0:07  0.00%  0.00% httpd
17758 www       85    0   365M   34M semwai/5   0:11  0.00%  0.00% httpd
 2143 www       85    0   362M   33M semwai/0   0:10  0.00%  0.00% httpd
 2908 mailman  117    0   123M   32M tstile/8   0:25  0.00%  0.00% python2.7
 1434 root      85    0   347M   30M select/1   0:05  0.00%  0.00% httpd
 2718 sgh       85    0    43M   29M kqueue/0   0:14  0.00%  0.00% imap
12296 www       85    0   359M   29M semwai/7   0:06  0.00%  0.00% httpd
27886 www       85    0   356M   28M kqueue/1   0:03  0.00%  0.00% httpd
 5943 www       85    0   357M   27M semwai/1   0:02  0.00%  0.00% httpd
25826 www       85    0   356M   26M semwai/1   0:02  0.00%  0.00% httpd
14331 www       85    0   352M   23M semwai/4   0:01  0.00%  0.00% httpd
 2039 mailman  117    0   118M   23M tstile/1   0:25  0.00%  0.00% python2.7
 1863 postgrey  85    0    82M   21M select/1   0:41  0.00%  0.00% perl
27179 moegross 117    0    32M   20M tstile/1   0:04  0.00%  0.00% imap
27262 root     117    0    96M   18M tstile/9   0:00  0.00%  0.00% python3.4
15158 root      85    0    95M   18M flt_no/8   0:00  0.00%  0.00% python3.4
 1594 mailman  117    0   115M   16M tstile/8   0:24  0.00%  0.00% python2.7
 1720 mailman  117    0   115M   16M tstile/1   0:23  0.00%  0.00% python2.7
 2238 mailman   85    0   101M   15M select/1   0:00  0.00%  0.00% python2.7
26659 eref3     85    0    97M   15M flt_no/1   0:00  0.00%  0.00% python3.4
 7355 root      85    0   148M   15M select/9   0:00  0.00%  0.00% perl

And my memory test:

Fri Apr  1 04:39:12 EDT 2016
PS:        2085092
PROC:     32033408
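
The script that produced these two numbers is not shown here.  As a
rough sketch only, and assuming the PS figure is a sum of per-process
resident set sizes in kilobytes (an assumption, not something stated
above), a total like that could be gathered as follows.  The helper
name sum_rss_kb is made up for illustration.

```shell
# Hypothetical helper: add up one number per line read from stdin.
# Intended input is the RSS column of ps, in kilobytes, e.g.:
#   ps -ax -o rss= | sum_rss_kb
sum_rss_kb() {
  awk '{ s += $1 } END { printf "%d\n", s }'
}
```

Pages shared between processes are counted once per process in a sum
like this, so it can overstate what the processes really hold.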

It was pointed out to me that the interesting bit was that so many
processes were waiting on tstile.  This may not be a swap issue after
all, at least not directly.

"Those indicate a kernel lock problem of some kind.  tstile is the wchan
used for a process sleeping on an internal lock - to debug this, you
need to find out which lock it is - and probably which locks they all
are. Some of that (in fact, probably most of it) is probably
legitimate, most likely there is one process there that is locking
something (trying to) which is never going to be unlocked - either
because of a deadlock with another of them, or because something in the
code simply missed an exit path and forgot to unlock.   If that process
has other locks held, then eventually some other process is going to
want one of those locks, and it hangs, perhaps while holding more locks
- then some other process is going to need a lock that's already
locked, ...

"Eventually something that is important gets locked, and everything
stops working when processes try to get that important lock that some
process that is being blocked by one of the other less important locks
has held, and the system seems to freeze - actually it is probably
still working "correctly" - if only that one, original lock, was
released..."

This led me to the following PR.

http://gnats.netbsd.org/39016

There is a bit of discussion and then it was closed with "This
particular problem has been fixed. Other problems that lead to "tstile
syndrome" still exist, because "tstile syndrome" is any generic
deadlock."  It doesn't say what the fix was.  Could this be some sort
of code regression?

I am copying tech-kern as we seem to be getting deeper into the
kernel.  Replies set there as well.

Meanwhile I am running the following script, a modification of one
suggested by Robert Elz.

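# Once a second, log every process found sleeping on tstile.
while true
do
  ps -ax -o pid= -o wchan= | while read -r pid wchan
  do
    case "${wchan}" in
      tstile*)  # Look at the process again with -l, which includes the
                # WCHAN column; skip it if it has already left tstile.
                x=$(ps -l -p "${pid}" | grep tstile)
                if [ -z "$x" ]; then continue; fi
                dt=$(date)
                echo "TSTILE: ${dt} $x"
                ;;
    esac
  done
  sleep 1
done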
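
A one-shot count of processes per wait channel can complement a
polling loop like this, since it shows at a glance how many processes
are piled up on tstile compared with everything else.  This is only a
sketch; wchan_summary is a made-up helper name, and it assumes input
in the "pid wchan" form that ps -ax -o pid= -o wchan= prints.

```shell
# Hypothetical helper: read "pid wchan" pairs on stdin and print
# "count wchan" lines, busiest wait channel first.
# Example use:
#   ps -ax -o pid= -o wchan= | wchan_summary
wchan_summary() {
  # Lines with no wchan field are skipped.
  awk 'NF >= 2 { n[$2]++ } END { for (w in n) print n[w], w }' |
    sort -k1,1nr -k2,2
}
```

Running it from cron every minute and appending to a log would leave a
record of how the tstile count grew in the run-up to a lockup.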


-- 
D'Arcy J.M. Cain <da...@netbsd.org>
http://www.NetBSD.org/ IM:da...@vex.net
