On Mon, 28 Mar 2016 12:02:27 -0400
"D'Arcy J.M. Cain" <da...@netbsd.org> wrote:
> As far as I can tell I am seeing a total of 2GB of memory used by all
> processes and resident in memory but the system (top
> and /proc/meminfo) are telling me that 17GB of memory is in use.
> What's using the other 15GB?

Meanwhile, my system crashed again. I have taken to rebooting every
morning (better a controlled five minute down time than a minimum half
hour crash). Here is what was on the screen when it locked up.

load averages:  1.74,  2.53,  2.39;    up 4+16:45:53    04:42:39
491 processes: 446 sleeping, 43 zombie, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 18G Act, 9227M Inact, 11M Wired, 86M Exec, 26G File, 12M Free
Swap: 32G Total, 32G Free

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
25248 root      85    0   148M   72M select/1    0:02  0.00%  0.00% perl
  212 root     117    0    87M   52M tstile/2    1:02  0.00%  0.00% auth
 8125 kontakt  117    0    64M   51M tstile/4    0:00  0.00%  0.00% imap
17669 ennis    117    0    59M   47M tstile/8    0:02  0.00%  0.00% imap
 2550 mailman  117    0   134M   46M tstile/0    0:34  0.00%  0.00% python2.7
    0 root       0    0     0K   45M CPU/15     51:28  0.00%  0.00% [system]
 1305 mailman  117    0   136M   44M tstile/0    0:27  0.00%  0.00% python2.7
 1691 mailman  117    0   134M   44M tstile/4    0:28  0.00%  0.00% python2.7
29932 www       85    0   362M   37M semwai/1    0:07  0.00%  0.00% httpd
17758 www       85    0   365M   34M semwai/5    0:11  0.00%  0.00% httpd
 2143 www       85    0   362M   33M semwai/0    0:10  0.00%  0.00% httpd
 2908 mailman  117    0   123M   32M tstile/8    0:25  0.00%  0.00% python2.7
 1434 root      85    0   347M   30M select/1    0:05  0.00%  0.00% httpd
 2718 sgh       85    0    43M   29M kqueue/0    0:14  0.00%  0.00% imap
12296 www       85    0   359M   29M semwai/7    0:06  0.00%  0.00% httpd
27886 www       85    0   356M   28M kqueue/1    0:03  0.00%  0.00% httpd
 5943 www       85    0   357M   27M semwai/1    0:02  0.00%  0.00% httpd
25826 www       85    0   356M   26M semwai/1    0:02  0.00%  0.00% httpd
14331 www       85    0   352M   23M semwai/4    0:01  0.00%  0.00% httpd
 2039 mailman  117    0   118M   23M tstile/1    0:25  0.00%  0.00% python2.7
 1863 postgrey  85    0    82M   21M select/1    0:41  0.00%  0.00% perl
27179 moegross 117    0    32M   20M tstile/1    0:04  0.00%  0.00% imap
27262 root     117    0    96M   18M tstile/9    0:00  0.00%  0.00% python3.4
15158 root      85    0    95M   18M flt_no/8    0:00  0.00%  0.00% python3.4
 1594 mailman  117    0   115M   16M tstile/8    0:24  0.00%  0.00% python2.7
 1720 mailman  117    0   115M   16M tstile/1    0:23  0.00%  0.00% python2.7
 2238 mailman   85    0   101M   15M select/1    0:00  0.00%  0.00% python2.7
26659 eref3     85    0    97M   15M flt_no/1    0:00  0.00%  0.00% python3.4
 7355 root      85    0   148M   15M select/9    0:00  0.00%  0.00% perl

And my memory test:

Fri Apr 1 04:39:12 EDT 2016 PS: 2085092 PROC: 32033408

It was pointed out to me that the interesting bit was that so many
processes were waiting on tstile. This may not be a swap issue after
all, at least not directly.

"Those indicate a kernel lock problem of some kind. tstile is the
wchan used for a process sleeping on an internal lock - to debug this,
you need to find out which lock it is - and probably which locks they
all are. Some of that (in fact, probably most of it) is probably
legitimate, most likely there is one process there that is locking
something (trying to) which is never going to be unlocked - either
because of a deadlock with another of them, or because something in the
code simply missed an exit path and forgot to unlock. If that process
has other locks held, then eventually some other process is going to
want one of those locks, and it hangs, perhaps while holding more locks
- then some other process is going to need a lock that's already
locked, ...

"Eventually something that is important gets locked, and everything
stops working when processes try to get that important lock that some
process that is being blocked by one of the other less important locks
has held, and the system seems to freeze - actually it is probably
still working "correctly" - if only that one, original lock, was
released..."

This led me to the following PR.

http://gnats.netbsd.org/39016

There is a bit of discussion and then it was closed with "This
particular problem has been fixed. Other problems that lead to "tstile
syndrome" still exist, because "tstile syndrome" is any generic
deadlock." It doesn't say what the fix was. Could this be some sort
of code regression?

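To see at a glance which wait channels the processes are piling up on,
a tally along the following lines might help. This is only a sketch:
count_wchans is a hypothetical helper name, not an existing tool, and
it assumes that ps accepts the same -o pid= -o wchan= options used in
the monitoring script in this message. Processes that are not sleeping
print no wchan and are skipped.

```shell
# Count processes per wait channel, most common first.
# Reads "pid wchan" pairs on stdin, one process per line.
# count_wchans is a hypothetical helper name chosen for this sketch.
count_wchans() {
    # Lines with only a pid (no wchan) are ignored.
    awk 'NF >= 2 { n[$2]++ } END { for (w in n) print n[w], w }' |
        sort -k1,1nr -k2,2
}

# Usage on a live system (assumes ps supports these options):
#   ps -ax -o pid= -o wchan= | count_wchans
```

A large tstile count in that output would match what top showed above,
but it still would not say which lock is involved.
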
I am copying tech-kern as we seem to be getting deeper into the kernel.
Replies set there as well.

Meanwhile I am running the following script, a modification of one
suggested by Robert Elz.

while true
do
    ps -ax -o pid= -o wchan= | while read pid wchan
    do
        case "${wchan}" in
        tstile*)
            x="`ps -p "${pid}" | grep tstile`"
            if [ "X$x" = "X" ]; then continue; fi
            dt=`date`
            echo "TSTILE: ${dt} $x"
            ;;
        esac
    done
    sleep 1
done

-- 
D'Arcy J.M. Cain <da...@netbsd.org>
http://www.NetBSD.org/ IM:da...@vex.net