On Sun, 3 Apr 2016 09:51:08 -0400
"D'Arcy J.M. Cain" <da...@netbsd.org> wrote:
> Meanwhile, my system crashed again. I have taken to rebooting every
> morning (better a controlled five minute down time than a minimum half
> hour crash). Here is what was on the screen when it locked up.

Based on discussions with David Maxwell I took out the daily reboot and
ran crash in a screen(1) terminal. The idea was that if I was already
in crash I could run some commands. Today it hung again. Here's the
output of top when it hung:

load averages:  0.33,  0.31,  0.55;    up 2+21:36:26    08:11:40
494 processes: 461 sleeping, 31 zombie, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 19G Act, 9272M Inact, 11M Wired, 86M Exec, 26G File, 8584K Free
Swap: 32G Total, 32G Free

  PID USERNAME PRI NICE  SIZE    RES STATE      TIME   WCPU    CPU COMMAND
    0 root       0    0    0K    45M CPU/14    27:17  0.00%  0.00% [system]
  597 root     117    0   24M  2252K tstile/1   1:12  0.00%  0.00% syslogd
29434 root     117    0   25M    14M tstile/8   1:04  0.00%  0.00% rsync
  673 root      43    0   18M  3380K CPU/15     0:39  0.00%  0.00% top
15161 root      85    0   12M  2124K kqueue/1   0:18  0.00%  0.00% log
 1713 postgrey  85    0   83M    21M select/3   0:17  0.00%  0.00% perl
  234 mailman  117    0  129M    37M tstile/1   0:16  0.00%  0.00% python2.7
 1796 mailman  117    0  122M    25M tstile/1   0:16  0.00%  0.00% python2.7
22368 druid     85    0   16M  5024K kqueue/1   0:16  0.00%  0.00% imap
 2943 mailman  117    0  124M    30M tstile/1   0:15  0.00%  0.00% python2.7
 2469 mailman  117    0  115M    17M tstile/1   0:15  0.00%  0.00% python2.7
21549 root      85    0   16M  6824K kqueue/5   0:15  0.00%  0.00% config
26849 root     117    0   89M    55M tstile/1   0:14  0.00%  0.00% auth
  235 mailman  117    0  124M    29M tstile/1   0:14  0.00%  0.00% python2.7
  233 mailman  117    0  115M    16M tstile/1   0:14  0.00%  0.00% python2.7
 3024 mailman  117    0  115M    16M tstile/2   0:14  0.00%  0.00% python2.7
16888 darcy     85    0   16M  5048K kqueue/0   0:12  0.00%  0.00% imap
16363 www       85    0  354M    38M flt_no/8   0:11  0.00%  0.00% httpd
14358 www       85    0  355M    35M kqueue/1   0:11  0.00%  0.00% httpd
 1532 root      85    0   22M    10M pause/3    0:11  0.00%  0.00% ntpd
 2245 root      85    0   48M  2472K kqueue/0   0:10  0.00%  0.00% master
25121 root      85    0   12M  1940K flt_no/1   0:10  0.00%  0.00% dovecot
21209 www       85    0  355M    34M semwai/1   0:08  0.00%  0.00% httpd
19179 root      85    0   78M  7324K select/8   0:06  0.00%  0.00% sshd
18999 gogo2    117    0   17M  5000K tstile/1   0:05  0.00%  0.00% imap
27442 www       85    0  353M    33M semwai/9   0:05  0.00%  0.00% httpd
13590 www       85    0  351M    29M semwai/5   0:05  0.00%  0.00% httpd
 2430 darcy     85    0   20M  2156K select/0   0:04  0.00%  0.00% screen-4.3.1
 2807 jbelknap 117    0   19M  7716K tstile/0   0:03  0.00%  0.00% imap
  160 root      85    0  337M    26M select/8   0:03  0.00%  0.00% httpd

Crash didn't help. When I pressed enter it dumped a ps output to the
screen, probably the last command I ran when the system was up. Here
is a partial output of that as far back as screen would go.

0   129 3  4 200 fffffe813ac685e0 coretemp1  coretemp1
0   128 3 10 200 fffffe813ac68a00 coretemp0  coretemp0
0   127 3 11 200 fffffe813ac3f1a0 ciss0      ciss0
0   118 3  0 200 fffffe813ab61140 pms0       pmsreset
0   117 3  0 200 fffffe813ab61560 atabus5    atath
0   116 3  0 200 fffffe813ab61980 atabus4    atath
0   115 3  1 200 fffffe813ab44120 atabus3    atath
0   114 3  1 200 fffffe813ab44540 atabus2    atath
0   113 3  0 200 fffffe813ab44960 atabus1    atath
0   112 3  0 200 fffffe813aa7e100 atabus0    atath
0   111 3  0 200 fffffe813aa7e520 usbtask-dr usbtsk
0   110 3  0 200 fffffe813aa7e940 usbtask-hc usbtsk
0   109 3  0 200 fffffe813a8720e0 scsibus0   sccomp
0   108 3  1 200 fffffe813a872500 lnxsyswq   lnxsyswq
0   107 3  4 200 fffffe813a872920 ipmi       ipmipoll
0   106 3 15 200 fffffe813a7f20c0 xcall/15   xcall
0   105 1 15 200 fffffe813a7f24e0 softser/15
0   104 1 15 200 fffffe813a7f2900 softclk/15
0   103 1 15 200 fffffe813a7db0a0 softbio/15
0   102 1 15 200 fffffe813a7db4c0 softnet/15
0   101 1 15 201 fffffe813a7db8e0 idle/15
0   100 3 14 200 fffffe813a7ce080 xcall/14   xcall
0    99 1 14 200 fffffe813a7ce4a0 softser/14
0    98 1 14 200 fffffe813a7ce8c0 softclk/14
0    97 1 14 200 fffffe813a7b9060 softbio/14
0    96 1 14 200 fffffe813a7b9480 softnet/14
0 >  95 7 14 201 fffffe813a7b98a0 idle/14
0    94 3 13 200 fffffe813a7aa040 xcall/13   xcall
0    93 1 13 200 fffffe813a7aa460 softser/13
0    92 1 13 200 fffffe813a7aa880 softclk/13
0    91 1 13 200 fffffe813a795020 softbio/13
0    90 1 13 200 fffffe813a795440 softnet/13
0 >  89 7 13 201 fffffe813a795860 idle/13
0    88 3 12 200 fffffe813a776000 xcall/12   xcall
0    87 1 12 200 fffffe813a776420 softser/12
0    86 1 12 200 fffffe813a776840 softclk/12
0    85 1 12 200 fffffe813a757360 softbio/12
0    84 1 12 200 fffffe813a757780 softnet/12
0 >  83 7 12 201 fffffe813a757ba0 idle/12
0    82 3 11 200 fffffe813a752340 xcall/11   xcall
0    81 1 11 200 fffffe813a752760 softser/11
0    80 1 11 200 fffffe813a752b80 softclk/11
0    79 1 11 200 fffffe813a75c320 softbio/11
0    78 1 11 200 fffffe813a75c740 softnet/11
0 >  77 7 11 201 fffffe813a75cb60 idle/11
0    76 3 10 200 fffffe813a736300 xcall/10   xcall
0    75 1 10 200 fffffe813a736720 softser/10
0    74 1 10 200 fffffe813a736b40 softclk/10
0    73 1 10 200 fffffe813a70f2e0 softbio/10
0    72 1 10 200 fffffe813a70f700 softnet/10
0 >  71 7 10 201 fffffe813a70fb20 idle/10
0    70 3  9 200 fffffe813a70a2c0 xcall/9    xcall
0    69 1  9 200 fffffe813a70a6e0 softser/9
0    68 1  9 200 fffffe813a70ab00 softclk/9
0    67 1  9 200 fffffe813a70b2a0 softbio/9
0    66 1  9 200 fffffe813a70b6c0 softnet/9
0 >  65 7  9 201 fffffe813a70bae0 idle/9
0    64 3  8 200 fffffe813a6fe280 xcall/8    xcall
0    63 1  8 200 fffffe813a6fe6a0 softser/8
0    62 1  8 200 fffffe813a6feac0 softclk/8
0    61 1  8 200 fffffe813a6e8260 softbio/8
0    60 1  8 200 fffffe813a6e8680 softnet/8
0 >  59 7  8 201 fffffe813a6e8aa0 idle/8
0    58 3  7 200 fffffe813a6b2240 xcall/7    xcall
0    57 1  7 200 fffffe813a6b2660 softser/7
0    56 1  7 200 fffffe813a6b2a80 softclk/7
0    55 1  7 200 fffffe813a6c3220 softbio/7
0    54 1  7 200 fffffe813a6c3640 softnet/7
0 >  53 7  7 201 fffffe813a6c3a60 idle/7
0    52 3  6 200 fffffe813a6b6200 xcall/6    xcall
0    51 1  6 200 fffffe813a6b6620 softser/6
0    50 1  6 200 fffffe813a6b6a40 softclk/6
0    49 1  6 200 fffffe813a6a01e0 softbio/6
0    48 1  6 200 fffffe813a6a0600 softnet/6
0 >  47 7  6 201 fffffe813a6a0a20 idle/6
0    46 3  5 200 fffffe813a67a1c0 xcall/5    xcall
0    45 1  5 200 fffffe813a67a5e0 softser/5
0    44 1  5 200 fffffe813a67aa00 softclk/5
0    43 1  5 200 fffffe813a6831a0 softbio/5
0    42 1  5 200 fffffe813a6835c0 softnet/5
0 >  41 7  5 201 fffffe813a6839e0 idle/5
0    40 3  4 200 fffffe813a66f180 xcall/4    xcall
0    39 1  4 200 fffffe813a66f5a0 softser/4
0    38 1  4 200 fffffe813a66f9c0 softclk/4
0    37 1  4 200 fffffe813a65f160 softbio/4
0    36 1  4 200 fffffe813a65f580 softnet/4
0 >  35 7  4 201 fffffe813a65f9a0 idle/4
0    34 3  3 200 fffffe813a629140 xcall/3    xcall
0    33 1  3 200 fffffe813a629560 softser/3
0    32 1  3 200 fffffe813a629980 softclk/3
0    31 1  3 200 fffffe813a61a120 softbio/3
0    30 1  3 200 fffffe813a61a540 softnet/3
0 >  29 7  3 201 fffffe813a61a960 idle/3
0    28 3  2 200 fffffe813a62d100 xcall/2    xcall
0    27 1  2 200 fffffe813a62d520 softser/2
0    26 1  2 200 fffffe813a62d940 softclk/2
0    25 1  2 200 fffffe813a6130e0 softbio/2
0    24 1  2 200 fffffe813a613500 softnet/2
0 >  23 7  2 201 fffffe813a613920 idle/2
0    22 3  1 200 fffffe813a6050c0 xcall/1    xcall
0    21 1  1 200 fffffe813a6054e0 softser/1
0    20 1  1 200 fffffe813a605900 softclk/1
0    19 1  1 200 fffffe813a5e80a0 softbio/1
0    18 1  1 200 fffffe813a5e84c0 softnet/1
0    17 1  1 201 fffffe813a5e88e0 idle/1
0    16 3  0 200 fffffe8836ef4080 sysmon     smtaskq
0    15 3  0 200 fffffe8836ef44a0 pmfsuspend pmfsuspend
0    14 3  6 200 fffffe8836ef48c0 pmfevent   pmfevent
0    13 3  0 200 fffffe883af10060 sopendfree sopendfr
0    12 3  0 200 fffffe883af10480 nfssilly   nfssilly
0    11 3 11 200 fffffe883af108a0 cachegc    cachegc
0    10 3  4 200 fffffe883df18040 vrele      vrele
0     9 3 15 200 fffffe883df18460 vdrain     vdrain
0     8 3  3 200 fffffe883df18880 modunload  mod_unld
0     7 3  0 200 fffffe883df24020 xcall/0    xcall
0     6 1  0 200 fffffe883df24440 softser/0
0     5 1  0 200 fffffe883df24860 softclk/0
0     4 1  0 200 fffffe883df2a000 softbio/0
0     3 1  0 200 fffffe883df2a420 softnet/0
0     2 1  0 201 fffffe883df2a840 idle/0
0     1 3  7 200 ffffffff810345a0 swapper    uvm

I tried doing ps/n|more and crash just hung.

I was able to get someone to plug in a monitor and keyboard. He read
this off the screen.

07:56:55 smaug dovecot: imap (eref): fatal: master: service (imap): child 11193 killed with signal 6 (core not dumped) set service imap (drop_priv_before_exec=yes)
08:07:09 smaug dovecot: imap (eref): panic: file imap-client.c: line 841 (client_check_command_hangs): assertion failed: (!have_wait_unfinished || unfinished_count > 0)
08:07:09 smaug dovecot: imap (eref): fatal: master: service (imap): child 4798 killed with signal 6 (core not dumped) set service imap (drop_priv_before_exec=yes)

I am going to look at those sources but I suspect that this is a
symptom, not a cause.

I had the on-site person press <CTRL><ALT><ESC> but it did not drop
into the debugger.

-- 
D'Arcy J.M. Cain <da...@netbsd.org>
http://www.NetBSD.org/ IM:da...@vex.net