Occasionally we see a process get stuck in an unkillable state and
the only solution is a hard reboot.
Occasionally == once every two weeks across 60+ servers, which are spread
across the globe in customer sites. We have no remote access to these boxes.
The process that most often that gets stuck, but not limited to, is a large
scale Ping/SNMP poller. It is a fairly simplistic C program that just fires
out lots of ping (raw ICMP socket) and SNMP (UDP socket) requests
asynchronously.
We've managed to trap the problem a few times on a test server running in
VirtualBox, but it also occurs on customer sites who run VMware, Hyper-V,
QEMU and on bare metal.
We raise this PR
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204081
but suspect it is a similar/same issue as
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
This is the info we've gathered from the most recent time it has occurred:
# uname -a
FreeBSD shed153.akips.com 10.2-RELEASE-p12 FreeBSD 10.2-RELEASE-p12 #0 r295070:
Sat Jan 30 20:03:44 UTC 2016
r...@shed21.akips.com:/usr/obj/usr/src/sys/GENERIC amd64
The nm-poller has no state in top for some reason ??
last pid: 1847; load averages: 0.62, 1.20, 1.33up 13+16:06:04 13:36:46
103 processes: 1 running, 102 sleeping
CPU: 1.0% user, 0.0% nice, 4.3% system, 0.0% interrupt, 94.7% idle
Mem: 650M Active, 541M Inact, 2527M Wired, 16M Cache, 417M Buf, 217M Free
ARC: 2087M Total, 102M MFU, 1968M MRU, 18K Anon, 9409K Header, 9088K Other
Swap: 4096M Total, 256M Used, 3840M Free, 6% Inuse
PID USERNAMETHR PRI NICE SIZERES STATE C TIMEWCPU COMMAND
1013 akips 1 200 74076K 5544K select 1 195:41 0.59% nm-httpd
1003 akips 1 200 164M 54328K select 0 236:18 0.49%
nm-flow-collector
888 root 1 200 101M 14920K select 0 163:56 0.39% nm-joatd
885 akips 1 200 74004K 3092K nanslp 1 116:52 0.29% nm-timed
1014 akips 1 40 851M 104M 0 18.0H 0.00% nm-poller
1086 akips 1 200 21940K 2680K nanslp 0 66:25 0.00% top
1015 akips 1 200 819M 256M nanslp 1 56:45 0.00%
nm-poller-db
1023 akips 1 200 114M 44760K select 0 55:00 0.00%
nm-flow-meter
1005 akips 1 200 159M 4172K select 0 51:00 0.00% nm-msgd
1025 akips 1 200 114M 45644K select 0 44:22 0.00%
nm-flow-meter
1012 akips 1 200 60360K 5132K piperd 1 20:08 0.00% perl
1027 akips 1 200 110M 34564K select 1 18:58 0.00%
nm-flow-meter
997 akips 1 200 819M 27600K select 0 12:59 0.00%
nm-snmp-trapd
991 akips 1 200 78104K 5384K select 1 10:53 0.00%
nm-fifo-tee
989 akips 1 200 78104K 5764K select 1 10:34 0.00%
nm-fifo-tee
990 akips 1 200 78104K 5496K select 0 10:31 0.00%
nm-fifo-tee
1047 akips 1 200 102M 29108K select 0 10:25 0.00%
nm-flow-meter
akips 1 200 102M 36000K select 0 9:18 0.00%
nm-flow-meter
1231 akips 1 200 102M 35952K select 1 9:17 0.00%
nm-flow-meter
1239 akips 1 200 102M 33132K select 0 8:51 0.00%
nm-flow-meter
1240 akips 1 200 102M 33132K select 1 8:51 0.00%
nm-flow-meter
1002 akips 1 200 74016K 3480K select 1 8:50 0.00% nm-syslogd
1234 akips 1 200 102M 35920K select 1 8:49 0.00%
nm-flow-meter
1243 akips 1 200 102M 33148K select 0 8:46 0.00%
nm-flow-meter
1039 akips 1 200 820M 31388K select 0 8:46 0.00% nm-db
1233 akips 1 200 102M 31256K select 0 8:43 0.00%
nm-flow-meter
1237 akips 1 200 102M 33168K select 0 8:43 0.00%
nm-flow-meter
1235 akips 1 200 102M 29040K select 1 8:41 0.00%
nm-flow-meter
1259 akips 1 200 102M 29096K select 0 8:40 0.00%
nm-flow-meter
1255 akips 1 200 102M 31756K select 1 8:40 0.00%
nm-flow-meter
1232 akips 1 200 102M 31780K select 1 8:39 0.00%
nm-flow-meter
1041 akips 1 200 820M 45284K select 0 8:34 0.00% nm-db
1044 akips 1 200 820M 26172K select 1 8:28 0.00% nm-db
1060 akips 1 200 74008K 3380K select 1 8:22 0.00% nm-syslog
1077 akips 1 200 820M 26076K select 1 8:22 0.00% nm-db
1048 akips 1 200 820M 26076K select 1 8:16 0.00% nm-db
1045 akips 1 200 820M 27056K select 1 8:16 0.00% nm-db
1046 akips 1 200 820M 26156K select 1 8:16 0.00% nm-db
22541 akips 1 200 820M 26092K select 1 8:16 0.00% nm-db
1049 root 1 200 820M 26076K select 0 8:15 0.00% nm-db
1043 akips 1 200 820M 26076K select 0 8:15 0.00% nm-db
1006 akips 1 200 74004K