Re: 10.2 - Process stuck in unkillable sleep
On 2016-02-24 08:18, Konstantin Belousov wrote:

On Wed, Feb 24, 2016 at 02:26:19PM +1000, Paul Koch wrote:
> Occasionally we see a process get stuck in an unkillable state, and the
> only solution is a hard reboot. Occasionally == once every two weeks
> across 60+ servers, which are spread across the globe in customer sites.
> We have no remote access to these boxes.
>
> The process that most often gets stuck, though it is not the only one,
> is a large scale Ping/SNMP poller. It is a fairly simplistic C program
> that just fires out lots of ping (raw ICMP socket) and SNMP (UDP socket)
> requests asynchronously.
>
> We've managed to trap the problem a few times on a test server running
> in VirtualBox, but it also occurs on customer sites that run VMware,
> Hyper-V, QEMU, and on bare metal.
>
> We raised this PR:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204081
> but suspect it is a similar/same issue as:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992
>
> This is the info we've gathered from the most recent time it has
> occurred:
>
> # uname -a
> FreeBSD shed153.akips.com 10.2-RELEASE-p12 FreeBSD 10.2-RELEASE-p12 #0 r295070:
> Sat Jan 30 20:03:44 UTC 2016 r...@shed21.akips.com:/usr/obj/usr/src/sys/GENERIC amd64
>
> # ps auxww | grep nm-poller
> akips  1014  0.0  2.6 871820 106540  -  Ds   10Feb16 1078:59.06 nm-poller
>
> # procstat -k 1014
>   PID    TID COMM      TDNAME KSTACK
>  1014 100365 nm-poller -      mi_switch sleepq_timedwait_sig
>    _cv_timedwait_sig_sbt seltdwait kern_select sys_select amd64_syscall
>    Xfast_syscall

Yes, on HEAD it was reported that https://reviews.freebsd.org/D5221 fixed
the problem. It is still not reviewed.

I did back-port it to stable/10. The patch below is probably not
applicable to 10.2; you would need 10.3 for it. Some revisions are
missing from stable/10, but I think that the issue worked around in the
patch is at the core of the troubles many people have reported.
Index: sys/kern/kern_timeout.c
===================================================================
--- sys/kern/kern_timeout.c	(revision 295966)
+++ sys/kern/kern_timeout.c	(working copy)
@@ -1127,7 +1127,7 @@ _callout_stop_safe(c, safe)
 	 * Some old subsystems don't hold Giant while running a callout_stop(),
 	 * so just discard this check for the moment.
 	 */
-	if (!safe && c->c_lock != NULL) {
+	if ((safe & CS_DRAIN) == 0 && c->c_lock != NULL) {
 		if (c->c_lock == &Giant.lock_object)
 			use_lock = mtx_owned(&Giant);
 		else {
@@ -1207,7 +1207,7 @@ again:
 		return (0);
 	}
 
-	if (safe) {
+	if ((safe & CS_DRAIN) != 0) {
 		/*
 		 * The current callout is running (or just
 		 * about to run) and blocking is allowed, so
@@ -1319,7 +1319,7 @@ again:
 			CTR3(KTR_CALLOUT, "postponing stop %p func %p arg %p",
 			    c, c->c_func, c->c_arg);
 			CC_UNLOCK(cc);
-			return (0);
+			return ((safe & CS_MIGRBLOCK) != 0);
 		}
 		CTR3(KTR_CALLOUT, "failed to stop %p func %p arg %p",
 		    c, c->c_func, c->c_arg);
Index: sys/kern/subr_sleepqueue.c
===================================================================
--- sys/kern/subr_sleepqueue.c	(revision 295966)
+++ sys/kern/subr_sleepqueue.c	(working copy)
@@ -572,7 +572,8 @@ sleepq_check_timeout(void)
 	 * another CPU, so synchronize with it to avoid having it
 	 * accidentally wake up a subsequent sleep.
 	 */
-	else if (callout_stop(&td->td_slpcallout) == 0) {
+	else if (_callout_stop_safe(&td->td_slpcallout, CS_MIGRBLOCK)
+	    == 0) {
 		td->td_flags |= TDF_TIMEOUT;
 		TD_SET_SLEEPING(td);
 		mi_switch(SW_INVOL | SWT_SLEEPQTIMO, NULL);
Index: sys/sys/callout.h
===================================================================
--- sys/sys/callout.h	(revision 295966)
+++ sys/sys/callout.h	(working copy)
@@ -62,6 +62,9 @@ struct callout_handle {
 	struct callout *callout;
 };
 
+#define	CS_DRAIN	0x0001
+#define	CS_MIGRBLOCK	0x0002
+
 #ifdef _KERNEL
 /*
  * Note the flags field is actually *two* fields. The c_flags
@@ -81,7 +84,7 @@ struct callout_handle {
  */
 #define	callout_active(c)	((c)->c_flags & CALLOUT_ACTIVE)
 #define	callout_deactivate(c)	((c)->c_flags &= ~CALLOUT_ACTIVE)
-#define	callout_drain(c)	_callout_stop_safe(c, 1)
+#define	callout_drain(c)	_callout_stop_safe(c, CS_DRAIN)
 void	callout_init(struct callout *, int);
 void	_callout_init_lock(struct callout *, struct lock_object *, int);
 #define	callout_init_mtx(c, mtx, flags)					\

___
freebsd-stable@freebsd.org mailing list
10.2 - Process stuck in unkillable sleep
Occasionally we see a process get stuck in an unkillable state, and the only solution is a hard reboot. Occasionally == once every two weeks across 60+ servers, which are spread across the globe in customer sites. We have no remote access to these boxes.

The process that most often gets stuck, though it is not the only one, is a large scale Ping/SNMP poller. It is a fairly simplistic C program that just fires out lots of ping (raw ICMP socket) and SNMP (UDP socket) requests asynchronously.

We've managed to trap the problem a few times on a test server running in VirtualBox, but it also occurs on customer sites that run VMware, Hyper-V, QEMU, and on bare metal.

We raised this PR:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204081
but suspect it is a similar/same issue as:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200992

This is the info we've gathered from the most recent time it has occurred:

# uname -a
FreeBSD shed153.akips.com 10.2-RELEASE-p12 FreeBSD 10.2-RELEASE-p12 #0 r295070:
Sat Jan 30 20:03:44 UTC 2016 r...@shed21.akips.com:/usr/obj/usr/src/sys/GENERIC amd64

The nm-poller has no state shown in top, for some reason ??
last pid:  1847;  load averages:  0.62,  1.20,  1.33   up 13+16:06:04  13:36:46
103 processes: 1 running, 102 sleeping
CPU:  1.0% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.7% idle
Mem: 650M Active, 541M Inact, 2527M Wired, 16M Cache, 417M Buf, 217M Free
ARC: 2087M Total, 102M MFU, 1968M MRU, 18K Anon, 9409K Header, 9088K Other
Swap: 4096M Total, 256M Used, 3840M Free, 6% Inuse

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
 1013 akips       1  20    0 74076K  5544K select 1 195:41  0.59% nm-httpd
 1003 akips       1  20    0   164M 54328K select 0 236:18  0.49% nm-flow-collector
  888 root        1  20    0   101M 14920K select 0 163:56  0.39% nm-joatd
  885 akips       1  20    0 74004K  3092K nanslp 1 116:52  0.29% nm-timed
 1014 akips       1   4    0   851M   104M        0  18.0H  0.00% nm-poller
 1086 akips       1  20    0 21940K  2680K nanslp 0  66:25  0.00% top
 1015 akips       1  20    0   819M   256M nanslp 1  56:45  0.00% nm-poller-db
 1023 akips       1  20    0   114M 44760K select 0  55:00  0.00% nm-flow-meter
 1005 akips       1  20    0   159M  4172K select 0  51:00  0.00% nm-msgd
 1025 akips       1  20    0   114M 45644K select 0  44:22  0.00% nm-flow-meter
 1012 akips       1  20    0 60360K  5132K piperd 1  20:08  0.00% perl
 1027 akips       1  20    0   110M 34564K select 1  18:58  0.00% nm-flow-meter
  997 akips       1  20    0   819M 27600K select 0  12:59  0.00% nm-snmp-trapd
  991 akips       1  20    0 78104K  5384K select 1  10:53  0.00% nm-fifo-tee
  989 akips       1  20    0 78104K  5764K select 1  10:34  0.00% nm-fifo-tee
  990 akips       1  20    0 78104K  5496K select 0  10:31  0.00% nm-fifo-tee
 1047 akips       1  20    0   102M 29108K select 0  10:25  0.00% nm-flow-meter
      akips       1  20    0   102M 36000K select 0   9:18  0.00% nm-flow-meter
 1231 akips       1  20    0   102M 35952K select 1   9:17  0.00% nm-flow-meter
 1239 akips       1  20    0   102M 33132K select 0   8:51  0.00% nm-flow-meter
 1240 akips       1  20    0   102M 33132K select 1   8:51  0.00% nm-flow-meter
 1002 akips       1  20    0 74016K  3480K select 1   8:50  0.00% nm-syslogd
 1234 akips       1  20    0   102M 35920K select 1   8:49  0.00% nm-flow-meter
 1243 akips       1  20    0   102M 33148K select 0   8:46  0.00% nm-flow-meter
 1039 akips       1  20    0   820M 31388K select 0   8:46  0.00% nm-db
 1233 akips       1  20    0   102M 31256K select 0   8:43  0.00% nm-flow-meter
 1237 akips       1  20    0   102M 33168K select 0   8:43  0.00% nm-flow-meter
 1235 akips       1  20    0   102M 29040K select 1   8:41  0.00% nm-flow-meter
 1259 akips       1  20    0   102M 29096K select 0   8:40  0.00% nm-flow-meter
 1255 akips       1  20    0   102M 31756K select 1   8:40  0.00% nm-flow-meter
 1232 akips       1  20    0   102M 31780K select 1   8:39  0.00% nm-flow-meter
 1041 akips       1  20    0   820M 45284K select 0   8:34  0.00% nm-db
 1044 akips       1  20    0   820M 26172K select 1   8:28  0.00% nm-db
 1060 akips       1  20    0 74008K  3380K select 1   8:22  0.00% nm-syslog
 1077 akips       1  20    0   820M 26076K select 1   8:22  0.00% nm-db
 1048 akips       1  20    0   820M 26076K select 1   8:16  0.00% nm-db
 1045 akips       1  20    0   820M 27056K select 1   8:16  0.00% nm-db
 1046 akips       1  20    0   820M 26156K select 1   8:16  0.00% nm-db
22541 akips       1  20    0   820M 26092K select 1   8:16  0.00% nm-db
 1049 root        1  20    0   820M 26076K select 0   8:15  0.00% nm-db
 1043 akips       1  20    0   820M 26076K select 0   8:15  0.00% nm-db
 1006 akips       1  20    0 74004K