Re: Intermittent system hangs on 7.2-RELEASE-p1
On Thursday 10 September 2009 9:34:30 pm Linda Messerschmidt wrote: Just to follow up, I've been doing some testing with masking for KTR_LOCK rather than KTR_SCHED. I'm having trouble with this because I have the KTR buffer size set to 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for even a full second of tracing; the sample I'm working with now is just under 0.9s. It's an average of one entry every 2001 TSC ticks. That *seems* like a lot of locking activity, but some of the lock points are only a couple of lines apart, so maybe it's just incredibly verbose. Since it's so much data and I'm still working on a way to correlate it (lockgraph.py?), all I've got so far is a list of what trace points are coming up the most: 51927 src/sys/kern/kern_lock.c:215 (_lockmgr UNLOCK mtx_unlock() when flags LK_INTERLOCK) 48033 src/sys/kern/vfs_subr.c:2284 (vdropl UNLOCK) 41548 src/sys/kern/vfs_subr.c:2187 (vput VI_LOCK) 29359 src/sys/kern/vfs_subr.c:2067 (vget VI_LOCK) 29358 src/sys/kern/vfs_subr.c:2079 (vget VI_UNLOCK) 23799 src/sys/nfsclient/nfs_subs.c:755 (nfs_getattrcache mtx_lock) 23460 src/sys/nfsclient/nfs_vnops.c:645 (nfs_getattr mtx_unlock) 23460 src/sys/nfsclient/nfs_vnops.c:642 (nfs_getattr mtx_lock) 23460 src/sys/nfsclient/nfs_subs.c:815 (nfs_getattrcache mtx_unlock) 23138 src/sys/kern/vfs_cache.c:345 (cache_lookup CACHE_LOCK) Unfortunately, it kind of sounds like I'm on my way to answering why is this system slow? even though it really isn't slow. (And I rush to point out that the Apache process in question doesn't at any point in its life touch NFS, though some of the other ones on the machine do.) In order to be the cause of my Apache problem, all this goobering around with NFS would have to be relatively infrequent but so intense that it shoves everything else out of the way. I'm skeptical, but I'm sure one of you guys can offer a more informed opinion. The only other thing I can think of is maybe all this is running me out of something I need (vnodes?) so everybody else blocks until it finishes and lets go of whatever finite resource it's using up? But that doesn't make a ton of sense either, because why would a lack of vnodes cause stalls in accept() or select() in unrelated processes? Not sure if I'm going in the right direction here or not. Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in mtx_init() if MTX_SPIN is set) and use a schedgraph.py from the latest RELENG_7. It knows how to parse KTR_LOCK events and drop event bars for locks showing when they are held. A more recently schedgraph.py might also fix the bugs you were seeing with the idle threads looking too long (esp. at the start and end of graphs). -- John Baldwin ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
John Baldwin wrote: A more recently schedgraph.py might also fix the bugs you were seeing with the idle threads looking too long (esp. at the start and end of graphs). not unless something has been fixed in the last week or so. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Fri, Sep 11, 2009 at 11:02 AM, John Baldwin j...@freebsd.org wrote: Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in mtx_init() if MTX_SPIN is set) I have *no* idea what you just said. :) Which is fine. But more to the point, I have no idea how to do it. :) A more recently schedgraph.py might also fix the bugs you were seeing with the idle threads looking too long (esp. at the start and end of graphs). We are already on RELENG_7 due to the KTR-enabling rebuild, so that'd be the version we're using unless, as Julian observed, it's been fixed in the past week or so. Thanks! ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Friday 11 September 2009 11:35:14 am Julian Elischer wrote: John Baldwin wrote: A more recently schedgraph.py might also fix the bugs you were seeing with the idle threads looking too long (esp. at the start and end of graphs). not unless something has been fixed in the last week or so. Well, I wasn't sure how old of a schedgraph.py is being used. 7.1 would have the bugs, but I think 7.2 should be fine. -- John Baldwin ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Friday 11 September 2009 1:35:00 pm Linda Messerschmidt wrote: On Fri, Sep 11, 2009 at 11:02 AM, John Baldwin j...@freebsd.org wrote: Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in mtx_init() if MTX_SPIN is set) I have *no* idea what you just said. :) Which is fine. But more to the point, I have no idea how to do it. :) Something like this: Index: sys/kern/kern_mutex.c === --- sys/kern/kern_mutex.c (.../mirror/FreeBSD/stable/7) (revision 195943) +++ sys/kern/kern_mutex.c (.../stable/7) (revision 195943) @@ -747,6 +747,10 @@ if (opts MTX_NOPROFILE) flags |= LO_NOPROFILE; + /* XXX: Only log for regular mutexes. */ + if (opts MTX_SPIN) + flags |= LO_QUIET; + /* Initialize mutex. */ m-mtx_lock = MTX_UNOWNED; m-mtx_recurse = 0; A more recently schedgraph.py might also fix the bugs you were seeing with the idle threads looking too long (esp. at the start and end of graphs). We are already on RELENG_7 due to the KTR-enabling rebuild, so that'd be the version we're using unless, as Julian observed, it's been fixed in the past week or so. Hmm. It works well for me for doing traces. -- John Baldwin ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Problem in bin/sh stripping the * character through ${expansion%}
On Fri, Aug 07, 2009 at 03:26:50AM +0400, Eygene Ryabinkin wrote: Thu, Aug 06, 2009 at 11:15:12AM -0700, Doug Barton wrote: I came across this problem during a recent portmaster update. When trying to strip off the * character using variable expansion in bin/sh it doesn't work. Other special characters do work if they are properly escaped. The attached mini-script clearly shows the problem: $ sh sh-strip-problem var before stripping: foo\* var after stripping: foo\* var before stripping: foo\$ var after stripping: foo\ According to the sh(1), it is not a problem. Namely, - \* being unquoted at all will produce a lone '*'; - '*' when treated as the smallest pattern, will result in a stripping of a zero-length string -- it is the smallest pattern in the case of '*' that matches anything. That is indeed an explanation why it works that way, but I think it is wrong. Generally, the shell command language avoids unnecessary levels of quoting. In the POSIX spec, Shell Command Language, note the part about ${x#*} (pattern) and ${x#*} (literal asterisk). Also compare with case $something in \*) echo asterisk;; esac which matches a literal asterisk. Two PRs already exist for aspects of stripping: bin/57554 (double quotes) and bin/117748 (trying to match pattern matching characters literally). In order to strip the trailing star you should use - var=${var%[*]} - This gives you the pattern of '[*]' that is properly treated as the single star -- it's a weird way to escape the star in the patterns. This is indeed a good workaround. -- Jilles Tjoelker ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Fri, Sep 11, 2009 at 3:06 PM, John Baldwin j...@freebsd.org wrote: Something like this: Ah, I understand now. :) Got up to 17 seconds of trace with that change. Hmm. It works well for me for doing traces. It definitely works, it just always seems to have some-or-another weird artifact. But, with the lock info added, the locks that show big ugly gaping multi-second lock acquire bars are: unp_mtx and so_rcv_sx. I'm not 100% confident in this data yet, so I will try to get more data to confirm, but if that offers any clues about where to look, I'm all ears. I'm also a bit hazy on what the dark grey vs. light grey background is about. Thanks! ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
OK, I have learned that ktrdump looks up the name of the process associated with a particular KSE at the the time of the dump, so if it's changed since tracing stopped, it will blissfully blame the wrong process. I understand why that's the case, but it still sucks for troubleshooting. :( This time, pf task mtx and vnode_free_list are the locks getting the blame. The processes fingered are an httpd ( (the root parent of the one doing the work, which does nothing but select() for 1s and wait to see if its children died), and vnlru. No correlation at all to the previous results, and this machine is now utterly quiescent except for the httpd process and the PHP exerciser. Hard to imagine vnlru has 1s worth of running to do on a machine with 949 total vnodes in use. A third run produced a 997ms lock acquire for buffer daemon lock, a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine), and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s running bar, but according to the raw TSC values, that happened on the same CPU 1.999s *after* the 997ms buffer daemon lock acquire. I really don't know where to go from here. There's so little consistency that I'm just not sure if the data is bad, the tool is bad, the operator is bad, or there's some problem so fundamentally horrible that all I'm seeing is random side effects. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
Linda Messerschmidt wrote: OK, I have learned that ktrdump looks up the name of the process associated with a particular KSE at the the time of the dump, so if it's changed since tracing stopped, it will blissfully blame the wrong process. I understand why that's the case, but it still sucks for troubleshooting. :( This time, pf task mtx and vnode_free_list are the locks getting the blame. The processes fingered are an httpd ( (the root parent of the one doing the work, which does nothing but select() for 1s and wait to see if its children died), and vnlru. No correlation at all to the previous results, and this machine is now utterly quiescent except for the httpd process and the PHP exerciser. Hard to imagine vnlru has 1s worth of running to do on a machine with 949 total vnodes in use. A third run produced a 997ms lock acquire for buffer daemon lock, a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine), and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s running bar, but according to the raw TSC values, that happened on the same CPU 1.999s *after* the 997ms buffer daemon lock acquire. I really don't know where to go from here. There's so little consistency that I'm just not sure if the data is bad, the tool is bad, the operator is bad, or there's some problem so fundamentally horrible that all I'm seeing is random side effects. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org does the system have a serial console? how about a normal console /keyboard? how often deos it hang? and for how long? is there a chance that you could notice when it is hung and hit CTLLATESC and drop it into the debugger IN teh hung state? It is possible if you have a serial port to make a program that sends a char back and forth and when the machine hangs, sends teh magic sequence. (I think it's CRtildeCTL-D for serial debugger break, but I'm sure you can look up the kernel options and the chars in google.) if you can drop the machine into DDB (teh kernel debugger) in teh hung state, then there are lots of comands you can do to find out what is wrong. jhb actually gave a short talk that I videod and put on youtube on the topic. ps will show you what is actually running on which CPU and you an see what locks all the other processes are waiting on. then you can examine those locks and see who owns them. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
On Sat, Sep 12, 2009 at 12:06 AM, Julian Elischer jul...@elischer.org wrote: does the system have a serial console? how about a normal console /keyboard? It has an IP KVM. how often deos it hang? and for how long? Well, this is interesting. I got really frustrated with the other approach, so I thought I'd thin a machine down absolutely as far as I could, eliminate every possible source of delay, and see what happens. I killed everything... cron, RPC, NFS, devd, gmon, nrpe, everything. The Apache and its exerciser are now the only things running on the machine, and the Apache is only touching an md0 swap device mounted on /mnt. I *still* get the hangs. It hangs for all sorts of different periods, but the duration of the stall is approximately inversely proportional to the chance of seeing it. To get a short delay, you need wait only a little bit. If you want a 2-3 second delay, you may have to wait 15-20 minutes. *However* in order to answer your question, I changed up the test program, which up til now has been cycling requests every 50 ms until it gets one 2s, at which point it sysctls to stop ktr and aborts. Now it prints the timestamp of all too long requests. But I also dropped the threshold for too long from 2s to 100ms, since with everything on RAM disk, there's no longer any reason to expect a request to take more than 1-2ms in the worst case. The results are pretty profound: 1252729876: request 82 131ms 1252729883: request 210 388ms 1252729890: request 338 380ms 1252729897: request 466 388ms 1252729904: request 594 404ms 1252729919: request 849 810ms 1252729926: request 977 386ms 1252729933: request 1105 370ms 1252729940: request 1233 366ms 1252729947: request 1361 400ms 1252729961: request 1617 746ms 1252729968: request 1744 477ms 1252729975: request 1872 388ms 1252729982: request 2000 380ms 1252729989: request 2128 384ms 1252729996: request 2256 395ms It goes on and on like this, I get a 380-400ms stall every seven seconds. I have had a few come back higher, in the 750-850ms range, usually after missing a beat: 1252729897: request 466 388ms 1252729904: request 594 404ms 1252729919: request 849 810ms 1252729926: request 977 386ms 1252730010: request 2512 416ms 1252730017: request 2640 390ms 1252730031: request 2896 774ms 1252730038: request 3023 431ms 1252730454: request 10568 378ms 1252730461: request 10696 397ms 1252730475: request 10952 733ms 1252730482: request 11080 366ms So far, nothing over 1s. So what happens every seven seconds?? ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Intermittent system hangs on 7.2-RELEASE-p1
Linda Messerschmidt wrote: On Sat, Sep 12, 2009 at 12:06 AM, Julian Elischer jul...@elischer.org wrote: does the system have a serial console? how about a normal console /keyboard? It has an IP KVM. how often deos it hang? and for how long? Well, this is interesting. I got really frustrated with the other approach, so I thought I'd thin a machine down absolutely as far as I could, eliminate every possible source of delay, and see what happens. I killed everything... cron, RPC, NFS, devd, gmon, nrpe, everything. The Apache and its exerciser are now the only things running on the machine, and the Apache is only touching an md0 swap device mounted on /mnt. I *still* get the hangs. ok now we need to describe the hang.. if you can predictably get a hang every 7 seconds does this mean that it doesn't respond to keyboard for a moment every 7 seconds? or that it doesn't accept packets every 7 seconds? if you lean on the A key, do you see echo stop every 7 seconds for a moment? Or is it just the apache process that hangs? Does the watching process that you refer to below also hang? would it hang if it tried to access the disk? if the watching process is on the same machine, does it only trigger AFTER teh request has taken a ling time or could it time out with a select DURING the delayed response? (another way of asking how hung is 'hung'? It hangs for all sorts of different periods, but the duration of the stall is approximately inversely proportional to the chance of seeing it. To get a short delay, you need wait only a little bit. If you want a 2-3 second delay, you may have to wait 15-20 minutes. *However* in order to answer your question, I changed up the test program, which up til now has been cycling requests every 50 ms until it gets one 2s, at which point it sysctls to stop ktr and aborts. Now it prints the timestamp of all too long requests. But I also dropped the threshold for too long from 2s to 100ms, since with everything on RAM disk, there's no longer any reason to expect a request to take more than 1-2ms in the worst case. The results are pretty profound: 1252729876: request 82 131ms 1252729883: request 210 388ms 1252729890: request 338 380ms 1252729897: request 466 388ms 1252729904: request 594 404ms 1252729919: request 849 810ms 1252729926: request 977 386ms 1252729933: request 1105 370ms 1252729940: request 1233 366ms 1252729947: request 1361 400ms 1252729961: request 1617 746ms 1252729968: request 1744 477ms 1252729975: request 1872 388ms 1252729982: request 2000 380ms 1252729989: request 2128 384ms 1252729996: request 2256 395ms It goes on and on like this, I get a 380-400ms stall every seven seconds. I have had a few come back higher, in the 750-850ms range, usually after missing a beat: 1252729897: request 466 388ms 1252729904: request 594 404ms 1252729919: request 849 810ms 1252729926: request 977 386ms 1252730010: request 2512 416ms 1252730017: request 2640 390ms 1252730031: request 2896 774ms 1252730038: request 3023 431ms 1252730454: request 10568 378ms 1252730461: request 10696 397ms 1252730475: request 10952 733ms 1252730482: request 11080 366ms So far, nothing over 1s. So what happens every seven seconds?? ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org