Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread John Baldwin
On Thursday 10 September 2009 9:34:30 pm Linda Messerschmidt wrote:
 Just to follow up, I've been doing some testing with masking for
 KTR_LOCK rather than KTR_SCHED.
 
 I'm having trouble with this because I have the KTR buffer size set to
 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for
 even a full second of tracing; the sample I'm working with now is just
 under 0.9s.  It's an average of one entry every 2001 TSC ticks.  That
 *seems* like a lot of locking activity, but some of the lock points
 are only a couple of lines apart, so maybe it's just incredibly
 verbose.
 
 Since it's so much data and I'm still working on a way to correlate it
 (lockgraph.py?), all I've got so far is a list of what trace points
 are coming up the most:
 
 51927 src/sys/kern/kern_lock.c:215  (_lockmgr UNLOCK mtx_unlock() when
 flags  LK_INTERLOCK)
 48033 src/sys/kern/vfs_subr.c:2284  (vdropl UNLOCK)
 41548 src/sys/kern/vfs_subr.c:2187  (vput VI_LOCK)
 29359 src/sys/kern/vfs_subr.c:2067  (vget VI_LOCK)
 29358 src/sys/kern/vfs_subr.c:2079  (vget VI_UNLOCK)
 23799 src/sys/nfsclient/nfs_subs.c:755  (nfs_getattrcache mtx_lock)
 23460 src/sys/nfsclient/nfs_vnops.c:645  (nfs_getattr mtx_unlock)
 23460 src/sys/nfsclient/nfs_vnops.c:642  (nfs_getattr mtx_lock)
 23460 src/sys/nfsclient/nfs_subs.c:815  (nfs_getattrcache mtx_unlock)
 23138 src/sys/kern/vfs_cache.c:345  (cache_lookup CACHE_LOCK)
 
 Unfortunately, it kind of sounds like I'm on my way to answering why
 is this system slow? even though it really isn't slow.  (And I rush
 to point out that the Apache process in question doesn't at any point
 in its life touch NFS, though some of the other ones on the machine
 do.)
 
 In order to be the cause of my Apache problem, all this goobering
 around with NFS would have to be relatively infrequent but so intense
 that it shoves everything else out of the way.  I'm skeptical, but I'm
 sure one of you guys can offer a more informed opinion.
 
 The only other thing I can think of is maybe all this is running me
 out of something I need (vnodes?) so everybody else blocks until it
 finishes and lets go of whatever finite resource it's using up?  But
 that doesn't make a ton of sense either, because why would a lack of
 vnodes cause stalls in accept() or select() in unrelated processes?
 
 Not sure if I'm going in the right direction here or not.

Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in 
mtx_init() if MTX_SPIN is set) and use a schedgraph.py from the latest 
RELENG_7.  It knows how to parse KTR_LOCK events and drop event bars for 
locks showing when they are held.  A more recently schedgraph.py might also 
fix the bugs you were seeing with the idle threads looking too long (esp. at 
the start and end of graphs).

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Julian Elischer

John Baldwin wrote:



 A more recently schedgraph.py might also 
fix the bugs you were seeing with the idle threads looking too long (esp. at 
the start and end of graphs).


not unless something has been fixed in the last week or so.



___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Linda Messerschmidt
On Fri, Sep 11, 2009 at 11:02 AM, John Baldwin j...@freebsd.org wrote:
 Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in
 mtx_init() if MTX_SPIN is set)

I have *no* idea what you just said. :)

Which is fine.  But more to the point, I have no idea how to do it. :)

 A more recently schedgraph.py might also
 fix the bugs you were seeing with the idle threads looking too long (esp. at
 the start and end of graphs).

We are already on RELENG_7 due to the KTR-enabling rebuild, so that'd
be the version we're using unless, as Julian observed, it's been fixed
in the past week or so.

Thanks!
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread John Baldwin
On Friday 11 September 2009 11:35:14 am Julian Elischer wrote:
 John Baldwin wrote:
 
  
   A more recently schedgraph.py might also 
  fix the bugs you were seeing with the idle threads looking too long (esp. 
at 
  the start and end of graphs).
 
 not unless something has been fixed in the last week or so.

Well, I wasn't sure how old of a schedgraph.py is being used.  7.1 would have 
the bugs, but I think 7.2 should be fine.

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread John Baldwin
On Friday 11 September 2009 1:35:00 pm Linda Messerschmidt wrote:
 On Fri, Sep 11, 2009 at 11:02 AM, John Baldwin j...@freebsd.org wrote:
  Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in
  mtx_init() if MTX_SPIN is set)
 
 I have *no* idea what you just said. :)
 
 Which is fine.  But more to the point, I have no idea how to do it. :)

Something like this:

Index: sys/kern/kern_mutex.c
===
--- sys/kern/kern_mutex.c   (.../mirror/FreeBSD/stable/7)   (revision 
195943)
+++ sys/kern/kern_mutex.c   (.../stable/7)  (revision 195943)
@@ -747,6 +747,10 @@
if (opts  MTX_NOPROFILE)
flags |= LO_NOPROFILE;

+   /* XXX: Only log for regular mutexes. */
+   if (opts  MTX_SPIN)
+   flags |= LO_QUIET;
+
/* Initialize mutex. */
m-mtx_lock = MTX_UNOWNED;
m-mtx_recurse = 0;

  A more recently schedgraph.py might also
  fix the bugs you were seeing with the idle threads looking too long (esp. at
  the start and end of graphs).
 
 We are already on RELENG_7 due to the KTR-enabling rebuild, so that'd
 be the version we're using unless, as Julian observed, it's been fixed
 in the past week or so.

Hmm.  It works well for me for doing traces.

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Problem in bin/sh stripping the * character through ${expansion%}

2009-09-11 Thread Jilles Tjoelker
On Fri, Aug 07, 2009 at 03:26:50AM +0400, Eygene Ryabinkin wrote:
 Thu, Aug 06, 2009 at 11:15:12AM -0700, Doug Barton wrote:
  I came across this problem during a recent portmaster update. When
  trying to strip off the * character using variable expansion in bin/sh
  it doesn't work. Other special characters do work if they are
  properly escaped.

  The attached mini-script clearly shows the problem:

  $ sh sh-strip-problem
  var before stripping: foo\*
  var after stripping: foo\*

  var before stripping: foo\$
  var after stripping: foo\

 According to the sh(1), it is not a problem.  Namely,
  - \* being unquoted at all will produce a lone '*';
  - '*' when treated as the smallest pattern, will result in a stripping
of a zero-length string -- it is the smallest pattern in the case of
'*' that matches anything.

That is indeed an explanation why it works that way, but I think it is
wrong. Generally, the shell command language avoids unnecessary levels
of quoting. In the POSIX spec, Shell Command Language, note the part
about ${x#*} (pattern) and ${x#*} (literal asterisk). Also compare
with  case $something in \*) echo asterisk;; esac  which matches a
literal asterisk.

Two PRs already exist for aspects of stripping: bin/57554 (double
quotes) and bin/117748 (trying to match pattern matching characters
literally).

 In order to strip the trailing star you should use
 -
 var=${var%[*]}
 -
 This gives you the pattern of '[*]' that is properly treated as the
 single star -- it's a weird way to escape the star in the patterns.

This is indeed a good workaround.

-- 
Jilles Tjoelker
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Linda Messerschmidt
On Fri, Sep 11, 2009 at 3:06 PM, John Baldwin j...@freebsd.org wrote:
 Something like this:

Ah, I understand now. :)

Got up to 17 seconds of trace with that change.

 Hmm.  It works well for me for doing traces.

It definitely works, it just always seems to have some-or-another
weird artifact.

But, with the lock info added, the locks that show big ugly gaping
multi-second lock acquire bars are: unp_mtx and so_rcv_sx.  I'm not
100% confident in this data yet, so I will try to get more data to
confirm, but if that offers any clues about where to look, I'm all
ears.

I'm also a bit hazy on what the dark grey vs. light grey background is about.

Thanks!
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Linda Messerschmidt
OK, I have learned that ktrdump looks up the name of the process
associated with a particular KSE at the the time of the dump, so if
it's changed since tracing stopped, it will blissfully blame the wrong
process.  I understand why that's the case, but it still sucks for
troubleshooting. :(

This time, pf task mtx and vnode_free_list are the locks getting
the blame.  The processes fingered are an httpd ( (the root parent
of the one doing the work, which does nothing but select() for 1s and
wait to see if its children died), and vnlru.  No correlation at all
to the previous results, and this machine is now utterly quiescent
except for the httpd process and the PHP exerciser.  Hard to imagine
vnlru has 1s worth of running to do on a machine with 949 total vnodes
in use.

A third run produced a 997ms lock acquire for buffer daemon lock,
a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine),
and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s running bar,
but according to the raw TSC values, that happened on the same CPU
1.999s *after* the 997ms buffer daemon lock acquire.

I really don't know where to go from here.  There's so little
consistency that I'm just not sure if the data is bad, the tool is
bad, the operator is bad, or there's some problem so fundamentally
horrible that all I'm seeing is random side effects.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Julian Elischer

Linda Messerschmidt wrote:

OK, I have learned that ktrdump looks up the name of the process
associated with a particular KSE at the the time of the dump, so if
it's changed since tracing stopped, it will blissfully blame the wrong
process.  I understand why that's the case, but it still sucks for
troubleshooting. :(

This time, pf task mtx and vnode_free_list are the locks getting
the blame.  The processes fingered are an httpd ( (the root parent
of the one doing the work, which does nothing but select() for 1s and
wait to see if its children died), and vnlru.  No correlation at all
to the previous results, and this machine is now utterly quiescent
except for the httpd process and the PHP exerciser.  Hard to imagine
vnlru has 1s worth of running to do on a machine with 949 total vnodes
in use.

A third run produced a 997ms lock acquire for buffer daemon lock,
a 497ms one for ip6qlock (no, there's no IPv6 in use on this machine),
and an 8s (!!!) one on unp_mtx. bufdaemon had a 997s running bar,
but according to the raw TSC values, that happened on the same CPU
1.999s *after* the 997ms buffer daemon lock acquire.

I really don't know where to go from here.  There's so little
consistency that I'm just not sure if the data is bad, the tool is
bad, the operator is bad, or there's some problem so fundamentally
horrible that all I'm seeing is random side effects.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


does the system have a serial console? how about a normal console 
/keyboard?


how often deos it hang? and for  how long?
is there a chance that you could notice when it is hung and hit 
CTLLATESC and drop it into the debugger IN teh hung state?


It is possible if you have a serial port to make a program that sends 
a char back and forth and when the machine hangs, sends teh magic 
sequence. (I think it's CRtildeCTL-D for serial debugger break,

but I'm sure you can look up the kernel options and the chars in google.)

if you can drop the machine into DDB (teh kernel debugger) in teh
hung state, then there are lots of comands you can do to find out
what is wrong. jhb actually gave a short talk that I videod and put
on youtube on the topic.

ps will show you what is actually running on which CPU and you an see 
what locks all the other processes are waiting on.

then you can examine those locks and see who owns them.

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Linda Messerschmidt
On Sat, Sep 12, 2009 at 12:06 AM, Julian Elischer jul...@elischer.org wrote:
 does the system have a serial console? how about a normal console /keyboard?

It has an IP KVM.

 how often deos it hang? and for  how long?

Well, this is interesting.  I got really frustrated with the other
approach, so I thought I'd thin a machine down absolutely as far as I
could, eliminate every possible source of delay, and see what happens.
 I killed everything... cron, RPC, NFS, devd, gmon, nrpe, everything.
The Apache and its exerciser are now the only things running on the
machine, and the Apache is only touching an md0 swap device mounted on
/mnt.  I *still* get the hangs.

It hangs for all sorts of different periods, but the duration of the
stall is approximately inversely proportional to the chance of seeing
it.  To get a short delay, you need wait only a little bit.  If you
want a 2-3 second delay, you may have to wait 15-20 minutes.

*However* in order to answer your question, I changed up the test
program, which up til now has been cycling requests every 50 ms until
it gets one 2s, at which point it sysctls to stop ktr and aborts.

Now it prints the timestamp of all too long requests.  But I also
dropped the threshold for too long from 2s to 100ms, since with
everything on RAM disk, there's no longer any reason to expect a
request to take more than 1-2ms in the worst case.

The results are pretty profound:

1252729876: request 82 131ms
1252729883: request 210 388ms
1252729890: request 338 380ms
1252729897: request 466 388ms
1252729904: request 594 404ms
1252729919: request 849 810ms
1252729926: request 977 386ms
1252729933: request 1105 370ms
1252729940: request 1233 366ms
1252729947: request 1361 400ms
1252729961: request 1617 746ms
1252729968: request 1744 477ms
1252729975: request 1872 388ms
1252729982: request 2000 380ms
1252729989: request 2128 384ms
1252729996: request 2256 395ms

It goes on and on like this, I get a 380-400ms stall every seven
seconds.  I have had a few come back higher, in the 750-850ms range,
usually after missing a beat:

1252729897: request 466 388ms
1252729904: request 594 404ms
1252729919: request 849 810ms
1252729926: request 977 386ms

1252730010: request 2512 416ms
1252730017: request 2640 390ms
1252730031: request 2896 774ms
1252730038: request 3023 431ms

1252730454: request 10568 378ms
1252730461: request 10696 397ms
1252730475: request 10952 733ms
1252730482: request 11080 366ms

So far, nothing over 1s.

So what happens every seven seconds??
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intermittent system hangs on 7.2-RELEASE-p1

2009-09-11 Thread Julian Elischer

Linda Messerschmidt wrote:

On Sat, Sep 12, 2009 at 12:06 AM, Julian Elischer jul...@elischer.org wrote:

does the system have a serial console? how about a normal console /keyboard?


It has an IP KVM.


how often deos it hang? and for  how long?


Well, this is interesting.  I got really frustrated with the other
approach, so I thought I'd thin a machine down absolutely as far as I
could, eliminate every possible source of delay, and see what happens.
 I killed everything... cron, RPC, NFS, devd, gmon, nrpe, everything.
The Apache and its exerciser are now the only things running on the
machine, and the Apache is only touching an md0 swap device mounted on
/mnt.  I *still* get the hangs.


ok now we need to describe the hang..  if you can predictably get a 
hang every 7 seconds does this mean that it doesn't respond to 
keyboard for a moment every 7 seconds? or that it doesn't accept 
packets every 7 seconds? if you lean on the A key, do you see echo 
stop every 7 seconds for a moment?


Or is it just the apache process that hangs?

Does the watching process that you refer to below also hang?
would it hang if it tried to access the disk?
if the watching process is on the same machine, does it only trigger 
AFTER teh request has taken a ling time or could it time out with a 
select DURING the delayed response? (another way of asking how hung

is 'hung'?





It hangs for all sorts of different periods, but the duration of the
stall is approximately inversely proportional to the chance of seeing
it.  To get a short delay, you need wait only a little bit.  If you
want a 2-3 second delay, you may have to wait 15-20 minutes.

*However* in order to answer your question, I changed up the test
program, which up til now has been cycling requests every 50 ms until
it gets one 2s, at which point it sysctls to stop ktr and aborts.

Now it prints the timestamp of all too long requests.  But I also
dropped the threshold for too long from 2s to 100ms, since with
everything on RAM disk, there's no longer any reason to expect a
request to take more than 1-2ms in the worst case.

The results are pretty profound:

1252729876: request 82 131ms
1252729883: request 210 388ms
1252729890: request 338 380ms
1252729897: request 466 388ms
1252729904: request 594 404ms
1252729919: request 849 810ms
1252729926: request 977 386ms
1252729933: request 1105 370ms
1252729940: request 1233 366ms
1252729947: request 1361 400ms
1252729961: request 1617 746ms
1252729968: request 1744 477ms
1252729975: request 1872 388ms
1252729982: request 2000 380ms
1252729989: request 2128 384ms
1252729996: request 2256 395ms

It goes on and on like this, I get a 380-400ms stall every seven
seconds.  I have had a few come back higher, in the 750-850ms range,
usually after missing a beat:

1252729897: request 466 388ms
1252729904: request 594 404ms
1252729919: request 849 810ms
1252729926: request 977 386ms

1252730010: request 2512 416ms
1252730017: request 2640 390ms
1252730031: request 2896 774ms
1252730038: request 3023 431ms

1252730454: request 10568 378ms
1252730461: request 10696 397ms
1252730475: request 10952 733ms
1252730482: request 11080 366ms

So far, nothing over 1s.

So what happens every seven seconds??


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org