Re: Intermittent system hangs on 7.2-RELEASE-p1

Linda Messerschmidt Fri, 11 Sep 2009 23:53:15 -0700

On Sat, Sep 12, 2009 at 1:47 AM, Julian Elischer <jul...@elischer.org> wrote:
> ok now we need to describe the hang..  if you can predictably get a hang
> every 7 seconds does this mean that it doesn't respond to keyboard for a
> moment every 7 seconds?


It's possible.

> or that it doesn't accept packets every 7 seconds?

It appears that it accepts & responds to at least pings; I was able to
do an every-0.1-seconds ping through a bevy of 300-1900ms stalls with:

2323 packets transmitted, 2323 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.120/1.019/5.979/0.288 ms

As best as I could tell, schedgraph also showed that the clock
interrupt and the em0 interrupt always got serviced on time.  Pretty
much seems like it's userspace that's getting put on hold.

> Or is it just the apache process that hangs?

This is where I started from.  In the original post (way long ago
now), I described how pretty much every process on the system went
into the kernel for something and stalled there, and then when the
stall ends, they all unblock at once.  I posted some examples via
ktrace that I sadly no longer have the source data for.

> Does the watching process that you refer to below also hang?

I don't think I can say for sure.  When I have it print every request,
I see visual stalls in the output from time to time even where no
stall is reported.  That could mean a stall occurred outside the
request, or it could just be my shoddy Internet connection, which has
100ms latency and a consistent 1% packet loss.

I did write a short C program that just select()s on stdin for 100ms
over and over and aborts if it takes more than 125ms to go through the
loop; it never aborts, even through 1s+ stalls and the loop times it
reports are consistently 110ms regardless of what else is going on,
which I don't think is unexpected.  However, I'm not sure why that
differs from the behavior of the "master" Apache processes, which
select() for 1 second all day long, but do appear to be affected.
Maybe because they are selecting on a network socket instead of a tty?  I
don't know.

Also, if I disable NTP, the system does not appear to lose time during
the stalls, which fits with the consistent clock interrupts I saw.

> would it hang if it tried to access the disk?

By using the md device, I believe I have removed the disk from the
equation; neither process is accessing it.

Even without doing that, if I leave iostat -w 1 running alongside the
test, there's no correlation between the tiny amount of disk activity
there is and observed stalls.

> if the watching process is on the same machine, does it only trigger AFTER
> the request has taken a long time or could it time out with a select DURING
> the delayed response? (another way of asking "how hung
> is 'hung'?")

It's just a PHP script using libcurl to request the file.  I only
moved it to the same machine in order to have it be able to write the
sysctl to stop the KTR traces I was doing.

If you're asking whether the check script could be modified to time
out after, say, 1 second, and if so, whether it would return during
the hang or after it: I don't know.  My guess based on the earlier
ktrace output is that it
would time out, but not return until the hang ended.  I'll see if
the curl lib exposes a configurable timeout and try it.
