Re: 8.0: OpenSSL stat()'s NLS 500+ times causing extreme system load

2009-12-17 Thread Linda Messerschmidt
On Thu, Dec 17, 2009 at 2:05 AM, Jonathan McKeown j.mcke...@ru.ac.za wrote:
 It can also be enabled separately in nagios's main config file -
 child_processes_fork_twice is the option to look for.

Actually I had never seen that before. :)  I added this setting
immediately and it definitely cut the CPU usage down, but the load
average went way up.  No doubt that's because a lot of processes that
didn't live long enough to show up in load average accounting no
longer exist.

I'm a lot more interested in CPU usage than load average, so it's a
big win for me. :)

Thanks!
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: 8.0: OpenSSL stat()'s NLS 500+ times causing extreme system load

2009-12-16 Thread Linda Messerschmidt
On Tue, Dec 15, 2009 at 12:53 PM, Dan Nelson dnel...@allantgroup.com wrote:
 It's defined in src/lib/libc/Makefile, so you should be able to remove that
 line, rebuild libc and reinstall, and see whether your performance issue
 goes away.

I tried that and as you predicted, all the bogus stat calls went away.

Unfortunately the performance issue did not. :(  Back to the drawing
board for me!

Upon further inspection, it seems as though for each check, Nagios
spawns a process that spawns a process that spawns a process that runs
the check.  I did ktrace -i -t w -p (nagiospid) on Nagios for 30
seconds and the ktrace output contained records from 2365 different
processes spawned in that 30 seconds.  During that time, I would
expect about 800 checks to have run, so it does seem like it's right
at 3 processes per check.
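A minimal sketch of how that tally could be reproduced from kdump output, assuming the PID is the first column of every record (as in the kdump excerpts elsewhere in this thread). The function names and the sample PIDs are made up for illustration; only the 2365 and 800 figures come from the message above.

```python
def count_pids(lines):
    """Return the set of distinct PIDs found in kdump-style output."""
    pids = set()
    for line in lines:
        fields = line.split()
        # Assumption: kdump prints the PID as the first column of each record.
        if fields and fields[0].isdigit():
            pids.add(int(fields[0]))
    return pids


def processes_per_check(n_processes, n_checks):
    """Average number of processes spawned per check."""
    return n_processes / n_checks


# Illustrative records only; the PIDs and process names are hypothetical.
SAMPLE = """\
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81916 nagios   CALL  getpid
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
"""

if __name__ == "__main__":
    print(len(count_pids(SAMPLE.splitlines())), "distinct PIDs in the sample")
    # Figures quoted in the message: 2365 processes, about 800 checks.
    print(f"{processes_per_check(2365, 800):.2f} processes per check")
```

With the numbers from the 30-second trace this works out to just under 3 processes per check.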

I just don't think the system can keep up with all that fork()ing
without going all out; it's just a limit of the Nagios plugin
architecture.

But thank you very much for pointing me in the right direction!


Re: 8.0: OpenSSL stat()'s NLS 500+ times causing extreme system load

2009-12-16 Thread Jonathan McKeown
On Tuesday 15 December 2009 23:24:16 Linda Messerschmidt wrote:
 On Tue, Dec 15, 2009 at 12:53 PM, Dan Nelson dnel...@allantgroup.com wrote:
  It's defined in src/lib/libc/Makefile, so you should be able to remove
  that line, rebuild libc and reinstall, and see whether your performance
  issue goes away.

 I tried that and as you predicted, all the bogus stat calls went away.

 Unfortunately the performance issue did not. :(  Back to the drawing
 board for me!

 Upon further inspection, it seems as though for each check, Nagios
 spawns a process that spawns a process that spawns a process that runs
 the check.  I did ktrace -i -t w -p (nagiospid) on Nagios for 30
 seconds and the ktrace output contained records from 2365 different
 processes spawned in that 30 seconds.  During that time, I would
 expect about 800 checks to have run, so it does seem like it's right
 at 3 processes per check.

 I just don't think the system can keep up with all that fork()ing
 without going all out; it's just a limit of the Nagios plugin
 architecture.

You've probably already spotted this, but this behaviour is documented in 
largeinstallationtweaks.html:

``Normally Nagios will fork() twice when it executes host and service checks. 
This is done to (1) ensure a high level of resistance against plugins that go 
awry and segfault and (2) make the OS deal with cleaning up the grandchild 
process once it exits. The extra fork() is not really necessary, so it is 
skipped when you enable this option. As a result, Nagios will itself clean up 
child processes that exit (instead of leaving that job to the OS). This 
feature should result in significant load savings on your Nagios 
installation.''

It can also be enabled separately in nagios's main config file - 
child_processes_fork_twice is the option to look for.
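
To make the two modes described in the quoted documentation concrete, here is a rough Python sketch of the general double-fork pattern next to a single fork. It is not Nagios's actual code; run_check and its fork_twice parameter are names invented for illustration, and handing the plugin output back over a pipe is an assumption of the sketch.

```python
import os
import sys


def run_check(argv, fork_twice):
    """Run argv in a child process and return its stdout as bytes.

    fork_twice=True:  parent -> child -> grandchild. The child exits at
                      once, so the grandchild is reparented and the OS
                      cleans it up when it finishes.
    fork_twice=False: parent -> child. The parent reaps the check itself.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # First child. Never return into the caller's code from here.
        try:
            os.close(read_fd)
            if fork_twice and os.fork() > 0:
                os._exit(0)           # intermediate process exits immediately
            os.dup2(write_fd, 1)      # send the check's stdout down the pipe
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)             # only reached if something above failed
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()          # EOF once every writer has exited
    os.waitpid(pid, 0)                # reap the direct child
    return output


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "print('ok')"]
    print(run_check(cmd, fork_twice=True))
    print(run_check(cmd, fork_twice=False))
```

In the single-fork case the waitpid() call is what reaps the check; in the double-fork case it only reaps the short-lived intermediate process, and the check itself is left to the OS. The single-fork path therefore saves one fork() per check.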

Jonathan


8.0: OpenSSL stat()'s NLS 500+ times causing extreme system load

2009-12-10 Thread Linda Messerschmidt
We have a Nagios server (ports/net-mgmt/nagios) that has a lot of
check_nrpe2 (ports/net-mgmt/nrpe2) checks.

We recently upgraded the server it runs on to 8.0-STABLE (r199975).
The performance has never been great, but now it's really atrocious
and I'm trying to figure out what's going on.

The machine (a dual-core Nehalem) has a load average of 5 - 10 at all
times, and top shows 100% CPU usage, 75% system CPU usage.  No
process has more than a few % CPU though.

This is due to the large number of very short-lived processes doing
individual Nagios checks that don't live long enough to appear in top.

I investigated in some more detail with ktrace and found that each
execution of check_nrpe2 performs 520 stat() calls.  The bulk of them
look like this:

 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  /usr/share/nls/C/libc.cat
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  /usr/share/nls/libc/C
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  /usr/local/share/nls/C/libc.cat
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  /usr/local/share/nls/libc/C
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)

kdump also shows 70 calls to getpid, which seems excessive.  (About 50
of them appear to be in a tight loop.)
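
A small sketch of how such per-syscall and per-path tallies could be pulled out of kdump output, assuming the record layout shown in the excerpt above (PID, command name, record type, then the detail). The helper names are made up for illustration, and the command name is assumed to contain no spaces.

```python
import re
from collections import Counter

# Assumed layout: "<pid> <command> CALL  <syscall>(args)" and
#                 "<pid> <command> NAMI  <path>"
CALL_RE = re.compile(r"^\s*\d+\s+\S+\s+CALL\s+(\w+)")
NAMI_RE = re.compile(r"^\s*\d+\s+\S+\s+NAMI\s+(.*)$")


def count_calls(lines):
    """Count CALL records per syscall name."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_lookups(lines):
    """Count NAMI (path lookup) records per path."""
    counts = Counter()
    for line in lines:
        match = NAMI_RE.match(line)
        if match:
            # Some kdump versions quote the path; drop the quotes if present.
            counts[match.group(1).strip().strip('"')] += 1
    return counts


# Records copied from the excerpt above.
SAMPLE = """\
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  /usr/share/nls/C/libc.cat
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
 81915 check_nrpe2 CALL  stat(0x7fbfde28,0x7fbfddc4)
 81915 check_nrpe2 NAMI  /usr/share/nls/libc/C
 81915 check_nrpe2 RET   stat -1 errno 2 No such file or directory
"""

if __name__ == "__main__":
    print(count_calls(SAMPLE.splitlines()).most_common())
    print(count_lookups(SAMPLE.splitlines()).most_common())
```

Run over a full dump, the first tally would give the stat() and getpid() totals, and the second would show which paths are being looked up repeatedly.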

The check_nrpe2 program simply opens an SSL socket to a remote server,
sends a short request and gets a short response.  It is a pretty
simple program. (~22k of source)

The calls to getpid() bother me a bit, but I think the NLS is the real problem:

$ kdump -E -t n | fgrep /nls/ | head -1
 81915 check_nrpe2 0.016815 NAMI  /usr/share/nls/C/libc.cat
$ kdump -E -t n | fgrep /nls/ | tail -1
 81915 check_nrpe2 0.135663 NAMI  /usr/local/share/nls/libc/C
$ kdump -E | tail -1
 81915 check_nrpe2 0.222510 CALL  exit(0x1)
$ kdump -E -t n | fgrep /nls/ | wc
     508    2540   32004

So this program spends over half its life looping over 508 stat()
calls looking for a nonexistent libc.cat file.  And then another chunk
(probably a lot smaller, but not measured) looping over getpid().
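
The "over half its life" figure can be checked with a few lines of Python. This is a sketch that assumes the `kdump -E` layout shown above (PID, command, elapsed seconds, record type); the function name is invented for illustration, and the three sample records are the ones from the output above.

```python
def nls_time_fraction(lines):
    """Fraction of the traced run between the first and last /nls/ lookup."""
    nls_times = []
    last_time = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            # Assumption: third column is the elapsed time from kdump -E.
            elapsed = float(fields[2])
        except ValueError:
            continue
        last_time = elapsed
        if fields[3] == "NAMI" and "/nls/" in line:
            nls_times.append(elapsed)
    if not nls_times or last_time == 0.0:
        return 0.0
    return (nls_times[-1] - nls_times[0]) / last_time


# First NLS lookup, last NLS lookup, and final record from the post.
SAMPLE = """\
 81915 check_nrpe2 0.016815 NAMI  /usr/share/nls/C/libc.cat
 81915 check_nrpe2 0.135663 NAMI  /usr/local/share/nls/libc/C
 81915 check_nrpe2 0.222510 CALL  exit(0x1)
"""

if __name__ == "__main__":
    print(f"{nls_time_fraction(SAMPLE.splitlines()):.0%} of the run")
```

That is (0.135663 - 0.016815) / 0.222510, or roughly 53% of the program's lifetime spent between the first and last NLS lookup.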

Both appear to be related to SSL; if I set up nrpe not to use it, both
excesses go away and the program finishes in about half the time,
using about half the CPU resources.

To confirm that it was SSL-related, I tried:

$ ktrace openssl s_client -connect x2:5666

And I got the exact same stat() and getpid() behavior.

Obviously there is some small CPU overhead associated with SSL.  This
is not about that.  This is about the system overhead induced by
calling stat 500+ times on a directory that doesn't exist.
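
For a rough feel of that overhead in isolation, a sketch like the following times a burst of failing stat() calls from a single process. The function name and the path are placeholders chosen for illustration, the count of 508 is taken from the kdump output above, and a single-process loop like this says nothing about contention between concurrent checks.

```python
import os
import time


def failed_stats(path, count):
    """stat() a path `count` times; return (failures, elapsed seconds)."""
    failures = 0
    start = time.perf_counter()
    for _ in range(count):
        try:
            os.stat(path)
        except FileNotFoundError:
            failures += 1
    return failures, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder path that is assumed not to exist on the test machine.
    failures, elapsed = failed_stats("/nonexistent/nls/C/libc.cat", 508)
    print(f"{failures} failed stat() calls in {elapsed * 1000:.3f} ms")
```

Comparing the elapsed time against the total runtime of one check gives an idea of how much of the system CPU the lookups account for on a given machine.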

This gets a little worse.  Because there are several checks running at
any given time, there is a lot of contention on VFS lookups for this
handful of paths.  That's an area where FreeBSD has known SMP
performance issues I've seen discussed elsewhere, and this is a
pathological worst case.  The net result: a dual-core machine is
brought to its knees by a relatively simple Nagios setup.

Anyway, long story short, why is OpenSSL doing this and how can we make it stop?

Thanks for any suggestions!