[EMAIL PROTECTED] wrote:
> I run a nagios installation with 522 servers and 4654 service checks.
>
> When adding or removing clients, it happens that about half or perhaps
> 2/3 of the service checks lose all status retention. What is more
> concerning is that they also go back to "initial state", e.g.
> Notifications are turned off!! This is bad.
>
I'll clarify a bit here for history reasons, so that people reading the
ML archives know what's going on. I've gotten the details from our
support staff. "Adding a client" in this case means the equivalent of
running /etc/init.d/nagios reload or, in plain text, sending SIGHUP to
Nagios.

> It doesn't happen every time, and it isn't the same servers every time.
>

Does it happen with services or with hosts? If it's random, does it
more often happen with hosts/services that sort last alphabetically?

Apart from that, I'll need some more info to properly determine what's
going wrong here. What OS type/version are you using? 64 or 32-bit?
Multi-processor or single? What version of glibc are you using
(actually, what version of libpthread, but one can be inferred from the
other)?

If you're running this on VMWare on a guest OS emulating multiple CPUs,
I'm *guessing* you're running into an issue of Nagios not properly
checking for received signals before starting to write the retention
file, so the thread responsible for writing it gets killed by a signal
delivered to the controller thread. If you're running Nagios in VMWare
(a big no-no, as most know), this is more likely to happen.

You could try sending the RESTART_PROCESS command to Nagios' command
file instead, but you probably want to stagger it a bit so you don't
spam the poor FIFO in case you get lots of reload requests in a short
timeframe. For example, touch a file whenever a reload is wanted, then
have a cron job restart Nagios once every five minutes if the file
exists (make sure to remove the file after restarting, or you'll be
wasting cycles at a tremendous rate).

Needless to say, we don't have this problem and I haven't heard from
anyone else who suffers from it either, which suggests to me that you're
doing something that isn't quite normal.
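
For what it's worth, the touch-a-file-and-restart-from-cron idea could
be sketched roughly like this in Python. Treat it as an untested-in-
production illustration, not something we ship: the flag file path, the
command file path and the function name are all assumptions made up for
the example, so adjust them to your installation. The only Nagios-
specific part is the external command line format, "[epoch] COMMAND".

```python
import os
import time

# Both paths are assumptions for illustration; use the ones that match
# your own installation (see command_file in your nagios.cfg).
FLAG_FILE = "/var/tmp/nagios-restart-wanted"
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def restart_if_requested(flag_file=FLAG_FILE, command_file=COMMAND_FILE,
                         now=None):
    """Submit one RESTART_PROCESS command if the flag file exists.

    Returns True if a restart command was written, False otherwise.
    """
    if not os.path.exists(flag_file):
        return False  # nobody asked for a restart since the last run
    if not os.path.exists(command_file):
        # Command file is missing (Nagios is probably not running);
        # keep the flag so the next cron run tries again.
        return False
    timestamp = int(time.time() if now is None else now)
    # External commands are single lines of the form "[epoch] COMMAND".
    with open(command_file, "w") as cmd:
        cmd.write("[%d] RESTART_PROCESS\n" % timestamp)
    # Remove the flag so the next run does not restart Nagios again.
    os.remove(flag_file)
    return True


if __name__ == "__main__":
    restart_if_requested()
```

Whatever adds or removes clients would then just touch the flag file
instead of reloading Nagios directly, and a crontab entry running the
script every five minutes (for example "*/5 * * * *" followed by a
hypothetical path to the script) collapses any number of requests into
at most one restart per interval.
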
Having fired up our stress-test config (12000 hosts, 60000 services,
running a plugin that emulates extremely skittish behaviour and
submitting random commands every now and then) on one of our servers,
I've failed to reproduce this problem.

> Very strange. It seems to me like some kind of buffer overflow.

It's not a buffer overflow. A buffer overflow would have left your
system riddled with core dumps, and Nagios would not have continued
running after receiving the SIGHUP.

> It started when I upgraded from 2.9 to 3.0.4.
>

Strange. Given that there are no changes in the core between 3.0.4 and
3.0.5, I don't think it's worth upgrading to see if that solves the
problem (although you probably want to use 3.0.5, or the even more
fixed 3.0.5p1 from http://www.op5.org/src/nagios-3.0.5p1.tar.gz, anyway
for the security fixes they add).

If you figure out what it is, or if you can give me enough information
to reproduce it, I'll see what I can do to fix this. We're just about
to ship a release right now though, so I won't have time to do anything
about it until Monday at the earliest.

Good luck.

-- 
Andreas Ericsson                   [EMAIL PROTECTED]
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

_______________________________________________
Nagios-users mailing list
https://lists.sourceforge.net/lists/listinfo/nagios-users
