On Wed, 2006-05-24 at 04:53 -0400, Morris, Patrick wrote:
> How are you guys running the nsca daemon?  I've got systems that perform
> thousands of checks with no problem.
>
> I'm looking at a system right now that submits over 5300 checks to a
> central server running nsca via xinetd, and it has an average service
> latency of .153 secs.

Are you not referring to the server end?  I am running nsca as a daemon.

We are referring to the client end that sends the results.

Greg

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Greg Cope
> Sent: Wednesday, May 24, 2006 1:47 AM
> To: Jacob Ritorto
> Cc: nagios-users@lists.sourceforge.net
> Subject: Re: [Nagios-users] Re: How to reduce a very high latency number
>
> Jacob,
>
> I noticed the same thing today.
>
> We run a few distributed servers that do about 150 checks (at the
> moment) and submit this to our central server.
>
> That's a lot of send_nsca processes that get spawned.
>
> I like your fix!
>
> send_nsca would not appear to be scalable for those running lots of
> passive checks with distributed systems.
>
> Greg
>
> On Tue, 2006-05-23 at 09:48 -0400, Jacob Ritorto wrote:
> > Greetings,
> >   A colleague of mine (poctum) and I ran into something like this
> > while using nsca and have crafted a similar solution.  We observed
> > that send_nsca was sending only one result to the central Nagios
> > server per connection.  Testing revealed that send_nsca was capable of
> > handling thousands of results per connection.  Sending only one at a
> > time was resulting in lots of dropped data because there were
> > nominally about 5 results derived per second.  We enabled
> > aggregate_status_updates in the nagios.cfg file, but this yielded no
> > improvement in the result submissions.  BTW, this is Nagios-2.2 and
> > nsca-2.6 on Solaris 10.  Our workaround is a quick and dirty but
> > efficient solution.  It may not be as refined as trask's and relies on
> > nuances of unix file handling algorithms to get the job done.  That
> > said, it's working perfectly for us.  As this seems to work well, but
> > may violate Ethan's design intentions, your feedback/input is
> > requested.  Deploy at your own risk.
> >
> > Jacob Ritorto, Lead UNIX Server Operations Engineer InnovationsTech
> >
> > Here's our solution:
> >
> > 1) Altered last line in
> > /opt/nagios/libexec/eventhandlers/submit_check_result thusly.  It
> > basically concatenates check results to a temp file.
> >
> > #/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg
> >
> > /bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" >> /opt/nagios/var/results.waiting
> >
> > 2) Created a daemon process called reap (managed by smf, but it has
> > been up for a month so far, so may be ok as an init.d script) to pull
> > aside the aforementioned temp file (results.waiting) every five
> > seconds and send the bits off to the central Nagios server (note that
> > original file is re-created immediately via step 1 above).  This
> > probably only works perfectly on unix & unix-like systems due to the
> > nature of files hanging around intact until the last program
> > referencing them has exited.  It's been some time, but the last I
> > checked, DOS/WINxxxx doesn't treat files this way.  Here's the simple
> > little reap daemon:
> >
> > # cat /opt/nagios/bin/reap
> > #!/usr/bin/tcsh
> > while (1)
> >   sleep 5
> >   mv /opt/nagios/var/results.waiting /opt/nagios/var/results.sending
> >   cat /opt/nagios/var/results.sending | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg >/dev/null
> > end
> >
> > Summary: Slave Nagios servers now store up check results in the temp
> > file for 5 seconds, then they get shipped off to nsca on the central
> > Nagios machine in one swoop instead of one-at-a-time.
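
The batching in steps 1 and 2 above can also be sketched in POSIX sh. This is only an illustration, not Jacob's script: the function name reap_once and the variables WAITING, SENDING and SEND_CMD are made up for the example, and the paths and the send_nsca invocation are copied from the post. Unlike the tcsh loop, the sketch skips a cycle when nothing is queued, so a stale results.sending is not sent a second time.

```shell
#!/bin/sh
# Hypothetical names; defaults follow the paths quoted in the post.
WAITING=${WAITING:-/opt/nagios/var/results.waiting}
SENDING=${SENDING:-/opt/nagios/var/results.sending}
# Assumption: same send_nsca invocation as in the post (address elided there too).
SEND_CMD=${SEND_CMD:-"/opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg"}

# Ship one batch. Returns 0 if a batch was sent, 1 if there was nothing
# to send (or the rename failed), 2 if the sender failed.
reap_once() {
    # Missing or empty queue file: nothing to do this cycle.
    [ -s "$WAITING" ] || return 1
    # Rename first, so new results land in a fresh results.waiting.
    mv "$WAITING" "$SENDING" || return 1
    # One sender process (and one connection) for the whole batch.
    # SEND_CMD is left unquoted on purpose so it splits into words.
    $SEND_CMD < "$SENDING" >/dev/null || return 2
    rm -f "$SENDING"
}

# To run as a daemon, loop like the original reap script:
# while :; do sleep 5; reap_once; done
```

If the sender fails, the batch is left in results.sending, but the next cycle's mv overwrites it, so this sketch does not retry. Whether that is acceptable depends on how much result loss you can tolerate.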
> >
> >
> > *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
> >
> >
> > From: Trask <[EMAIL PROTECTED]>
> > Re: How to reduce a very high latency number
> > 2006-05-23 03:50
> >
> > On 5/22/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > > [EMAIL PROTECTED] wrote on 17.05.2006 20:09:16:
> > >
> > > To me this is obviously a performance issue related to hardware.
> > > Your machines have far too little RAM.  It is simply not possible to
> > > run 1800 checks on a 512MB machine in a timely manner.
> > >
> >
> > I figured this out this past Saturday.  It is not a lack of
> > hardware.  I was seeing negligible load and no excessive use of
> > memory.  No configuration change I made seemed to have any appreciable
> > effect on the latency times I was getting.  I ended up doing a "top"
> > with 1 second intervals and just watching it for a while.  I noticed
> > that sometimes there would be a good number of nagios processes,
> > 20-30-40 or so, but the majority of the time there were only 2, 3 or 4
> > processes.  Although I do not know exactly *why* this was happening,
> > it turns out that during the time when there were 2-4 processes running,
> > 2 of them were always the "submit_passive_check" script and
> > "send_nsca".  It appears that this is being done serially (i.e. not in
> > parallel) and ends up blocking subsequent checks until they are done.
> > I would see these 2 processes running (with steadily increasing PIDs)
> > for up to a minute and then a short-lived (4-5 seconds) "explosion" of
> > nagios processes (service/host checks).  After this flurry of
> > activity, it would be another 60 seconds or so of just 2-4 processes.
> >
> > I resolved this problem by changing my "submit_passive_check" script.
> > Below are some sample scripts, both old and new.  The short of it is
> > like this:  Previously, the "submit_passive_check" script did a printf
> > of the data in the appropriate format and piped it to the "send_nsca"
> > command (in a shell script).
> > I have eliminated this bottleneck by having "submit_passive_check"
> > redirect its output to a named pipe and then having another script
> > feed "send_nsca" with that data as it comes in to the named pipe.
> >
> > Latency times have dropped from 600-700 seconds to 0.2 seconds on
> > the worst server and from 45-55 seconds to 0.06 seconds on the 2nd
> > to worst.  That's more like it!
> >
> > Below are a few scripts w/ notes as to what each one is.  Thanks to
> > everyone who offered help.

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
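
Trask's scripts are not reproduced in this message. As a rough sketch of the named-pipe idea he describes, and not his actual code, something like the following POSIX sh could work. The FIFO path, the function names submit_result and feed_once, and the SEND_CMD variable are all assumptions made for the example; the send_nsca invocation is the one quoted earlier in the thread.

```shell
#!/bin/sh
# Hypothetical FIFO location; pick any path both sides can reach.
FIFO=${FIFO:-/opt/nagios/var/nsca.fifo}
# Assumption: same send_nsca invocation as quoted earlier in the thread.
SEND_CMD=${SEND_CMD:-"/opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg"}

# Writer side, run once per check result:
# $1 host, $2 service description, $3 return code, $4 plugin output.
submit_result() {
    # Refuse to write if the FIFO is missing, otherwise ">" would
    # silently create a regular file in its place.
    [ -p "$FIFO" ] || return 1
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" > "$FIFO"
}

# Reader side: one sender process drains the FIFO until every writer
# has closed it, then exits.
feed_once() {
    [ -p "$FIFO" ] || mkfifo "$FIFO" || return 1
    # SEND_CMD is left unquoted on purpose so it splits into words.
    $SEND_CMD < "$FIFO" >/dev/null
}

# To keep feeding, run the reader in a loop in its own process:
# while :; do feed_once; done
```

One caveat with this design: opening a FIFO for writing blocks until a reader has it open, so submit_result will stall if the reader loop is not running. The reader needs to be supervised (init script, smf, or similar) just like the reap daemon above.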