Hi Mark,

I have been having similar problems with my distributed setup. OCP_Daemon greatly reduced the latency in returning check results, but I am still seeing (seemingly) random services go stale. I'm still trying to track down the problem; recreating it on a small scale has so far been unsuccessful. I will let you (and the list) know how my investigations go.

As for determining the execution time of a particular check, this can be found in retention.dat. The field is check_execution_time=
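If it helps, here is a rough Python sketch of how one might pull that field out per service and list the slowest checks first. It is only a sketch under assumptions: it assumes retention.dat consists of "service {" ... "}" blocks holding key=value lines (host_name, service_description, check_execution_time), which I believe is the usual layout but may differ between Nagios versions. The function names service_execution_times and report are made up for this example.

```python
import re
from typing import Iterable, List, Tuple


def service_execution_times(lines: Iterable[str]) -> List[Tuple[float, str, str]]:
    """Return (seconds, host, service) for each service block, slowest first."""
    results = []
    block = None  # key/value pairs of the service block being read, if any
    for raw in lines:
        line = raw.strip()
        if block is None:
            # Assumed block opener: "service {" on a line of its own.
            if re.fullmatch(r"service\s*\{", line):
                block = {}
        elif line == "}":
            try:
                seconds = float(block["check_execution_time"])
            except (KeyError, ValueError):
                seconds = None  # field missing or not a number: skip block
            if seconds is not None:
                results.append((
                    seconds,
                    block.get("host_name", "?"),
                    block.get("service_description", "?"),
                ))
            block = None
        elif "=" in line:
            key, _, value = line.partition("=")
            block[key] = value
    return sorted(results, reverse=True)


def report(path: str, top: int = 20) -> None:
    """Print the slowest `top` services found in the retention file at `path`."""
    with open(path, errors="replace") as handle:
        for seconds, host, service in service_execution_times(handle)[:top]:
            print(f"{seconds:8.3f}s  {host}  {service}")
```

Calling report() with the path to your retention.dat prints the slowest services first. The path depends on your install (check state_retention_file in nagios.cfg). Keep in mind the file is only rewritten periodically or at shutdown, so the numbers can lag behind what Nagios currently holds in memory.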
On 24-Jan-08, at 5:13 PM, Frost, Mark {PBG} wrote:

>
>
>> -----Original Message-----
>> From: Thomas Guyot-Sionnest [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, January 24, 2008 3:33 AM
>> To: Frost, Mark {PBG}
>> Cc: Nagios Users
>> Subject: Re: [Nagios-users] Problem with high latencies after
>> going distributed
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Some heavily broken indenting there (looks like my mail client gets
>> confused)... don't trust the number of ">"!
>>
>> On 23/01/08 10:47 PM, Frost, Mark {PBG} wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Thomas Guyot-Sionnest [mailto:[EMAIL PROTECTED]
>>>> Sent: Wednesday, January 23, 2008 10:24 PM
>>>> To: Frost, Mark {PBG}
>>>> Cc: Nagios Users
>>>> Subject: Re: [Nagios-users] Problem with high latencies after
>>>> going distributed
>>> I don't think so. I remember an email from Ton Voon some time ago
>>> asking Ethan why the oc[hs]p commands are run serially but I don't
>>> recall if there was a reply or what else was said...
>>>
>>> I believe it's either documented in the official doc or some
>>> user-contributed doc that the oc[hs]p commands should return as soon
>>> as possible. It's usually done in Perl using a fork:
>>>
>>> if (fork==0) {
>>>     # send stuff via NSCA here...
>>> }
>>> exit(0);
>>>
>>>
>>>> I guess what I'm thinking here is that unlike a custom check, I can't
>>>> see most people needing to customize the passive check result process.
>>>> All the solutions I've seen seem to include a named pipe. So why
>>>> couldn't Nagios support making the ocsp/ochp "commands" just named
>>>> pipes instead. Then instead of a standalone send_nsca binary, have the
>>>> nsca source build a send_nscaD binary (I'm making that up) that reads
>>>> from the pipe that nagios writes to and sends directly to nsca on the
>>>> server. That sort of eliminates the middle-man in the process of
>>>> reporting passive check results.
>>>
>>>> I know, I know, I'm free to write the send_nscaD.c code and send it to
>>>> Ethan :-)
>>
>> Well... I was thinking about partly re-writing nsca as an event-based
>> daemon (supporting only the --single mode, but that would be really
>> scalable) using libevent, allowing to pass along the timestamp (this is
>> a recent feature request) and supporting multi-line responses (for
>> Nagios 3) in the process, and finally suggesting this as a base for a
>> NSCA v3... I'm not even sure if I would have enough time but since my
>> main objective is to learn I wouldn't lose anything trying :).
>>
>> In the unlikely event that I write it, in the same step I could surely
>> do a C version of OCP_Daemon natively supporting the "NSCA v3" protocol
>> (it wouldn't be hard)...
>>
>> I'll have to think about it... I guess the only sane separator to write
>> multiple multi-line results on a pipe would be \000 (NULL), so there
>> would be 3 modes of operation for send_nsca (and two for nsca_sendd
>> (don't you think it sounds better reversed?)):
>> send_nsca: compatible (v2 behavior), single check (additional lines are
>> taken as additional output) and multi-check (NULL-separated)
>> nsca_sendd: single-line (one check/line, OCP_Daemon style) and
>> multi-line (NULL-separated).
>>
>>> I don't know how many people use OCP_Daemon but I had reports from a
>>> few people that greatly reduced their latency using it and I haven't
>>> had any bug reported yet. I believe it's well documented as well, but
>>> if you have any feedback on this I'll be happy to get it.
>>>
>>>> I'm playing with it a bit and have so far had good results. I'll have
>>>> some feedback after I've played with it a bit longer. Thanks for
>>>> writing it and writing up the docs for it as well!
>>
>> Pass the thanks over to Ethan who sent me a Nagios NSA t-shirt for it ;)
>>
>> Thomas
>
> I can see that using the OCP Daemon script cut down on my latencies
> quite a lot. Unfortunately, I'm still seeing some "stale" checks on the
> master server that I can't explain. I'm starting to get the feeling
> that going distributed isn't all it's cracked up to be. I haven't seen
> mention in the docs of the caveats with oc[sh]p and latencies (my books
> sure don't mention it) and even the fact that the supplied
> submit_service_check script in the distribution from Ethan is a shell
> script that pipes to send_nsca. I'm not all that excited about having
> to do a workaround for this issue.
>
> While the OCP_Daemon seems to help me, I'm a little uncomfortable
> running it as a solution to our issue. First, we don't normally have
> root access on our boxes so recreating the FIFOs could be a problem (or
> at least a wait). I'm also concerned about requiring another process
> external to Nagios as part of the process. If OCP_Daemon dies at some
> point, my distributed nodes are hosed. I had a few issues with
> correctly starting Nagios and OCP_Daemon in the right order when
> playing with it last night. Once I got it all going, it worked well but
> I'm thinking of having to explain this to someone here who isn't the
> Nagios person.
>
> I was thinking of your fork/exec comment above. What if one were to
> rewrite the "glue" shell script (the one that takes the output from
> Nagios and pipes it to send_nsca) and do something similar, but write
> it in C? Additionally, have the parent fork and exit (causing Nagios to
> think the oc[sh]p completed very quickly) then have the child go on and
> send output to send_nsca separately. For my setup, this has the
> advantage of not being a separate process that I need to make sure
> continues to run.
> It also doesn't require synchronizing listeners on both ends of a pipe
> or else one process would hang. It would almost be even better, it
> seems to me, if this script could do the send_nsca functionality
> (again, as the child) instead of even having to call send_nsca. The
> biggest drawback I can see there is that you can't edit the C program
> to show destination server, etc. You'd just about have to pile on a ton
> of command line options or have a config file for it.
>
> Just thinking out loud.
>
> On a related note, I see that according to my performance stats, some
> checks are still taking a very long time to run. Is there some easy way
> I can see check execution time per check and track down which checks
> are taking such a long time?
>
> Thanks
>
> Mark
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Nagios-users mailing list
> Nagios-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null