Re: [Nagios-users] trying to fix problem with excessive latency

Frost, Mark {PBC} Wed, 19 May 2010 08:47:17 -0700

> -----Original Message-----
> From: Corey Hickey [mailto:bugfood...@fatooh.org] 
> Sent: Tuesday, May 18, 2010 9:30 PM
> To: nagios-users@lists.sourceforge.net
> Subject: [Nagios-users] trying to fix problem with excessive latency
> 
> Hello,
> 
> I have inherited maintenance of a medium-sized Nagios installation. We 
> currently have 649 hosts and 5415 services. Our setup works nicely, with 
> one exception: Nagios falls behind on host/service checks. Our usual 
> latency once Nagios has been running for a while is about 190-200 
> seconds. Our Nagios host is reasonably powerful and isn't struggling; it 
> seems that Nagios itself is limited somehow.
>


<snip>

> Active Service Execution Time:          0.020 / 120.007 / 0.847 sec
> Active Host Execution Time:             0.020 / 11.019 / 0.069 sec
> 

<snip>

> I have a feeling I'm missing something.... I would appreciate any 
> suggestions.
> 
> Thanks,
> Corey

Corey,

I'm not an expert, but I'll relay some of my own experiences here.  I did
find that switching on use_large_installation_tweaks made a big difference
to our latencies.
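
For reference, a minimal sketch of that setting, assuming a stock Nagios 3.x
main configuration file (the file location varies by install).  As I
understand the documentation, it trades away a few features, such as
summary macros in the environment, for cheaper check execution, so it is
worth reading the option's description before enabling it:

```
# nagios.cfg (main configuration file; location varies by install)
# 0 = off (default), 1 = apply the large-installation shortcuts
use_large_installation_tweaks=1
```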

We were also following the pre-Nagios 3.2 practice of not doing active host
checks.  As the tuning guide recommends, it's actually more efficient to do
active host checks and then enable the cached check results.  When we did
that, we found that latencies leveled out on the host where we were seeing
latency issues.  (It's good to graph those values, by the way.)  They were
still high-ish, but the active host checks caused them to stop increasing
over time.
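
A rough sketch of what that combination might look like, assuming Nagios
3.x.  The template name and every number here are illustrative
placeholders, not the values we actually run:

```
# nagios.cfg: allow host checks and let recent results be reused
execute_host_checks=1
# A result younger than this many seconds is served from the cache
cached_host_check_horizon=15
cached_service_check_horizon=15

# Object configuration: schedule regular active host checks
define host{
        name                    example-host-template   ; hypothetical name
        active_checks_enabled   1
        check_interval          5       ; in time units (60 s by default)
        retry_interval          1
        max_check_attempts      3
        register                0       ; template only, not a real host
        }
```

A longer horizon gives more cache hits at the cost of acting on slightly
older state, so it is a value to tune rather than copy.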

Additionally, we found that long-running checks were messing up latencies.
As I understand it, if Nagios schedules a check and it takes a lot longer to
return than Nagios expects, that can throw off the scheduling of the other
checks.  I see you've got some check(s) that ran for a maximum of 120
seconds.  When I started seeing latency problems, I also had a service check
or two that was running for several minutes.  I tracked that down and
changed the check so that it completed (or timed out, really) more quickly,
returning status to Nagios in a matter of seconds rather than minutes.  The
latency plummeted after that.  In general, our policy is that most checks
should complete in under 30 seconds, preferably under 10.
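
One way to enforce a policy like that, sketched under assumptions: the
command name is made up, check_http merely stands in for whichever check is
slow, and the numbers only mirror the 30/10 second targets above.  The
standard plugins accept -t for their own timeout, and nagios.cfg has a hard
ceiling after which Nagios kills the check:

```
# nagios.cfg: upper bound, in seconds, before Nagios kills a check
service_check_timeout=30
host_check_timeout=10

# Object configuration: have the plugin give up on its own first
define command{
        command_name    check_http_fast         ; hypothetical name
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -t 10
        }
```

Keeping the plugin timeout below the Nagios timeout means the plugin
reports its own status rather than being killed mid-run.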

In the same vein, I'm not quite sure how you could have any host checks that
would take 11 seconds to execute.  Are you doing multiple pings/fpings to
check that a host is up?  Typically you can get away with just a single
fping rather than a series of 10 to tell you that a host is not reachable.
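
As an illustration only, here is a single-packet host check written with
check_ping, since that is what the sample configuration ships with; the
same idea applies to check_fping.  I'm assuming your host check command is
called check-host-alive, as in the sample config, and the thresholds are
the sample ones rather than a recommendation:

```
define command{
        command_name    check-host-alive
        ; -p 1 sends one ICMP packet instead of several
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }
```

With one packet, a single lost reply reads as 100% loss, so
max_check_attempts on the host is what guards against a false DOWN.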

Hope that helps.

Mark

------------------------------------------------------------------------------

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null
