The devs will be able to give more specifics (maybe even confirm whether 2.4
performs better for your case?), but I faced similar timeout issues
because of the time it took to "slice and dice" that many objects.
If you could enable debug mode on all nodes and provide some captures,
that would be great.
How interdependent are your hosts? Are there many parenting relations
between hosts? And what is your services-per-host ratio?
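If it helps, a quick-and-dirty script like the one below could pull those
numbers out of your flat config. Just a sketch: the path and the reliance on
standard Nagios-style "define host" / "define service" / "parents" lines are
assumptions, so adjust to your layout.

# Rough count of hosts, services and parent/child links from
# Nagios/Shinken-style object files. CFG_ROOT is an assumption;
# point it at wherever your host/service definitions live.
import os
import re

CFG_ROOT = "/etc/shinken/objects"  # adjust to your layout

hosts = services = parent_links = 0
for dirpath, _dirs, files in os.walk(CFG_ROOT):
    for name in files:
        if not name.endswith(".cfg"):
            continue
        with open(os.path.join(dirpath, name)) as f:
            text = f.read()
        hosts += len(re.findall(r"^\s*define\s+host\s*\{", text, re.M))
        services += len(re.findall(r"^\s*define\s+service\s*\{", text, re.M))
        # every comma-separated entry on a "parents" line is one link
        for m in re.finditer(r"^\s*parents\s+(.+)$", text, re.M):
            parent_links += len([p for p in m.group(1).split(",") if p.strip()])

print("hosts: %d" % hosts)
print("services: %d" % services)
print("parent links: %d" % parent_links)
if hosts:
    print("services/host: %.2f" % (services / float(hosts)))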
As far as I know from the internals, the number of objects isn't the only
factor: how they relate to each other also increases the number of "broks"
that the daemons have to handle.
In one of my production environments we have nearly 30k hosts with ping-only
checks, which produces fewer broks than an environment with 3k hosts and
30k services (another of my prod environments).
What I want to say is: your environment is considered mid-sized for Shinken
and you have some very beefy servers, so it looks like some tuning is in
order.
If the thread pool didn't help, we need to check what the daemons are doing
when they "forget" to reply to the ping.
Could you confirm the OS, Python version, and anything else you think is
relevant?
Have you checked that a firewall isn't blocking the return path? I remember
one case where a node received the ping request but the pong reply wasn't
getting back due to more restrictive firewalling...
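A plain TCP connect test, run from every node so the path is exercised in
both directions, usually settles that quickly. Sketch only; the addresses
and ports are placeholders for your actual daemons:

# Plain TCP connect test: run this on each node to exercise the path
# in both directions. Addresses/ports are placeholders.
import socket

TARGETS = [
    ("10.0.0.10", 7770),  # arbiter
    ("10.0.0.11", 7768),  # scheduler
    ("10.0.0.12", 7771),  # poller
]

for host, port in TARGETS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, port))
        print("%s:%d reachable" % (host, port))
    except socket.error as exc:
        print("%s:%d BLOCKED or down: %s" % (host, port, exc))
    finally:
        s.close()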
You mentioned that it worked with 2k hosts; I just wanted to double-check
that.
Maybe bring the system up with 0 hosts just to see if everyone is happy?
(Sorry that my e-mail is a bit confusing; it's bedtime, but I felt the need
to help with the troubleshooting... tomorrow I will try to make it
clearer.)
Regards
On 12 May 2015 at 22:26, David Good <dg...@willingminds.com> wrote:
>
> On 5/12/15 2:08 PM, Felipe openglx wrote:
>
> What is the latency between your nodes?
>
>
> The ones that are having trouble are all on the same switch -- latency is
> 0.5ms or less between them. I have 4 other remote servers for checking
> access to our website from various internet locations. They each run a
> scheduler and poller and each is in its own realm. I'm not having any
> difficulty with them, though.
>
> Have you restarted the scheduler after changing that setting?
>
>
> Yes. I restarted everything on all servers.
>
> Are you using CherryPy ?
>
>
> I think so:
>
> [1431461784] INFO: [Shinken] Initializing a CherryPy backend with 50
> threads
>
>
> I don't think it's beneficial to have too many schedulers unless you
> have pretty good retention set up between them. I'd recommend two plus
> one spare for a setup of your size.
>
>
> OK. I was unsure how much capacity would be needed and was offered 5
> servers so I went with what they gave me :-)
>
> I can make the master server run only arbiter/broker/receiver/reactionner
> and then have two others run pollers and schedulers, one be a spare poller
> and scheduler and the last be a spare arbiter/broker/receiver/reactionner.
>
> What I don't understand though, is that my current setup worked fine when
> I had around 2200 hosts being monitored and only started having major
> issues after I increased it to 3300. It seems odd that in order to handle
> the extra load I need to take away some of the servers.
>
>
>
>
_______________________________________________
Shinken-devel mailing list
Shinken-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/shinken-devel