Re: [Shinken-devel] Problem with timeouts

David Good Wed, 13 May 2015 11:31:43 -0700

According to the logs, the arbiter takes a bit over 2 minutes (145 seconds) from the time it logs its version and copyright until it logs "Dispatching Realm All":

[1431476926] INFO: [Shinken] Shinken 2.2
[1431476926] INFO: [Shinken] Copyright (c) 2009-2014:
...
[1431476927] INFO: [Shinken] Begin to dispatch configurations to satellites
...
[1431477071] INFO: [Shinken] Dispatching Realm All
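
The 145-second figure can be checked from the epoch timestamps at the start of each log line. Here is a minimal Python sketch, assuming the "[epoch] LEVEL: message" layout shown above; the function names are made up for illustration and are not part of Shinken:

```python
import re

# Matches lines such as "[1431476926] INFO: [Shinken] Shinken 2.2".
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(\w+):\s+(.*)$")


def parse_line(line):
    """Return (epoch_seconds, level, message), or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3)


def elapsed_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker; None if either one is missing."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip elided or malformed lines
        ts, _level, msg = parsed
        if start is None:
            if start_marker in msg:
                start = ts
        elif end_marker in msg:
            return ts - start
    return None


log = [
    "[1431476926] INFO: [Shinken] Shinken 2.2",
    "[1431476926] INFO: [Shinken] Copyright (c) 2009-2014:",
    "...",
    "[1431476927] INFO: [Shinken] Begin to dispatch configurations to satellites",
    "...",
    "[1431477071] INFO: [Shinken] Dispatching Realm All",
]
print(elapsed_between(log, "Shinken 2.2", "Dispatching Realm All"))  # 145
```

The same helper can be pointed at other pairs of messages to see which phase of the reload takes the most time.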

I haven't set up the automatic reload time yet. I'm thinking probably once an hour, but only if the configuration has changed. No point in reloading if the configuration is the same.
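
One way to reload only when the configuration has changed is to fingerprint the configuration directory and compare it with the fingerprint recorded on the previous run. This is a rough Python sketch of that idea, not something Shinken provides; the function names and the state file are assumptions, and the state file must live outside the configuration directory so it does not change the fingerprint itself:

```python
import hashlib
import os


def config_fingerprint(config_dir):
    """SHA-256 over the relative path and content of every file under config_dir."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(config_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, config_dir).encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as fh:
                digest.update(fh.read())
            digest.update(b"\0")
    return digest.hexdigest()


def config_changed(config_dir, state_file):
    """Return True (and record the new fingerprint) if the configuration
    differs from the fingerprint stored in state_file (hypothetical path)."""
    current = config_fingerprint(config_dir)
    previous = None
    if os.path.exists(state_file):
        with open(state_file) as fh:
            previous = fh.read().strip()
    if current == previous:
        return False
    with open(state_file, "w") as fh:
        fh.write(current)
    return True
```

An hourly job could call config_changed() and trigger the reload only when it returns True. The sketch records the new fingerprint before any reload happens, so a wrapper that cares about failed reloads would want to write the state file only after the reload succeeds.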

On 5/12/15 7:11 PM, Andy Xie wrote:

I know this is unrelated to the problem, but can you tell me how long the reload takes when you run "shinken reload"? We have about 2k hosts and 20k services, and the reload takes about 2 min 30 sec. That is only the arbiter reload time; it takes about another 3 min before the configuration has been sent to all the daemons and all the satellites are working properly, which adds up to roughly 5 min. That is too long.

Another question: how often do you reload your configuration? Every 5 minutes, every hour, or even once a day?

BTW, I see some redispatching in the arbiter too. We have only one master broker; the other one is a slave. Almost every time I reload the configuration, the broker's configuration needs to be redispatched. Even when the dispatch to the broker has taken more than 80 sec, it still returns True for the transmission. But when the arbiter checks the first dispatch in "check_bad_dispatch", it says that the broker configuration may be lost, and a redispatch is issued. Sometimes the same happens for the reactionner too.

What puzzles me is that the configuration for the scheduler is much bigger than the configuration for the reactionner and broker, yet the dispatch to the scheduler is fine while the dispatch to the reactionner and broker is not.

++++++

Ning Xie

2015-05-13 7:16 GMT+08:00 David Good <dg...@willingminds.com>:

On 5/12/15 2:46 PM, Felipe openglx wrote:
> The devs will be able to give more specifics (maybe even confirm if
> 2.4 performs better for your case?) but I faced similar issues with
> timeout because of the time it took to "slice and dice" the amount of
> objects.
> If you can enable debug mode on all nodes and provide some captures it
> would be great.

OK -- I'll see about setting that up.
>
> How is the interdependency of your hosts? Is there a lot of parenting
> relation between hosts? And the ratio of services/host?

There is some parenting, but not that much -- about 1000 of the hosts
have parents (vm guests on hypervisors). When I started having trouble,
there were a total of 3300 hosts and 30000 services.
>
> As far as I know from the internals, the number of objects isn't the
> only factor: how they relate to each other will increase the number of
> "broks" that the daemons have to handle.
> In one of my production environments we have nearly 30k hosts with
> ping-only checks, which produces fewer broks than an environment
> with 3k hosts and 30k services (one of my other prod envs).
>
>
> What I want to say is: your environment is considered mid-sized for
> Shinken and you have some very beefy servers, so it looks like there
> is some tuning to be done.
> If the thread pool didn't help, we need to check what the daemons are
> doing that makes them "forget" to reply to the ping.
>
> Could you confirm OS, Python version, and anything else you think is
> relevant?

Scientific Linux 6.5 (pretty much the same as RHEL or CentOS 6.5).
Python 2.6.6 (installed from system-provided RPM -
python-2.6.6-51.el6.x86_64).

We do have a *lot* of hostgroups included in host definitions. Probably
averaging around 10 per host. This would've probably at least doubled
with the new host config generation scheme.

>
> Have you checked that a firewall isn't blocking a return path? I
> remember one case where a node received the ping request but the pong
> reply wasn't being received due to more rigorous firewalling...
> You mentioned that with 2k hosts it worked, just wanted to double
> check that.
> Maybe bring the system up with 0 hosts just to see if everyone is happy?

No firewalls and the servers are all on the same switch and are even on
adjacent ports. None of them are running iptables.
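
For anyone who wants to rule out the network quickly, a plain TCP connect test against each daemon's listening port is usually enough. This is a minimal Python sketch using only the standard socket module; the function names are made up, and the hosts and ports have to come from your own daemon definitions. It only shows that the port is reachable from the machine running the check; it does not exercise the daemons' own ping/pong exchange:

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
    conn.close()
    return True


def check_daemons(daemons, timeout=3.0):
    """daemons maps a label to (host, port); returns {label: reachable}."""
    return dict(
        (label, port_reachable(host, port, timeout))
        for label, (host, port) in daemons.items()
    )
```

Running check_daemons() from each server against all the others would show whether any direction is blocked, since a one-way filter only shows up when the check is made from the affected side.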

It was happy before I made some major changes in how the host/service
configuration is generated (host definitions are now generated from an
internal inventory database) which resulted in adding another 1000 or so
hosts. It has since been decided that about 1000 servers don't need to
be monitored so now I'm down to about 2200 servers. We'll see if that
helps any.
>
>
> (Sorry that my e-mail is a bit confusing, it's bed time but felt the
> need to assist on the troubleshooting.... tomorrow I will try to make
> it clearer).
>

No problem. I really appreciate the help.

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Shinken-devel mailing list
Shinken-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/shinken-devel