On 19 May 2010, at 14:21, Toni Van Remortel wrote:

I upgraded my Opsview yesterday to 3.7.0.
After I made my changes to the contacts (merging all the separate ‘contacts’ for each person into one contact with several profiles attached to it), the reload of the entire system went from 3 minutes 50 seconds to almost 30 minutes.

A 10x increase is very extreme. Can you send the output of /usr/local/nagios/var/log/create_and_send_configs.debug?

When I watch the processes on the master server, I see the nagconfgen.pl scripts running at high speed, followed by a minute of 100% CPU usage by the Nagios process. Then the master server goes quiet, but the reload indicator states it is still busy.

This is opsviewd.log:
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Starting
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Only running when OK - state is currently CRITICAL
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Only running when OK - state is currently CRITICAL
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:04:02] [import_runtime] [INFO] Starting
[2010/05/19 15:04:02] [import_runtime] [INFO] Importing for 2010-05-19 12:00:00
[2010/05/19 15:04:02] [import_runtime] [INFO] Importing all results and performance data
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing downtime starts
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing downtime ends
[2010/05/19 15:04:24] [import_runtime] [INFO] Checking for incorrect downtimes
[2010/05/19 15:04:24] [import_runtime] [INFO] Caculating relevant downtimes
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing notifications
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing acknowledgements
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing state history
[2010/05/19 15:04:25] [import_runtime] [INFO] Calculating hourly availability
[2010/05/19 15:04:31] [import_runtime] [INFO] Finished import for hour
[2010/05/19 15:04:32] [import_runtime] [INFO] Finished
[2010/05/19 15:14:06] [create_and_send_configs] [INFO] Ending overall with error=0

There's an import into ODW in the middle (import_runtime), which is irrelevant. There are lots of event handler calls to slave_node_event_handler - I assume these occur during the transfer. My guess is that the slave-node checks to the slaves are having an issue. Is the line between the master and the slave saturated? Are lots of errors causing re-transfers?



I guess this is because the config files for the contacts are now huge:
-rw-r----- 1 nagios nagios 2.2M 2010-05-19 14:47 contactgroups.cfg
-rw-r----- 1 nagios nagios  13M 2010-05-19 14:47 contacts.cfg
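(Aside for anyone hitting similar growth: a quick way to see what is inflating the generated files is to count the object definitions per file. A minimal sketch on a synthetic file; on a real install you would point it at the generated configs, e.g. /usr/local/nagios/etc/contacts.cfg:)

```shell
# Demonstration on a synthetic contacts.cfg; substitute the real
# generated file on an actual Opsview master.
cfg=$(mktemp)
printf 'define contact {\n    contact_name alice\n}\ndefine contact {\n    contact_name bob\n}\n' > "$cfg"

# Count contact definitions - a sudden jump here explains file growth.
grep -c '^define contact' "$cfg"   # prints 2

rm -f "$cfg"
```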

Did you make any other changes, such as adding new host groups or new service groups? I was expecting that 3.7.0 would decrease the size of the contacts and contactgroups files (because we've removed all the distprofile and masterprofile configurations).


Yes my slaves are reachable over slow lines, that’s the idea of a slave.

How are these configs copied to the slave? An entire copy with scp? Wouldn’t rsync be much, much better? After all, most changes between reloads are small.

We compress the config files and send them using scp; then there's a post job on the other end that extracts and places the files accordingly.

We could switch to rsync, but that would involve quite a bit of development work (there's also some post-processing we do for slave-node-specific configurations).

Ton

_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users
