On 19 May 2010, at 14:21, Toni Van Remortel wrote:

I upgraded my Opsview yesterday to 3.7.0.
After I made my changes to the contacts (merging all the separate ‘contacts’ for each person into one contact with several profiles attached to it), the reload of the entire system went from 3 minutes 50 seconds to almost 30 minutes.

A 10x increase is very extreme. Can you send the output of /usr/local/nagios/var/log/create_and_send_configs.debug?

When I watch the processes on the master server, I see the nagconfgen.pl scripts running at high speed, followed by a minute of 100% CPU usage by the Nagios process. Then the master server goes quiet, but the reload indicator states it is still busy.

This is opsviewd.log:
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Starting
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 14:59:54] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:00:28] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:00:52] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Only running on HARD state change - currently SOFT
[2010/05/19 15:01:26] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Only running when OK - state is currently CRITICAL
[2010/05/19 15:01:54] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Starting
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Only running when OK - state is currently CRITICAL
[2010/05/19 15:02:16] [slave_node_event_handler] [INFO] Finished
[2010/05/19 15:04:02] [import_runtime] [INFO] Starting
[2010/05/19 15:04:02] [import_runtime] [INFO] Importing for 2010-05-19 12:00:00
[2010/05/19 15:04:02] [import_runtime] [INFO] Importing all results and performance data
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing downtime starts
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing downtime ends
[2010/05/19 15:04:24] [import_runtime] [INFO] Checking for incorrect downtimes
[2010/05/19 15:04:24] [import_runtime] [INFO] Caculating relevant downtimes
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing notifications
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing acknowledgements
[2010/05/19 15:04:24] [import_runtime] [INFO] Importing state history
[2010/05/19 15:04:25] [import_runtime] [INFO] Calculating hourly availability
[2010/05/19 15:04:31] [import_runtime] [INFO] Finished import for hour
[2010/05/19 15:04:32] [import_runtime] [INFO] Finished
[2010/05/19 15:14:06] [create_and_send_configs] [INFO] Ending overall with error=0

There's an import into ODW in the middle (import_runtime), which is irrelevant. There are lots of event handler calls to slave_node_event_handler - I assume these occur during the transfer. My guess is that the slave-node checks to the slaves are having an issue. Is the line between the master and the slave saturated? Are lots of errors causing re-transfers?



I guess this is because the config files for the contacts are now huge:
-rw-r----- 1 nagios nagios 2.2M 2010-05-19 14:47 contactgroups.cfg
-rw-r----- 1 nagios nagios  13M 2010-05-19 14:47 contacts.cfg
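(Aside for anyone hitting similar growth: a quick way to see what is inflating the generated files is to count the object definitions per file. A minimal sketch on a synthetic file; on a real install you would point it at the generated configs, e.g. /usr/local/nagios/etc/contacts.cfg:)

```shell
# Demonstration on a synthetic contacts.cfg; substitute the real
# generated file on an actual Opsview master.
cfg=$(mktemp)
printf 'define contact {\n    contact_name alice\n}\ndefine contact {\n    contact_name bob\n}\n' > "$cfg"

# Count contact definitions - a sudden jump here explains file growth.
grep -c '^define contact' "$cfg"   # prints 2

rm -f "$cfg"
```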

Did you make any other changes, such as adding new host groups or new service groups? I was expecting that 3.7.0 would decrease the size of the contacts and contactgroups files (because we've removed all the distprofile and masterprofile configurations).


Yes my slaves are reachable over slow lines, that’s the idea of a slave.

How are these configs copied to the slave? An entire copy with scp? Wouldn’t rsync be much, much better? After all, most changes between reloads are small.

We compress the config files and send them using scp; then there's a post job on the other end that extracts and places the files accordingly.

We could switch to rsync, but that would involve quite a bit of development work (there's also some post-processing we do for slave-node-specific configurations).

Ton

_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users
