Hi David,

Thank you for the detailed description.
I won't run HA across different locations; probably just on different switches/racks. That should be enough to avoid single points of failure in the hardware/network infrastructure, and the rest is redundant enough. I'm trying to find the best solution for HA of the servers themselves. I'll probably use CentOS, but I'm a little in doubt about the shared filesystem on DRBD: I don't have much experience with it and I hope not to run into problems like an inconsistent DRBD partition.

At the moment I use a simple Nagios installation (without Opsview) with an rsync on a cron basis every 5 minutes. Migrating to Opsview, I was looking for something better. Is NFS more stable? The shared partition should be mounted ONLY by the active master.

About the load: right now we have 2000 services across 800 hosts. The new Opsview installation should monitor more or less 6-7000 services and 2000 hosts, merging different monitoring infrastructures. To keep most of the load off the masters I'll introduce several slaves (I haven't tested cluster slaves yet, hope they work :)), which should reduce the active checks on the masters.
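For context, the rsync setup I mentioned is nothing more than a cron entry along these lines (paths and hostname here are just examples, not the real configuration):

  */5 * * * * rsync -az --delete /usr/local/nagios/ standby:/usr/local/nagios/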
Last but not least, MySQL will reside on an external MySQL Cluster server.

Any other suggestions are welcome, too!

Experience is always a good starting point!

Thanks

Simon

David LaPorte wrote on 17/03/2010 15:28:
We currently run OpsView with heartbeat and DRBD on two geographically diverse 
servers.  We're monitoring 702 hosts and 2184 services, and the DRBD 
replication uses a fairly consistent 12 Mbit/s of bandwidth.  We keep 
/usr/local/nagios/, /usr/local/opsview-web/, /usr/local/opsview-reports/, 
/etc/httpd/, and /var/lib/mysql/ on our DRBD mount point.
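To give an idea of the layout (this is a generic sketch, not a copy of our 
actual fstab; bind mounts are one option, symlinks work just as well): a single 
DRBD-backed filesystem is mounted by heartbeat on the active node and the 
service directories live on it, e.g.

  /dev/drbd0           /replicated        ext3  noauto        0 0
  /replicated/nagios   /usr/local/nagios  none  bind,noauto   0 0
  /replicated/mysql    /var/lib/mysql     none  bind,noauto   0 0
  /replicated/httpd    /etc/httpd         none  bind,noauto   0 0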

One trick I used was to mirror the relevant main routing table entries into the 
local table with the source address set to the shared heartbeat IP, triggered 
by heartbeat through a script.  This makes sure that checks come from a 
predictable source address for ACL/config purposes.  I think some OSs don't let 
you modify the local routing table, but CentOS 5 does.  The reason for using 
the local table is that it has a higher preference than the main table, so you 
don't have to mess with adding routes on heartbeat takeover and removing them 
on standby: when you lose the IP address, the routes you added to the local 
table are dropped automatically, and traffic falls back to the main table and 
your original routes.
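Roughly, the heartbeat-triggered script boils down to something like this (the 
prefixes and addresses below are made-up placeholders, not our actual routes):

  #!/bin/sh
  # Sketch only -- addresses and prefixes are placeholders.
  # Mirror the relevant main-table routes into the "local" table, forcing
  # the shared heartbeat IP as the preferred source while this node holds it.
  VIP=192.0.2.10
  case "$1" in
    start)
      ip route add table local default    via 192.0.2.1 src "$VIP"
      ip route add table local 10.0.0.0/8 via 192.0.2.1 src "$VIP"
      ;;
    stop)
      # Usually unnecessary: once heartbeat drops the VIP the kernel drops
      # these routes too and traffic falls back to the main table.
      ip route del table local default    via 192.0.2.1 2>/dev/null
      ip route del table local 10.0.0.0/8 via 192.0.2.1 2>/dev/null
      ;;
  esac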

Right now, our single points of failure are the locations themselves.  We 
monitor hosts in public IP space through a default route at one location and 
have a static route to the other physical location for monitoring hosts in 
private IP space.  That's not the worst of it, because if we lose the location 
where the majority of the private IP space lives, we won't be able to monitor 
much of it anyway.  :)  The problem is that if we lose the location holding the 
default (and public) route, we unnecessarily cut off our ability to monitor 
hosts in public IP space.  We're looking at more dynamic options for that.

Also, I understand that running heartbeat over this kind of distance is 
considered bad practice.  I could write a script that takes services down and 
brings them up and just run it on each server for a manual failover, but I like 
heartbeat scheduling things for me, so I just set the timeout values really 
high (a day or two) and wrote checks to alert me in the unlikely scenario that 
late heartbeats are detected, so I can prevent a split-brain scenario before it 
happens.  Plus the obvious checks for the individual servers in the cluster, 
with the checks for OpsView itself residing on the host that checks the shared 
IP(s).
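For reference, the relevant ha.cf knobs are basically these (the numbers here 
are illustrative, not our exact values):

  # /etc/ha.d/ha.cf (excerpt) -- illustrative values only
  keepalive 30        # send a heartbeat every 30 seconds
  warntime  300       # warn about late heartbeats after 5 minutes
  deadtime  172800    # only declare the peer dead after ~2 days
  initdead  172800    # be equally patient right after startup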


-David


_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users
