We currently run OpsView with heartbeat and DRBD on two geographically diverse servers. We're monitoring 702 hosts and 2184 services, and DRBD replication uses a fairly consistent ~12 Mbit/s of bandwidth. We keep /usr/local/nagios/, /usr/local/opsview-web/, /usr/local/opsview-reports/, /etc/httpd/, and /var/lib/mysql/ on our DRBD mount point.
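For reference, a minimal DRBD resource definition for that mount point might look like the sketch below (this is not our actual config; the resource name, hostnames, devices, and addresses are all hypothetical, and protocol A is just one reasonable choice for an async WAN link):

```
# /etc/drbd.conf (DRBD 8.x syntax) -- illustrative only
resource opsview {
  protocol A;                 # asynchronous replication, tolerant of WAN latency
  on nagios1 {                # hypothetical node name
    device    /dev/drbd0;
    disk      /dev/sdb1;      # hypothetical backing device
    address   192.0.2.11:7789;
    meta-disk internal;
  }
  on nagios2 {                # hypothetical node name
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   198.51.100.11:7789;
    meta-disk internal;
  }
}
```

The /dev/drbd0 device is then formatted once and mounted at the shared mount point holding the Nagios, Opsview, Apache, and MySQL directories listed above.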
One trick I used was to mirror the relevant main routing table entries into the local table, with the source address set to the shared heartbeat IP; this is triggered by heartbeat through a script. The point is to make sure checks come from a predictable source address for ACL/config purposes. I think some OSs don't let you modify the local routing table, but CentOS 5 does. The reason for using the local table is that it has a higher preference than the main table, so you don't have to add routes on heartbeat takeover and remove them again on standby: when you lose the IP address, the routes you added to the local table are dropped automatically, and traffic falls back to the main table and your original routes.

Right now, our single points of failure are the locations themselves. We monitor hosts in public IP space through a default route at one location, and have a static route to the other physical location for monitoring hosts in private IP space. That's not the worst arrangement, because if we lose the location where the majority of the private IP space lives, we won't be able to monitor much of it anyway. :) The problem is that if we lose the location holding the default (and public) route, we unnecessarily cut off our ability to monitor hosts in public IP space. We're looking at more dynamic options for that.

Also, I understand that running heartbeat over this kind of distance is considered bad practice. I could write a script to take services down and bring them up manually, and just run it on each server for manual failover. But I like heartbeat scheduling things for me, so I set the timeout values really high (like a day or two) and wrote checks to alert me in the unlikely event that late heartbeats are detected, so I can head off a split-brain scenario before it happens.
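The local-table trick can be sketched as a small script run by heartbeat on takeover (a sketch, not my exact script; the shared IP, monitored subnet, and gateway below are hypothetical placeholders):

```
#!/bin/sh
# Sketch of a heartbeat takeover hook -- all values are hypothetical.
SHARED_IP=192.0.2.10         # the shared heartbeat IP
MONITORED_NET=10.20.0.0/16   # privately addressed hosts we monitor
GW=192.0.2.1                 # the same next hop the main table uses

# Mirror the main-table route into the "local" table with an explicit
# source address, so checks always originate from the shared IP.
# The local table is consulted before the main table, so nothing in the
# main table needs to change on takeover.
ip route add "$MONITORED_NET" via "$GW" src "$SHARED_IP" table local

# Nothing to undo on failover: when heartbeat removes $SHARED_IP, the
# kernel drops this route automatically and traffic falls back to the
# main table's original routes.
```

Because the route dies with the address, there's no cleanup script to run on the standby node, which is the whole appeal of using the local table rather than the main one.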
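The very-high-timeout approach translates into ha.cf roughly like this (illustrative values, not our exact config; node names are hypothetical):

```
# /etc/ha.d/ha.cf -- illustrative values only
keepalive 60        # send a heartbeat every 60 seconds
warntime 300        # log a warning on late heartbeats early, so a
                    # Nagios check on the logs can alert a human
deadtime 172800     # only declare the peer dead after ~2 days
auto_failback off
node nagios1 nagios2   # hypothetical node names
```

With deadtime this high, heartbeat effectively never fails over on its own; the warntime log entries plus a check watching for them give a human time to investigate (or force a failover) before any split brain can occur.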
Plus the obvious checks for the individual servers in the cluster, with checks for OpsView itself residing in the host that checks the shared IP(s).

-David

On Mar 17, 2010, at 9:03 AM, Simone Felici wrote:

> Hi,
>
> Are there some experiences on the community with OPSView Master in High
> Availability?
> I've read the documentation on
> http://docs.opsview.org/doku.php?id=opsview-community:hamaster but here it's
> used the old version of heartbeat.
> Meanwhile there is openais/corosync, drbd instead of NFS, and so forth.
> I ask only if someone has found a good and tested solution to follow and test :)
> In addiction some informations on how much hosts/services are monitored...
>
> Thanks a lot to all for the attention,
>
> Simon
> _______________________________________________
> Opsview-users mailing list
> [email protected]
> http://lists.opsview.org/lists/listinfo/opsview-users
