Hi David,
Thank you for the detailed description.
I won't run HA across different locations; just putting the nodes on different
switches/racks should be enough to avoid single points of failure in the
hardware/network infrastructure. The rest is redundant enough.
I'm trying to find the best HA solution for the servers. I'll probably use
CentOS, but I'm a little in doubt about the shared filesystem on DRBD: I don't
have much experience with it and hope not to run into problems like
inconsistency of the DRBD partition.
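For keeping an eye on that, what I had in mind is roughly this (a minimal
sketch, assuming DRBD 8.x and a resource named r0):

    # show the replication/connection state on either node
    cat /proc/drbd

    # online verification of the backing devices
    # (needs verify-alg set in drbd.conf)
    drbdadm verify r0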
At the moment I use a simple Nagios installation with an rsync from cron every
5 minutes; those servers don't run Opsview. Migrating to Opsview, I was looking
for something better. Is NFS more stable? The shared partition should be
mounted ONLY by the active master.
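For reference, the current sync is just a cron entry along these lines (the
host name and paths here are only illustrative, not my real layout):

    # on the standby: pull the Nagios tree from the active master every 5 minutes
    */5 * * * * rsync -az --delete nagios-master:/usr/local/nagios/ /usr/local/nagios/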
About the load: right now we have 2000 services across 800 hosts. The new
Opsview installation should monitor roughly 6-7000 services and 2000 hosts,
merging several monitoring infrastructures. To keep most of the load off the
masters, I'll introduce several slaves (I haven't tested cluster slaves yet,
hope they work :)). This should reduce the active checks on the masters.
Last but not least, MySQL will reside on an external MySQL Cluster server.
Any other suggestion is welcome!
Experience is always a good starting point!
Thanks
Simon
David LaPorte wrote on 17/03/2010 15.28:
We currently run OpsView with heartbeat and DRBD on two geographically diverse
servers. We're currently monitoring 702 hosts and 2184 services, and the DRBD
replication uses about 12 Mbit/s of bandwidth, which is fairly consistent. We
keep /usr/local/nagios/, /usr/local/opsview-web/, /usr/local/opsview-reports/,
/etc/httpd/, and /var/lib/mysql/ on our drbd mount point.
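Roughly, the haresources line looks like this (only a sketch; the node name,
shared IP, DRBD resource/device, mount point, and init script names are
placeholders, not our real values):

    node1 IPaddr::192.0.2.10/24/eth0 drbddisk::r0 Filesystem::/dev/drbd0::/drbd::ext3 mysqld httpd opsview

Heartbeat then promotes the DRBD device, mounts it, and starts the services
only on whichever node currently holds the shared IP.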
One trick I used was to mirror the relevant main routing table entries in the
local table with the source address of the shared heartbeat IP, triggered by
heartbeat through a script. This makes sure that checks come from a
predictable source for ACL/config purposes. I think that some
OSs don't let you modify the local routing table but CentOS5 does. The reason
for using the local routing table is that it has a higher preference than the
main routing table and this way you don't have to mess with modifying routes on
hb takeover and give them up on standby. When you lose the IP address, the
routes you added to the local table are dropped automatically causing traffic
to hit the main table and your original routes.
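A sketch of that script, with made-up addresses and networks just to show the
idea:

    #!/bin/sh
    # Hypothetical helper called by heartbeat on takeover: copy the relevant
    # main-table routes into the local table, pinned to the shared heartbeat
    # IP as the source, so checks always originate from it.  No teardown is
    # needed: when the shared IP goes away these entries are dropped
    # automatically and traffic falls back to the main table.
    SHARED_IP=192.0.2.10        # shared heartbeat IP (example value)

    # example route to a monitored private network
    ip route add 10.20.0.0/16 via 192.0.2.1 dev eth0 src $SHARED_IP table local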
Right now, our single points of failure are the locations themselves. We
monitor hosts in public IP space through a default route at one location and
have a static route to the other physical location for monitoring hosts on
private IP space. It's not the worst because if we lose the location where a
majority of the private IP space is located, we won't be able to monitor much
of it anyway. :) The problem is if we lose the location where the default
(and public) route is located then we unnecessarily cut off access to our
ability to monitor hosts in public IP space. We're looking at more dynamic
options for that.
Also, I understand that running heartbeat over this kind of distance is
considered bad practice. I could instead write a script that takes services
down on one server and brings them up on the other, and just run it by hand
for manual failover. But I like heartbeat scheduling things for me, so I just
set the timeout values really high (like a day or two) and wrote checks to
alert me in the unlikely event that late heartbeats are detected, so I can
prevent a split-brain scenario before it happens. Plus the obvious checks for
the individual servers in the cluster, with the checks for OpsView itself
attached to the host that represents the shared IP(s).
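For what it's worth, the relevant ha.cf timers end up looking something like
this (the values, interface, and node names here are only examples, not our
exact config):

    keepalive 30               # heartbeat packet every 30s
    warntime 300               # log a warning on late heartbeats
    deadtime 172800            # ~2 days before automatic failover
    initdead 172800
    udpport 694
    ucast eth0 198.51.100.2
    auto_failback off
    node opsview-a opsview-b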
-David
_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/lists/listinfo/opsview-users