Hi,
it took me a while to reply, mainly because we were redesigning the
cluster feature over the past weeks to solve some long-standing issues;
we weren't satisfied with the previous implementation either.
On 27.04.2014 16:40, Max Schubert wrote:
Hi all,
We've been very successful with a custom built distributed Nagios
architecture at my job that consists of:
* Central config
* DB with pollers defined for the config
* Automation that:
* Imports the config into a DB
* Divides it into smaller Nagios configs by ( hosts / pollers ) -
each poller getting as equal a distribution of hosts as we can do
* Writes out an objects.cache for each poller including all related
config deps
* pushes the smaller configs out to each poller over scp / ssh
* restarts all pollers
* Pollers stream data to an instance-specific DB ( so X pollers -> 1 DB )
* Centralized UI for viewing all results and doing command and control
on them without caring where the poller physically is
We've got 7 clusters working this way ( each cluster has its own DB
by team or project and its own UI ) and the process is totally
self-service. Teams maintain their own configs, check them into SVN,
then run a deploy command to push out to their pollers. They
maintain the pollers, we maintain the backend arch for notifications,
metrics streaming, and the UIs.
This system is monitoring over 100k nodes and 400k active service
checks every 5 minutes. Works great for static, "pet" architectures
where there are lots of VMs or hardware hosts that are maintained and
cared for and that don't change often.
Not so good for dynamic envs!
Was tweeting a little with Michael ( thanks Michael! ) just an hour
or so ago about how to use our knowledge and experience with Nagios /
Icinga to make this work for a more dynamic env - cloud VMs or Docker
images, where hostnames and IP addresses are dynamically generated and
the instance is used once, then thrown away.
My first thoughts on how this could work:
* We're moving to HA proxy ( that will be our "pet" host in our new
architectures - our current monitoring arch will work fine for
monitoring those ).
* Each HA proxy will have to be bounced through automation ( zookeeper
most likely ) when a node is brought up or down and have its config
re-written to add / delete nodes
* At that point we could also add / delete nodes from a mini Icinga
instance on each proxy that would serve as the health checker for the
nodes and also then stream results ( being intentionally generic here
) to our back end for notification and alarming.
* The configs would be tiny and the host portions of the configs would
be maintained locally - we'd just have to push new service checks /
host groups etc. as needed
What kinds of approaches do you all take with these more dynamic
environments? If you don't use Icinga for that layer of monitoring,
what do you use?
I think what will surely be deemed our "classic" approach will
continue to work fine for embedded devices / appliances as there's no
other choice there :p.
Basically you should look into the cluster model which will be released
with 2.0 and its zone concept. The previously discussed idea on Twitter
works in a similar fashion, but the zone model adds the capability of
doing load distribution and high availability directly in one zone -
which could be one check satellite, multiple masters, or multiple
checker instances, for instance.
http://docs.icinga.org/icinga2/snapshot/chapter-4.html#distributed-monitoring-and-high-availability
Basically your setup could be divided into such zones; on your config
master it could look like this (assuming you want HA: 2 nodes in the
zone elect their zone master - the one which exclusively runs the DB IDO
feature until a failover condition is met, at which point the secondary
node takes over DB IDO).
Note: The default port is 5665 if not given. Required for endpoints and
the ApiListener.
# icinga2-enable-feature api
# vim /etc/icinga2/features-enabled/api.conf
(add 'accept_config = true' on all nodes)
# service icinga2 restart
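For reference, the edited features-enabled/api.conf could then look
roughly like this (a sketch; the pki paths shown are the usual defaults,
so adjust them to your setup):

```
/* /etc/icinga2/features-enabled/api.conf (sketch) */
object ApiListener "api" {
  cert_path = SysconfDir + "/icinga2/pki/" + NodeName + ".crt"
  key_path = SysconfDir + "/icinga2/pki/" + NodeName + ".key"
  ca_path = SysconfDir + "/icinga2/pki/ca.crt"

  /* allow this node to accept synced zone configuration */
  accept_config = true
}
```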
Regarding the master zone - both nodes have DB IDO enabled and configured.
object Endpoint "master1" { host = "192.168.2.30" }
object Endpoint "master2" { host = "192.168.2.40" }
object Zone "config-ha-master" {
endpoints = [ "master1", "master2" ]
}
The pollers could work in two separate scenarios:
1) a generic poller zone where all involved nodes receive the same
configuration and do load-balanced checks sharing the load
2) smaller poller zones where only a few or one poller act as check
satellite.
ad 1)
If you're planning to implement 1), all pollers must be able to see each
other in order to replicate all events and to determine how many
checkers are available to calculate the check load distribution among
themselves (modulo n). That isn't always possible and rather makes sense
in your local network.
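To illustrate the idea (this is only a simplified model of modulo-n
distribution, not Icinga 2's actual implementation), each of the n
connected checkers takes the checks whose index modulo n matches its
own position:

```python
# Simplified sketch of modulo-n check distribution among zone members.
# Every node sees the same ordered lists, so each can compute its own
# share independently, without a central scheduler.
def assign_checks(checks, checkers):
    assignment = {c: [] for c in checkers}
    n = len(checkers)
    for i, check in enumerate(checks):
        # check i goes to checker (i mod n)
        assignment[checkers[i % n]].append(check)
    return assignment

checks = ["host%d!ping" % i for i in range(7)]
print(assign_checks(checks, ["p1", "p2", "p3"]))
```

If a poller drops out, n shrinks and the remaining nodes re-derive their
shares - which is why every node needs to see every other node.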
It could look like this:
object Endpoint "p1" { host = "192.168.3.10" }
object Endpoint "p2" { host = "192.168.3.20" }
object Endpoint "p3" { host = "192.168.3.30" }
object Zone "pollers" {
endpoints = [ "p1", "p2", "p3" ]
parent = "config-ha-master"
}
The configuration on the master would look like the following in
/etc/icinga2 (note: zone names must match the directory names):
zones.d/
  config-ha-master/
    local.conf
  pollers/
    checks.conf
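A zones.d/pollers/checks.conf could then hold the checks that zone
should execute - a minimal sketch, with the host name and address made
up for the example:

```
/* zones.d/pollers/checks.conf (sketch; "www1" and its address are
 * placeholders) */
object Host "www1" {
  address = "192.168.3.100"
  check_command = "hostalive"
}

object Service "http" {
  host_name = "www1"
  check_command = "http"
}
```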
ad 2)
If the pollers are in their own (network) location, they should get
a zone of their own. Their local zone and endpoint configuration must
only see the parent "config-ha-master" zone and all involved endpoints.
That's important for the general connection attempts and further
communication.
Imagine that p1 and p2 each form a separate zone, while p3 and p4 do
some load balancing in a third zone.
object Endpoint "p1" { host = "192.168.3.10" }
object Endpoint "p2" { host = "192.168.3.20" }
object Endpoint "p3" { host = "192.168.3.30" }
object Endpoint "p4" { host = "192.168.3.40" }
object Zone "poller1" {
endpoints = [ "p1" ]
parent = "config-ha-master"
}
object Zone "poller2" {
endpoints = [ "p2" ]
parent = "config-ha-master"
}
object Zone "poller3" {
endpoints = [ "p3", "p4" ]
parent = "config-ha-master"
}
That could be organized like the following in /etc/icinga2 on the config
master. (Note: Multiple instances elect their own active zone master
which takes care of the primary message routing, and also runs the DB
IDO HA feature. The configuration is synced among all nodes in the zone
trusting each other in /var/lib/icinga2/api/zones...)
zones.d/
  config-ha-master/
    local.conf
  poller1/
    checks.conf
  poller2/
    checks.conf
  poller3/
    shared-checks.conf
Regarding the dynamic environment - you would still organize your
configuration on the master in different zones. Each zone only gets the
configuration from its directory and nothing else (poller1 doesn't see
anything from the poller2 zone, for example).
If you are planning to dynamically add a new poller for checks to an
existing zone, you need to do the following:
- set up the instance itself: enable the ApiListener, install the SSL
certificates
- add a new Endpoint config object, and deploy that to all zone pollers
and the master
- while at it, add the new endpoint to that zone config, and deploy that
to all zone pollers and the master
- reload the zone pollers and the master to see the new instance
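The configuration side of those steps boils down to a small change - a
sketch, with the new endpoint "p5" and its address invented for the
example:

```
/* sketch: deployed to the master and all poller3 zone members */
object Endpoint "p5" { host = "192.168.3.50" }  /* the new poller */

/* the existing zone definition is updated in place */
object Zone "poller3" {
  endpoints = [ "p3", "p4", "p5" ]
  parent = "config-ha-master"
}
```

After the reload, the zone members recalculate the check distribution
to include the new node.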
The reload, by the way, now behaves like a real reload thanks to Gerd's
work - it forks a child process that verifies the configuration, and if
everything went fine, the old parent process is killed and the check
results/history are read from the state file in order not to lose any
important information. That way the reload takes seconds (not noticeable
from the shell).
The overall question is - how many pollers do you really need with
Icinga 2? It would be interesting to throw all the 100k hosts and 400k
services onto one single box with plenty of hardware and try to
scale it for high performance.
But sometimes pollers truly solve the problem of a real distributed
architecture not available with Nagios 3/4 or Icinga 1.x. By 'real' I
mean at least check load distribution and integrated replication of
events. HA features and advanced zone capabilities like configuration
synchronisation are just a bonus, because we can do it with Icinga 2.
To get an idea of what I am talking about, you can try the simple 2-node
cluster scenario I've been building for demo cases (inherited from the
original Netways CeBIT demo setup), available as Vagrant boxes.
Details at
https://git.icinga.org/?p=icinga-vagrant.git;a=blob;f=icinga2x-cluster/README;hb=HEAD
kind regards,
Michael
--
DI (FH) Michael Friedrich
[email protected] || icinga open source monitoring
https://twitter.com/dnsmichi || lead core developer
[email protected] || https://www.icinga.org/team
irc.freenode.net/icinga || dnsmichi
_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users