Hi,

it took me a while, mainly because we have been redesigning the cluster feature over the past weeks to solve some long-standing issues - we weren't satisfied with the previous implementation either.

On 27.04.2014 16:40, Max Schubert wrote:
Hi all,

We've been very successful with a custom built distributed Nagios architecture at my job that consists of:
* Central config
* DB with pollers defined for the config
* Automation that:
  * Imports the config into a DB
  * Divides it into smaller Nagios configs by ( hosts / pollers ) - each poller getting as equal a distribution of hosts as we can do
  * Writes out an objects.cache for each poller including all related config deps
  * pushes the smaller configs out to each poller over scp / ssh
  * restarts all pollers
* Pollers stream data to an instance-specific DB ( so X pollers -> 1 DB )
* Centralized UI for viewing all results and doing command and control on them without caring where the poller physically is

We've got 7 clusters working this way ( each cluster has its own DB by team or project and its own UI ) and the process is totally self-service. Teams maintain their own configs, check them into SVN, then run a deploy command to push out to their pollers. They maintain the pollers, we maintain the backend arch for notifications, metrics streaming, and the UIs.

This system is monitoring over 100k nodes and 400k active service checks every 5 minutes. Works great for static, "pet" architectures where there are lots of VMs or hardware hosts that are maintained and cared for and that don't change often.

Not so good for dynamic envs!

Was tweeting a little with Michael ( thanks Michael! ) just an hour or so ago about how to use our knowledge and experience with Nagios / Icinga to make this work for a more dynamic env - cloud VMs or Docker images, where hostnames and IP addresses are dynamically generated and the instance is used once and then trashed.

My first thoughts on how this could work:
* We're moving to HA proxy ( that will be our "pet" host in our new architectures - our current monitoring arch will work fine for monitoring those ).
* Each HA proxy will have to be bounced through automation ( zookeeper most likely ) when a node is brought up or down and have its config re-written to add / delete nodes
* At that point we could also add / delete nodes from a mini Icinga instance on each proxy that would serve as the health checker for the nodes and also then stream results ( being intentionally generic here ) to our back end for notification and alarming.
* The configs would be tiny and the host portions of the configs would be maintained locally only - we'd just have to push new service checks / host groups etc as needed

What kinds of approaches do you all take with these more dynamic environments? If you don't use Icinga for that layer of monitoring, what do you use?

I think what will surely be deemed our "classic" approach will continue to work fine for embedded devices / appliances as there's no other choice there :p.

Basically you should look into the cluster model which is to be released with 2.0 and its zone concept. The previously discussed idea on Twitter works in a similar fashion, but the zone model adds the capability of doing load distribution and high availability directly in one zone - which could be one check satellite, multiple masters, or multiple checker instances, for example.

http://docs.icinga.org/icinga2/snapshot/chapter-4.html#distributed-monitoring-and-high-availability

Basically your setup could be divided into such zones, and could look like the following on your config master (assuming you want HA, making it 2 nodes in the zone electing their zone master - the one which exclusively runs the DB IDO feature, until a failover condition is met and the secondary node takes over DB IDO).

Note: The default port is 5665 if not given. Required for endpoints and the ApiListener.

# icinga2-enable-feature api
# vim /etc/icinga2/features-enabled/api.conf
(add 'accept_config = true' on all nodes)
# service icinga2 restart
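After enabling the feature, /etc/icinga2/features-enabled/api.conf contains the ApiListener object. A minimal sketch of what it could look like with accept_config set - the certificate paths follow the defaults of the setup scripts and may differ in your environment:

```
object ApiListener "api" {
  // certificate paths as generated by the default setup - adjust to your environment
  cert_path = SysconfDir + "/icinga2/pki/" + NodeName + ".crt"
  key_path = SysconfDir + "/icinga2/pki/" + NodeName + ".key"
  ca_path = SysconfDir + "/icinga2/pki/ca.crt"

  // accept zone configuration synced from the parent zone
  accept_config = true
}
```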

Regarding the master zone - both nodes have DB IDO enabled and configured.

object Endpoint "master1" { host = "192.168.2.30" }
object Endpoint "master2" { host = "192.168.2.40" }

object Zone "config-ha-master" {
  endpoints = [ "master1", "master2" ]
}
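Since both master nodes run DB IDO, each of them needs the IDO feature configured as well. A minimal sketch for MySQL - user, password, host and database are placeholders for your actual credentials:

```
library "db_ido_mysql"

object IdoMysqlConnection "ido-mysql" {
  user = "icinga"       // placeholder
  password = "icinga"   // placeholder
  host = "localhost"    // placeholder
  database = "icinga"   // placeholder
}
```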

The pollers could work in 2 separate scenarios:

1) a generic poller zone where all involved nodes receive the same configuration and do load-balanced checks sharing the load
2) smaller poller zones where only a few or one poller act as check satellite

ad 1)

If you're planning to implement 1), all pollers must be able to see each other in order to replicate all events and to determine how many checkers are available for calculating the check load distribution among themselves (modulo n). That isn't always possible and makes most sense in your local network.

It could look like this:

object Endpoint "p1" { host = "192.168.3.10" }
object Endpoint "p2" { host = "192.168.3.20" }
object Endpoint "p3" { host = "192.168.3.30" }

object Zone "pollers" {
  endpoints = [ "p1", "p2", "p3" ]
  parent = "config-ha-master"
}

The configuration on the master would look like the following in /etc/icinga2 (note: zone names must match the directory names)

zones.d/
  config-ha-master/
    local.conf
  pollers/
    checks.conf
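checks.conf in the pollers directory would then hold the objects which the zone members load-balance among themselves. A hypothetical example - the host name, address and the use of the ITL's hostalive/ping4 check commands are assumptions about your setup:

```
object Host "app-server-01" {
  address = "192.168.3.101"     // made-up address
  check_command = "hostalive"   // requires the ITL to be included
}

object Service "ping4" {
  host_name = "app-server-01"
  check_command = "ping4"       // requires the ITL to be included
}
```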


ad 2)

If the pollers live in their own (network) location, they should get a zone of their own. Their local zone and endpoint configuration must only see the parent "config-ha-master" zone and all involved endpoints. That's important for the general connection attempts and further communication.

Imagine that p1 and p2 each form a separate zone, while p3 and p4 do some load balancing in a third zone.

object Endpoint "p1" { host = "192.168.3.10" }
object Endpoint "p2" { host = "192.168.3.20" }
object Endpoint "p3" { host = "192.168.3.30" }
object Endpoint "p4" { host = "192.168.3.40" }


object Zone "poller1" {
  endpoints = [ "p1" ]
  parent = "config-ha-master"
}

object Zone "poller2" {
  endpoints = [ "p2" ]
  parent = "config-ha-master"
}

object Zone "poller3" {
  endpoints = [ "p3", "p4" ]
  parent = "config-ha-master"
}

That could be organized like the following in /etc/icinga2 on the config master. (Note: Multiple instances elect their own active zone master which takes care of the primary message routing, and also runs the DB IDO HA feature. The configuration is synced among all nodes in the zone trusting each other in /var/lib/icinga2/api/zones...)

zones.d/
  config-ha-master/
    local.conf
  poller1/
    checks.conf
  poller2/
    checks.conf
  poller3/
    shared-checks.conf


Regarding the dynamic environment - you would still organize your configuration on the master in different zones. The zones only get the configuration from their directory and nothing else (poller1 doesn't see anything from poller2 zone for example).

If you are planning to dynamically add a new poller for checks to an existing zone, you need to do the following:

- add the instance itself: enable the ApiListener, install the SSL certificates
- add a new Endpoint config object, and deploy that to all zone pollers and the master
- while at it, add the new endpoint to that Zone config object, and deploy that to all zone pollers and the master
- reload the zone pollers and the master to pick up the new instance
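As an example, adding a hypothetical poller "p5" (name and address made up) to the load-balanced "poller3" zone from scenario 2 would boil down to deploying these updated objects to all zone members and the master:

```
object Endpoint "p5" { host = "192.168.3.50" }   // hypothetical new poller

object Zone "poller3" {
  endpoints = [ "p3", "p4", "p5" ]   // p5 added to the existing zone
  parent = "config-ha-master"
}
```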

The reload, by the way, now behaves like a real reload thanks to Gerd's work - it forks a child which verifies the configuration, and if everything went fine, the old parent process is killed and the check results/history are read from the state file in order not to lose any important information. That way the reload takes seconds (not noticeable from the shell).

The overall question is - how many pollers do you really need with Icinga 2? It would be interesting to throw all the 100k hosts and 400k services onto one single box with plenty of hardware and try to scale it for high performance.

But sometimes pollers truly solve the problem of a real distributed architecture not available with Nagios 3/4 or Icinga 1.x. By 'real' I mean at least check load distribution and integrated replication of events. HA features and advanced zone capabilities like configuration synchronisation are just a bonus, because we can do it with Icinga 2.

To get an idea of what I am talking about, you can try the simple 2-node cluster scenario I've been building for demo cases (inherited from the original Netways CeBIT demo setup), available as Vagrant boxes.

Details at https://git.icinga.org/?p=icinga-vagrant.git;a=blob;f=icinga2x-cluster/README;hb=HEAD



kind regards,
Michael


--
DI (FH) Michael Friedrich

[email protected]  || icinga open source monitoring
https://twitter.com/dnsmichi || lead core developer
[email protected]       || https://www.icinga.org/team
irc.freenode.net/icinga      || dnsmichi

_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users
