Hi all,

We've been very successful with a custom built distributed Nagios
architecture at my job that consists of:
* Central config
* DB with pollers defined for the config
* Automation that:
  * Imports the config into a DB
  * Divides it into smaller Nagios configs by ( hosts / pollers ) - each
poller getting as equal a distribution of hosts as we can manage
  * Writes out an objects.cache for each poller including all related
config deps
  * pushes the smaller configs out to each poller over scp / ssh
  * restarts all pollers
* Pollers stream data to an instance-specific DB ( so X pollers -> 1 DB )
* Centralized UI for viewing all results and doing command and control on
them without caring where the poller physically is
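To make the "divide it into smaller configs" step concrete, here's a minimal sketch of how that split could look. This is illustrative only - the function and host/poller names are hypothetical, not our actual tooling:

```python
def divide_hosts(hosts, pollers):
    """Round-robin hosts across pollers so each poller gets as even a
    share as possible (sizes differ by at most one)."""
    assignments = {p: [] for p in pollers}
    for i, host in enumerate(sorted(hosts)):
        assignments[pollers[i % len(pollers)]].append(host)
    return assignments

# Example: 5 hosts across 2 pollers -> one poller gets 3, the other 2.
assignments = divide_hosts(
    ["web01", "web02", "db01", "db02", "cache01"],
    ["poller-a", "poller-b"],
)
```

Each per-poller host list would then be rendered into its own Nagios config ( plus related deps ) and pushed out over scp / ssh as described above.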

We've got 7 clusters working this way ( each cluster has its own DB by
team or project and its own UI ) and the process is totally self-service.
Teams maintain their own configs, check them into SVN, then run a deploy
command to push out to their pollers.  They maintain the pollers, we
maintain the backend arch for notifications, metrics streaming, and the UIs.

This system is monitoring over 100k nodes and 400k active service checks
every 5 minutes.  Works great for static, "pet" architectures where there
are lots of VMs or hardware hosts that are maintained and cared for and
that don't change often.

Not so good for dynamic envs!

Was tweeting a little with Michael ( thanks Michael! ) just an hour or so
ago about how to use our knowledge and experience with Nagios / Icinga to
make this work for a more dynamic env - cloud VMs or Docker images, where
hostnames and IP addresses are dynamically generated and instances are
used once and then thrown away.

My first thought on how this could work.
* We're moving to HAProxy ( that will be our "pet" host in our new
architectures - our current monitoring arch will work fine for monitoring
those ).
* Each HAProxy will have to be bounced through automation ( zookeeper most
likely ) when a node is brought up or down and have its config re-written
to add / delete nodes
* At that point we could also add / delete nodes from a mini Icinga
instance on each proxy that would serve as the health checker for the nodes
and also then stream results ( being intentionally generic here ) to our
back end for notification and alarming.
* The configs would be tiny and the host portions of the configs would be
maintained local only - we'd just have to push new service checks / host
groups etc as needed
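A rough sketch of what that per-proxy piece might look like: when the automation fires on a node change, regenerate a tiny local host config for the mini Icinga instance from the current node list. Everything here is hypothetical ( template, paths, names ) - just illustrating the shape of the idea:

```python
# Render Icinga 1.x-style host objects from the proxy's current backend
# node list.  Doubled braces in the template escape str.format().
HOST_TEMPLATE = """define host {{
    use         generic-host
    host_name   {name}
    address     {addr}
}}
"""

def render_hosts(nodes):
    """nodes is a list of (hostname, address) tuples taken from whatever
    the automation ( zookeeper etc. ) reports as live backends."""
    return "".join(HOST_TEMPLATE.format(name=n, addr=a) for n, a in nodes)

config = render_hosts([("app-7f3a", "10.0.12.5"), ("app-9c1b", "10.0.12.9")])
# In the real flow you'd write this to the local Icinga conf dir and
# reload the instance, e.g.:
#   open("/etc/icinga/objects/dynamic_hosts.cfg", "w").write(config)
#   then reload the local Icinga service
```

The host objects stay purely local to each proxy; only shared pieces like service checks and hostgroups would still need to be pushed centrally, as noted above.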

What kinds of approaches do you all take with these more dynamic
environments?  If you don't use Icinga for that layer of monitoring, what
do you use?

I think what will surely be deemed our "classic" approach will continue to
work fine for embedded devices / appliances, as there's no other choice
there :p.

- Max
_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users