Hi,

We are currently deploying Shinken (with Thruk and the Livestatus broker module) for one of our customers, to try to replace and improve their legacy monolithic Nagios solution. The Shinken installation is nearly finished and tests are in progress.
We noticed that the availability/trend reports generated with Thruk via the Livestatus broker module contain strange values that do not reflect reality. So I started to play a bit with your re-implementation of the Livestatus module, to look at the scheduler and broker code, and I have a few questions about it.

As I understand it, the Livestatus broker module gets all "monitoring log messages" (host/service status changes, notification logs...) from the scheduler(s) it connects to, and stores them in its own SQLite database to keep all the historical data needed to generate availability/trend reports and more. If I read the code correctly, all pending "monitoring log messages" that must be sent to brokers are stored by the scheduler in Brok instances and queued in a simple Python dict. In the same way, all broks received by the broker are also queued in a simple Python dict. And:

* The scheduler queue is emptied when a broker (the first one) reads the pending broks, so using several brokers on one scheduler seems impossible for the moment (not such a big problem for me right now).
* The scheduler and broker queues are only kept in memory, so if the scheduler or the broker stops or crashes, all the pending broks are lost (more problematic).
* When the scheduler queue is full, the oldest broks start to be dropped. So if the scheduler <-> broker link is lost for a long time, it seems we will lose log messages and potentially very important monitoring data (like host/service state changes). If so, nearly all the work done by the scheduler while the link is down goes to /dev/null (checks are done, but their results are lost; in fact only the actions performed by reactionners will actually have an effect).

My questions/remarks:

* Will the Shinken Livestatus module be periodically upgraded to follow the original Livestatus progression (new features...)?
* What will the Livestatus SQLite database look like after one year in a big monitoring architecture? A big database and slow queries?
* What about using Livestatus on top of Simple log, or perhaps storing the logs in a CouchDB/MongoDB database instead of SQLite or any relational database?
* I think it is important (in a production context) to secure the brok exchange process between scheduler and broker so that it becomes stop-safe, crash-safe and link-loss-safe, because I fear the reason we get some curious results in our availability/trend reports is that we lost some important broks (like host/service state change messages) while restarting daemons. Is such a feature planned on the roadmap? (I have put a small sketch of what I mean below my signature.)

Thank you for all the work you've done ;)

Regards,

Laurent Guyon
Adelux
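P.S. To make the last point more concrete, here is a minimal, self-contained sketch of what I mean by a stop-safe brok queue. This is not Shinken code: the DiskBrokQueue class, the spool directory path and the simplified Brok stand-in are all hypothetical, just to illustrate the idea that a brok is written to disk before being handed over, and only removed once the broker has acknowledged it, so a daemon restart or a lost scheduler <-> broker link does not silently drop it.

    import os
    import pickle
    import uuid

    class Brok(object):
        """Simplified stand-in for a Shinken brok (a type plus its data)."""
        def __init__(self, type, data):
            self.id = uuid.uuid4().hex
            self.type = type
            self.data = data

    class DiskBrokQueue(object):
        """Hypothetical stop-safe brok queue: every pending brok is also
        written to a spool directory, and deleted only after the broker
        has acknowledged that it stored it."""

        def __init__(self, spool_dir):
            self.spool_dir = spool_dir
            if not os.path.exists(spool_dir):
                os.makedirs(spool_dir)

        def push(self, brok):
            # Write the brok to disk before it is considered "queued".
            path = os.path.join(self.spool_dir, brok.id)
            with open(path, 'wb') as f:
                pickle.dump(brok, f)

        def pending(self):
            # Reload every brok still on disk (e.g. after a restart),
            # oldest first.
            names = sorted(os.listdir(self.spool_dir),
                           key=lambda n: os.path.getmtime(
                               os.path.join(self.spool_dir, n)))
            broks = []
            for name in names:
                with open(os.path.join(self.spool_dir, name), 'rb') as f:
                    broks.append(pickle.load(f))
            return broks

        def ack(self, brok):
            # The broker confirmed it stored this brok: drop it from disk.
            path = os.path.join(self.spool_dir, brok.id)
            if os.path.exists(path):
                os.remove(path)

    # Usage: the scheduler pushes, the broker reads pending() and acks.
    queue = DiskBrokQueue('/var/lib/shinken/brok-spool')
    queue.push(Brok('host_check_result', {'host_name': 'srv1', 'state': 'DOWN'}))
    for brok in queue.pending():
        # ... send to the broker, and once it has stored the brok:
        queue.ack(brok)

Of course a MongoDB/CouchDB collection or an append-only flat file would work just as well; the only important point for us is that a brok is not forgotten before someone has really stored it.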