Hi,

We are currently deploying Shinken (with Thruk and the Livestatus broker
module) for one of our customers, to replace and improve their legacy
monolithic Nagios setup. The Shinken installation is nearly finished and
tests are in progress.

We noticed that the availability/trend reports generated with Thruk via
the Livestatus broker module contain strange values that do not reflect
reality.

So I started to play a bit with your re-implementation of the Livestatus
module and to look at the scheduler and broker code, and I have a few
questions about it.


As I understand it, the Livestatus broker module gets all the "monitoring
log messages" (host/service status changes, notification logs...) from the
scheduler(s) it connects to, and stores them in its own sqlite database to
keep all the historical data needed to generate availability/trend reports
and more.
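
For what it is worth, this is roughly how we peek at the stored log
entries while debugging the reports. The port, column list and timestamp
below are just from our test setup, so take them as assumptions rather
than as a reference:

# Quick debugging helper: ask the Livestatus module for stored log entries.
# The host/port, column list and timestamp are from our test setup only;
# adjust them to your own livestatus module configuration.
import socket

query = ("GET log\n"
         "Columns: time type state host_name service_description plugin_output\n"
         "Filter: time >= 1293836400\n"
         "\n")                                  # blank line ends the query

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('127.0.0.1', 50000))
s.sendall(query.encode('ascii'))
response = s.makefile().read()                  # one log entry per line
s.close()

Thruk builds its availability/trend reports from this same kind of
historical log data, which is why missing entries show up directly in the
reports.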

If I understand the code correctly, all pending "monitoring log messages"
that must be sent to brokers are wrapped by the scheduler in brok
instances and queued in a simple Python dict. In the same way, all broks
received by the broker are also queued in a simple Python dict (a toy
sketch of how I read it is just after the list below). And:

 * The scheduler queue is emptied as soon as a broker (the first one)
reads the pending broks, so using several brokers on one scheduler seems
impossible for now (not such a big problem for me at the moment)

 * The scheduler and broker queues are only kept in memory, so if the
scheduler or the broker stops or crashes, are all the pending broks
lost? (more problematic)

 * When the scheduler queue is full, the oldest broks start to be
dropped. So if the scheduler <-> broker link is lost for a long time, it
seems we will lose log messages and potentially very important monitoring
data (like host/service state changes). If so, nearly all the work done
by the scheduler while the link is down goes to /dev/null (checks are
run, but their results are lost; in the end only the actions performed by
the reactionners actually happen).
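
To make sure I am reading it right, here is a toy sketch of the pattern
as I see it; the class and method names and the maximum size are made up
for illustration, this is of course not the real Shinken code:

# Toy sketch of the queue/hand-off pattern as I understand it; the names
# and the maximum size are invented, this is not the real Shinken code.

class Scheduler(object):
    MAX_BROKS = 5                      # when full, the oldest broks are dropped

    def __init__(self):
        self.broks = {}                # pending broks, kept in memory only
        self._next_id = 0

    def add_brok(self, data):
        self._next_id += 1
        self.broks[self._next_id] = data
        if len(self.broks) > self.MAX_BROKS:
            del self.broks[min(self.broks)]    # silently drop the oldest brok

    def get_broks(self):
        # the first broker that calls this drains the queue for everyone
        pending, self.broks = self.broks, {}
        return pending

class Broker(object):
    def __init__(self):
        self.broks = {}                # memory only as well: lost on stop/crash

    def fetch_from(self, scheduler):
        self.broks.update(scheduler.get_broks())

If that picture is correct, then a stop/crash of either daemon, or a long
scheduler <-> broker outage, silently loses whatever was in those dicts,
which would explain our reports.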


My questions/remarks:

 * Will the Shinken Livestatus module be periodically updated to follow
the original Livestatus as it progresses (new features...)?

 * What will the Livestatus sqlite database look like after a year in a
big monitoring architecture? A huge database and slow queries?

 * What about running Livestatus on top of the Simple log module, or
perhaps storing the logs in a couchdb/mongodb database instead of sqlite
or any other relational database? (see the small pymongo illustration
after this list)

 * I think it is important (in a production context) to make the brok
exchange between scheduler and broker stop-safe, crash-safe and
link-loss-safe, because I fear the curious results in our
availability/trend reports come from important broks (like host/service
state-change messages) lost while restarting the daemons. Is such a
feature planned on the roadmap? (a rough sketch of what I have in mind
follows this list)
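
To be concrete about that last point, something as simple as spooling the
pending brok dict to disk would already cover the stop/crash cases. This
is only a rough sketch with a made-up path, not a patch proposal:

# Rough sketch (not Shinken code): persist the pending broks to a spool
# file so that a daemon restart does not lose them. The path is made up.
import os
import pickle

SPOOL = '/var/lib/shinken/pending_broks.pkl'

def save_pending(broks):
    # Write to a temporary file first, then rename: the rename is atomic,
    # so a crash in the middle of the dump cannot corrupt the spool file.
    tmp = SPOOL + '.tmp'
    with open(tmp, 'wb') as f:
        pickle.dump(broks, f)
    os.rename(tmp, SPOOL)

def load_pending():
    # Reload whatever was still pending before the last stop or crash.
    if not os.path.exists(SPOOL):
        return {}
    with open(SPOOL, 'rb') as f:
        return pickle.load(f)

save_pending() would be called on clean shutdown (and perhaps
periodically), load_pending() at startup. The link-loss case is harder:
the scheduler would also have to keep broks until the broker acknowledges
them instead of dropping the oldest ones when the queue is full.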
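And about the couchdb/mongodb idea, I am only thinking of something along
these lines with pymongo, one document per log entry; the database and
collection names are of course just an illustration:

# Illustration only: store each monitoring log entry as a MongoDB document
# instead of a row in sqlite. Database/collection names are made up.
from pymongo import MongoClient

logs = MongoClient('localhost', 27017)['shinken']['logs']

logs.insert_one({
    'time': 1293836400,
    'type': 'SERVICE ALERT',
    'host_name': 'srv1',
    'service_description': 'HTTP',
    'state': 2,
    'plugin_output': 'CRITICAL - connection refused',
})

# A report then becomes a simple range query on 'time':
for entry in logs.find({'time': {'$gte': 1293836400}}).sort('time', 1):
    pass  # accumulate availability/trend data from each entry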


Thank you for all the work you've done ;)


Regards,

Laurent Guyon
Adelux


